Hey Gang, Kaili,

I think the easiest way to solve this issue is to remove the spec from the site completely and add a reference to the parquet-format repo instead. We should probably add the release tag links, along with a "latest" link, whenever we make a release of parquet-format. This would also avoid potential issues where someone makes decisions based on unreleased spec changes.

Cheers,
Gabor

Kaili Zhang <kaili...@hotmail.com> wrote (on Sat, Jan 13, 2024, 20:53):

> Hi Gang
>
> Thank you for looking into this. Updating the description on
> parquet.apache.org will save everyone searching for this information a
> few hours of head scratching. It is unfortunate that the slightly
> out-of-date spec features more prominently in Google results.
>
> Kind regards
>
> Kaili
> ________________________________
> From: Gang Wu <ust...@gmail.com>
> Sent: Tuesday, January 9, 2024 5:56 PM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Subject: Re: Discrepancy in parquet format documentation
>
> Hi Kaili,
>
> You're right. Please refer to the parquet-format repo for specs. The site
> is unfortunately out of sync for a long time and there isn't any automatic
> process to update it. Let me update the site manually to be in sync with
> the latest format release.
>
> Best,
> Gang
>
> On Sun, Jan 7, 2024 at 8:03 AM Kaili Zhang <kaili...@hotmail.com> wrote:
>
> > Hi all
> >
> > I found this page via Google when searching for a description of the
> > parquet binary format:
> > https://parquet.apache.org/docs/file-format/data-pages/. This page
> > suggests that definition levels are written before repetition levels.
> >
> > However, after experimenting with parquet files generated by pandas and
> > pyarrow and perusing the arrow source code (especially
> > InitializeLevelDecoders in
> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc),
> > I strongly believe that repetition levels are written before definition
> > levels. I also found this other documentation of parquet format that has
> > repetition levels before definition levels
> > https://github.com/apache/parquet-format.
> >
> > The content of the parquet.apache.org/docs site appears to be tracked on
> > Github under https://github.com/apache/parquet-site. Is the documentation
> > content still being actively updated? Has there been an effort to
> > synchronize the format descriptions under apache/parquet-site with those
> > under apache/parquet-format?
> >
> > Kind regards
> >
> > Kaili