Hi Gang,

Thank you for looking into this. Updating the description on parquet.apache.org will save everyone searching for this information a few hours of head scratching. It is unfortunate that the slightly out-of-date spec features more prominently in Google results.
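
For anyone who finds this thread later, the ordering described in the parquet-format repo can be sketched roughly as follows. This is only an illustration, assuming an uncompressed v1 data page whose levels use the RLE/bit-packed hybrid encoding with a 4-byte little-endian length prefix; the function name split_data_page_v1 and its parameters are hypothetical and not part of pyarrow or any other parquet library.

```python
import struct


def split_data_page_v1(payload: bytes, max_rep_level: int, max_def_level: int):
    """Split an uncompressed v1 data page payload into its three sections.

    Hypothetical helper for illustration only. Assumed layout:
    repetition levels, then definition levels, then encoded values.
    A level section is absent when its maximum level is 0.
    """
    offset = 0

    def read_level_block(max_level: int) -> bytes:
        nonlocal offset
        if max_level == 0:
            # Non-nested (rep) or required (def) column: nothing is stored.
            return b""
        # Each level block is prefixed by its byte length (uint32, little endian).
        (length,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        block = payload[offset:offset + length]
        offset += length
        return block

    rep_levels = read_level_block(max_rep_level)  # read first
    def_levels = read_level_block(max_def_level)  # read second
    values = payload[offset:]                     # remainder of the page
    return rep_levels, def_levels, values
```

The returned level blocks are still RLE/bit-packed bytes; decoding them is a separate step that this sketch leaves out.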

Kind regards
Kaili
________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Tuesday, January 9, 2024 5:56 PM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: Re: Discrepancy in parquet format documentation

Hi Kaili,

You're right. Please refer to the parquet-format repo for specs. The site has unfortunately been out of sync for a long time and there isn't any automatic process to update it. Let me update the site manually to be in sync with the latest format release.

Best,
Gang

On Sun, Jan 7, 2024 at 8:03 AM Kaili Zhang <kaili...@hotmail.com> wrote:

> Hi all
>
> I found this page via Google when searching for a description of the
> parquet binary format:
> https://parquet.apache.org/docs/file-format/data-pages/. This page
> suggests that definition levels are written before repetition levels.
>
> However, after experimenting with parquet files generated by pandas and
> pyarrow and perusing the arrow source code (especially
> InitializeLevelDecoders in
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc),
> I strongly believe that repetition levels are written before definition
> levels. I also found this other documentation of parquet format that has
> repetition levels before definition levels
> https://github.com/apache/parquet-format.
>
> The content of the parquet.apache.org/docs site appears to be tracked on
> GitHub under https://github.com/apache/parquet-site. Is the documentation
> content still being actively updated? Has there been an effort to
> synchronize the format descriptions under apache/parquet-site with those
> under apache/parquet-format?
>
> Kind regards
>
> Kaili