Hey Gang, Kaili,

I think the easiest way to solve this issue is to remove the spec from the
site entirely and link to the parquet-format repo instead. Whenever we make
a release of parquet-format, we should probably add a link to the release
tag, along with a "latest" link. This would also avoid potential issues
where someone makes decisions based on unreleased spec changes.

Cheers,
Gabor

On Sat, Jan 13, 2024 at 8:53 PM Kaili Zhang <kaili...@hotmail.com> wrote:

> Hi Gang
>
> Thank you for looking into this. Updating the description on
> parquet.apache.org will save everyone searching for this information a
> few hours of head scratching. It is unfortunate that the slightly
> out-of-date spec features more prominently in Google results.
>
> Kind regards
>
> Kaili
> ________________________________
> From: Gang Wu <ust...@gmail.com>
> Sent: Tuesday, January 9, 2024 5:56 PM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Subject: Re: Discrepancy in parquet format documentation
>
> Hi Kaili,
>
> You're right. Please refer to the parquet-format repo for specs. The site
> has unfortunately been out of sync for a long time, and there isn't any
> automatic process to update it. Let me update the site manually to bring
> it in sync with the latest format release.
>
> Best,
> Gang
>
> On Sun, Jan 7, 2024 at 8:03 AM Kaili Zhang <kaili...@hotmail.com> wrote:
>
> > Hi all
> >
> > I found this page via Google when searching for a description of the
> > parquet binary format:
> > https://parquet.apache.org/docs/file-format/data-pages/. This page
> > suggests that definition levels are written before repetition levels.
> >
> > However, after experimenting with parquet files generated by pandas and
> > pyarrow and perusing the arrow source code (especially
> > InitializeLevelDecoders in
> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc),
> > I strongly believe that repetition levels are written before definition
> > levels. I also found this other documentation of parquet format that has
> > repetition levels before definition levels
> > https://github.com/apache/parquet-format.
> >
> > The content of the parquet.apache.org/docs site appears to be tracked on
> > GitHub under https://github.com/apache/parquet-site. Is the documentation
> > content still being actively updated? Has there been an effort to
> > synchronize the format descriptions under apache/parquet-site with those
> > under apache/parquet-format?
> >
> > Kind regards
> >
> > Kaili
> >
> >
>
