Hi Wes,

As far as I remember, Hive, Spark, Impala, DuckDB, and even proprietary systems like Hyper and Vertica all support reading data page V2 now. Support for the most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall support seems much better than it was a year or two ago.

Best regards,
Adam Lippai

On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:

> I think there is confusion about the Parquet "V2" (including the V2 data
> pages, and other details) and the 2.x.y releases of the format library
> artifact. They aren't the same unfortunately. I don't think the V2 metadata
> structures (the data pages in particular, and new column encoding) is
> widely adopted / readable.
>
> On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
>
> > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > official. They are advising not to use Parquet V2 for writing (though code
> > > is available).*
> >
> > This would be news to me. Parquet releases are listed (by the parquet
> > community) at [1]
> >
> > The vote to release parquet 2.10 is here: [2]
> >
> > Neither of these links mention anything about this being an experimental,
> > unofficial, or non-finalized release.
> >
> > I understand your concern. I believe your quotes are coming from your
> > discussion on the parquet mailing list here [3]. This communication is
> > unfortunate and confusing to me as well.
> >
> > [1] https://parquet.apache.org/blog/
> > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> >
> > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> >
> > > Hello Jacob,
> > > Thanks for the information, and my apologies for the weird format of my
> > > email.
> > >
> > > This is the email from the Parquet community. May I know why pyarrow is
> > > using Parquet V2 which is not official yet?
> > >
> > > My question is from Parquet community V2 is not final yet so it is not
> > > official yet.
> > > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> > > as a standard isn't finalized just yet.
> > > Meaning there is no formal
> > > *finalized* "contract" that specifies what it means to write data in the V2
> > > version. The discussions/conversations about what the final V2 standard may
> > > be are still in progress and are evolving.
> > >
> > > That being said, because V2 code does exist (though unfinalized), there are
> > > clients / tools that are writing data in the un-finalized V2 format, as
> > > seems to be the case with Dremio.
> > >
> > > Now, as that comment you quoted said, you can have Spark write V2 files,
> > > but it's worth being mindful about the fact that V2 is a moving target and
> > > can (and likely will) change. You can overwrite parquet.writer.version to
> > > specify your desired version, but it can be dangerous to produce data in a
> > > moving-target format. For example, let's say you write a bunch of data in
> > > Parquet V2, and then the community decides to make a breaking change (which
> > > is completely fine / allowed since V2 isn't finalized). You are now left
> > > having to deal with a potentially large and complicated file format update.
> > > That's why it's not recommended to write files in parquet v2 just yet."
> > >
> > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > official. They are advising not to use Parquet V2 for writing (though code
> > > is available).*
> > >
> > > *As per above Spark hasn't started using Parquet V2 for writing.*
> > >
> > > May I know how an unstable/unofficial version is being used in pyarrow?
> > >
> > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > First off, please try to clean up formatting of emails to be legible when
> > > > forwarding/quoting previous messages multiple times, especially when most
> > > > of the quotes do not contain any useful information.
> > > > It makes it much
> > > > easier to parse the message and thus quicker to answer.
> > > >
> > > > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > > > the default to enable the usage of features these versions provide. As you
> > > > have correctly quoted from the docs you can still write 1.0 if you want to
> > > > ensure compatibility with systems that cannot process the 'newer' versions
> > > > yet (2.6 was released in 2018!).
> > > >
> > > > You can find the long form discussions about these changes here:
> > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > >
> > > > Best
> > > > Jacob
> > > >
> > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > Hello Team,
> > > > > Could you please share your thoughts about below questions?
> > > > > Sent from my iPhone
> > > > >
> > > > > Begin forwarded message:
> > > > >
> > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > To: dev-ow...@arrow.apache.org
> > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > >
> > > > > > dev@arrow.apache.org
> > > > > > Sent from my iPhone
> > > > > >
> > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > >>>
> > > > > >> Hello Team,
> > > > > >> Could anyone please help me on below query?
> > > > > >> Sent from my iPhone
> > > > > >>
> > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>
> > > > > >>> Sent from my iPhone
> > > > > >>>
> > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>> Hello Team,
> > > > > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > > > > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > > > > >>>>>
> > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > >>>>>
> > > > > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > > > > >>>>> Determine which Parquet logical types are available for use,
> > > > > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > > > > >>>>> expanded logical types added in later format versions. Files
> > > > > >>>>> written with version='2.4' or '2.6' may not be readable in all
> > > > > >>>>> Parquet implementations, so version='1.0' is likely the choice
> > > > > >>>>> that maximizes file compatibility. UINT32 and some logical types
> > > > > >>>>> are only available with version '2.4'. Nanosecond timestamps are
> > > > > >>>>> only available with version '2.6'. Other features such as
> > > > > >>>>> compression algorithms or the new serialized data page format
> > > > > >>>>> must be enabled separately (see 'compression' and
> > > > > >>>>> 'data_page_version').
> > > > > >>>>>
> > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it
> > > > > >>>>> is not official. They are advising not to use Parquet V2 for
> > > > > >>>>> writing (though code is available).
> > > > > >>>>>
> > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > > > > >>>>> May I know how an unstable/unofficial version is being used in
> > > > > >>>>> pyarrow?