Hey everyone,

Thanks for the many responses.

> We should check that parquet-mr implements everything introduced by the
> new parquet-format release.


Good call, and I fully agree. Let's double-check that before starting any
release.

> We should check all ongoing PRs and JIRAs that seem to be targeting the
> next parquet-mr release, and decide if we want to wait for them or not.


I'm happy to do a first pass on that.

> I am currently doing some work related to direct memory. Not all the
> related JIRAs are created yet. I will try to create them and set 1.14.0 as
> the target, and will try to finalize everything by the end of next week.


Thanks. It is not my main area of expertise, but let me know if you need a
review. I don't want to rush the release if there is still ongoing work; I
just wanted to get the ball rolling and collect expectations.

For the new API, I feel like we'll do a 1.15 and then jump to 2.0, which is
also totally fine with me.

For those who can make it, see you at the sync!

Kind regards,
Fokko Driesprong

On Thu, 22 Feb 2024 at 13:47, Steve Loughran
<ste...@cloudera.com.invalid> wrote:

> Apologies for not making any progress; I've been too busy with releases.
>
> This week I am helping Hadoop 3.4.0 out the door. Hopefully we will only
> need one more iteration to get the packaging right (essentially strip out
> as many transitive JARs as we can). My release module does actually build
> Parquet as one stage in the validation, so I'm happy we aren't breaking
> your build.
>
> Moving to 3.3+ would be absolutely wonderful; it has been out for years
> and we have fixed many issues, as well as done our best to move to less
> insecure transitive dependencies. That work is still ongoing; I suspect it
> will be ongoing forever.
>
> Unless you use a release with vector IO (3.3.5+) you'll still need to use
> reflection there.
>
> What you will get as soon as you move to 3.3.0 is the openFile() API,
> which lets you:
> - Explicitly declare the read/seek policy of a file. For Parquet,
> "random" is what you want.
> - Pass in the FileStatus or file length when opening a file. For object
> stores, that can save the overhead of an HTTP HEAD request, as we can
> skip the probe for the existence and length of the file.
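>
> Something like this (an untested sketch; the standard option key below
> landed in 3.3.5, older 3.3.x releases need the s3a-specific
> "fs.s3a.experimental.input.fadvise" option instead):
>
>   FileStatus status = fs.getFileStatus(path); // often already at hand
>   FSDataInputStream in = fs.openFile(path)
>       .opt("fs.option.openfile.read.policy", "random") // declare random IO
>       .withFileStatus(status) // lets object stores skip the HEAD probe
>       .build() // returns a CompletableFuture<FSDataInputStream>
>       .get();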
>
> Random IO is the biggest saving here; the s3a FS tries to guess your read
> policy and switches to random on the first backwards seek, but it isn't
> perfect.
>
> Regarding vectored read APIs: the Hadoop one maps trivially to the Java
> NIO scatter/gather read API, which can deliver great speedups on native
> storage, especially SSD; more from the ability to do parallel block reads
> than anything else. What does that mean? Use the Hadoop raw local FS and
> you get it. It also means that any non-Hadoop Java code should use the NIO
> read API directly.
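>
> A rough sketch of the Hadoop side (assuming the 3.3.5+ API; the offsets
> and lengths here are made up):
>
>   List<FileRange> ranges = Arrays.asList(
>       FileRange.createFileRange(0, 4_096), // e.g. the footer
>       FileRange.createFileRange(1_048_576, 65_536)); // e.g. a column chunk
>   in.readVectored(ranges, ByteBuffer::allocate); // may fetch in parallel
>   for (FileRange range : ranges) {
>     ByteBuffer buf = range.getData().get(); // completes when the range lands
>   }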
>
> Anyway: I do plan to get onto that PR as soon as I get a chance, to:
> - add range overlap detection in the parquet code (a rough sketch of the
> idea is below)
> - make sure all Hadoop filesystems reject overlapping ranges too. s3a
> already does AFAIK, but I want consistency, contract tests, and coverage
> in the specification.
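>
> For the overlap check, something as simple as this should do (a
> hypothetical helper, not what the PR currently contains):
>
>   static void validateDisjoint(List<FileRange> input) {
>     // Sort a copy by offset, then verify each range ends before the next.
>     List<FileRange> sorted = new ArrayList<>(input);
>     sorted.sort(Comparator.comparingLong(FileRange::getOffset));
>     for (int i = 1; i < sorted.size(); i++) {
>       FileRange prev = sorted.get(i - 1);
>       if (prev.getOffset() + prev.getLength() > sorted.get(i).getOffset()) {
>         throw new IllegalArgumentException(
>             "Overlapping ranges: " + prev + " and " + sorted.get(i));
>       }
>     }
>   }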
>
>
> On Wed, 21 Feb 2024 at 15:30, Gang Wu <ust...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks for bringing this up!
> >
> > For the 1.14.0 release, I think it would be good to include some open
> > PRs, e.g. [1].
> >
> > Thanks Gabor for the idea of new APIs! I agree that we need to clean
> > up some misused APIs and remove the Hadoop dependencies. At the same
> > time, I have some concerns. For example, I recently investigated how
> > Apache Spark and Apache Iceberg support vectorized reading of Parquet.
> > I have seen a lot of code duplication between them, even though they
> > have different high-level APIs. If we aim to support a similar
> > vectorized reader based on Arrow vectors, I am not sure these clients
> > will be willing to migrate, due to differences in type systems, the
> > performance of vector conversion, etc. That said, this is worth doing,
> > and we need to collect sufficient feedback from the different
> > communities.
> >
> > [1] https://github.com/apache/parquet-mr/pull/1139
> >
> > Best,
> > Gang
> >
> > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky <ga...@apache.org>
> > wrote:
> >
> > > Thanks for bringing this up, Fokko.
> > > Unfortunately, I won't be able to join next week. (Hopefully I will be
> > > there at the one after.)
> > > So, let me write my thoughts here.
> > >
> > > I agree it is time to start preparing the next parquet-mr release. I
> > > have some thoughts:
> > > - We should check that parquet-mr implements everything introduced by
> > > the new parquet-format release
> > > - We should check all ongoing PRs and JIRAs that seem to be targeting
> > > the next parquet-mr release, and decide if we want to wait for them
> > > or not
> > > - I am currently doing some work related to direct memory. Not all
> > > the related JIRAs are created yet. I will try to create them and set
> > > 1.14.0 as the target, and will try to finalize everything by the end
> > > of next week.
> > >
> > > About parquet-mr 2.0: we need to decide what we expect from it. The
> > > Java upgrade is just one thing, and it can even be done without a
> > > major version (e.g. separate releases for different Java versions).
> > > My original thought about 2.0 was to provide a new API for our
> > > clients:
> > >
> > > - We've had many issues because different API users started using
> > > classes/methods that were originally implemented for internal use
> > > only, like reading the pages directly.
> > > - We need to have different levels of APIs that support all current
> > > use-cases (a rough sketch is below), e.g.:
> > >   - easy-to-use, high-level, row-wise reading/writing
> > >   - vectorized reading/writing, probably with native support of
> > >   Arrow vectors
> > > - We need to get rid of the Hadoop dependencies.
> > > - The goal is to have a well-defined public API that we share with
> > > our clients and hide everything else. It is much easier to keep
> > > backward compatibility for the public API only.
> > > - The new API itself does not need a major release. We can start
> > > working on it in a separate module. We'll need some minor release
> > > cycles to build it. (We'll need our clients' feedback.) What we need
> > > a major release for is (after having finalized the new API) moving
> > > all current public classes to internal modules.
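> > >
> > > Roughly what I mean by the levels (an illustrative sketch only; all
> > > names here are placeholders, nothing is decided yet):
> > >
> > >   // High-level, row-wise reading; no Hadoop types in the signatures.
> > >   public interface ParquetRowReader<T> extends Closeable {
> > >     T read() throws IOException; // null at end of file
> > >   }
> > >
> > >   // Lower-level, vectorized reading; a batch could map to Arrow
> > >   // vectors.
> > >   public interface ParquetBatchReader extends Closeable {
> > >     boolean nextBatch() throws IOException;
> > >     ColumnBatch currentBatch(); // ColumnBatch is a placeholder, too
> > >   }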
> > >
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > >
> > > Fokko Driesprong <fo...@apache.org> wrote (on Wed, 21 Feb 2024 at
> > > 13:04):
> > >
> > > > Hi everyone,
> > > >
> > > > I'm seeing some great progress on the Parquet side, and it has been
> > > > almost a year since I ran the last release, 1.13.1 (May 2023). Are
> > > > there any plans for a 1.14.0 release?
> > > >
> > > > Looking forward, I would like to discuss a Parquet-mr 2.0 release.
> > > >
> > > >    - Looking at other projects in the space, there are more and
> > > >    more that are moving to Java 11+, for example, Spark 4.0 (June
> > > >    2024) and Iceberg 2.0 (the first release after 1.5.0, being
> > > >    voted on right now).
> > > >    - We currently have support for Hadoop 2.x, which is compiled
> > > >    against Java 7. I would suggest dropping everything below 3.3,
> > > >    as that's the minimal version supporting Java 11
> > > >    <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
> > > >    Because some APIs changed, we also have to use reflection, which
> > > >    is not great.
> > > >
> > > > I would also like to thank Xinli for updating the Parquet Sync
> > > > invite. I was there on the 30th of January, but all by myself. The
> > > > next sync, next week Tuesday, would be a great opportunity to go
> > > > over this topic.
> > > >
> > > > Looking forward to your thoughts!
> > > >
> > > > Kind regards,
> > > > Fokko Driesprong
> > > >
> > >
> >
>
