Then, it looks like 1.14.0 is very close. That sounds good to me!
On 2024/04/11 14:59:17 Gang Wu wrote:
> On my end, the only PR waiting for the 1.14 release is [1] and it is
> very close to being merged. As the release process is pretty much the
> same for 1.14.0 and 1.13.2, I'd prefer to expedite the process for
> 1.14.0.
>
> [1] https://github.com/apache/parquet-mr/pull/1139
>
> Best,
> Gang
>
> On Thu, Apr 11, 2024 at 6:40 AM Suresh, Adi <ad...@amazon.com.invalid>
> wrote:
>
> > Hi, https://issues.apache.org/jira/browse/PARQUET-2450 is currently
> > affecting many of our customers.
> > https://github.com/apache/parquet-mr/pull/1300#issuecomment-2046590751
> > will fix the issue. Can 1.14.0 be expedited? Or can we do a 1.13.2
> > patch release to get this fix out faster?
> >
> > Let me know if there's anything on my end that I can do to help.
> >
> > On 2024/02/27 14:42:39 Fokko Driesprong wrote:
> > > Hey everyone,
> > >
> > > Thanks for the many responses.
> > >
> > > > We should check that parquet-mr implements everything introduced
> > > > by the new parquet-format release.
> > >
> > > Good call, and I fully agree with that. Let's double-check that
> > > before starting any releases.
> > >
> > > > We should check on every ongoing PR and Jira that seems to be
> > > > targeting the next parquet-mr release, and decide if we want to
> > > > wait for them or not.
> > >
> > > I'm happy to do a first pass on that.
> > >
> > > > I am currently doing some work related to direct memory. Not all
> > > > the related Jiras are created. Will try to create them and set
> > > > 1.14.0 as target. Will try to finalize everything by the end of
> > > > next week.
> > >
> > > Thanks, it is not my main area of expertise, but let me know if you
> > > need a review. I would not want to rush the release if there is
> > > still ongoing work; I just wanted to get the ball rolling and
> > > collect expectations.
> > > For the new API, I feel like we're doing a 1.15 and then jumping to
> > > 2.0, which is also totally fine with me.
> > >
> > > For those who are there, see you at the sync!
> > >
> > > Kind regards,
> > > Fokko Driesprong
> > >
> > > On Thu, 22 Feb 2024 at 13:47, Steve Loughran
> > > <st...@cloudera.com.invalid> wrote:
> > >
> > > > Apologies for not making any progress - I've been too busy with
> > > > releases.
> > > >
> > > > This week I am helping Hadoop 3.4.0 out the door. Hopefully we
> > > > will only need one more iteration to get the packaging right
> > > > (essentially strip out as many transitive JARs as we can). My
> > > > release module does actually build Parquet as one stage in the
> > > > validation, so I'm happy we aren't breaking your build.
> > > >
> > > > Moving to 3.3+ would be absolutely wonderful; it has been out for
> > > > years and we have fixed many issues as well as done our best to
> > > > move to less insecure transitive dependencies - that is still
> > > > ongoing. It will be ongoing forever, I suspect.
> > > >
> > > > Unless you use a release with vectored IO (3.3.5+) you'll still
> > > > need to use reflection there.
> > > >
> > > > What you will get as soon as you move to 3.3.0 is the openFile()
> > > > API, which lets you:
> > > > - explicitly declare the read/seek policy of a file. For Parquet,
> > > >   "random" is what you want;
> > > > - pass in the FileStatus or file length when opening a file. For
> > > >   object stores, that can save the overhead of an HTTP HEAD
> > > >   request, as we can skip the probe for the existence and length
> > > >   of the file.
> > > >
> > > > Random IO is the biggest saving here; the s3a FS tries to guess
> > > > your read policy and switches to random on the first backwards
> > > > seek, but it isn't perfect.
> > > >
> > > > Regarding vectored read APIs, the Hadoop one maps trivially to
> > > > the java.nio scatter/gather read API.
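[Editor's note: Steve's java.nio point can be shown without Hadoop at all. `FileChannel` implements `ScatteringByteChannel`, so a single `read()` call can fill several buffers in order. A minimal, self-contained sketch; the temp-file contents and buffer sizes are arbitrary choices for illustration.]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ScatterReadExample {
    public static void main(String[] args) throws IOException {
        // Create a small sample file to read back.
        Path file = Files.createTempFile("scatter", ".bin");
        Files.write(file, "0123456789ABCDEF".getBytes());

        // Scatter read: one read() call fills the buffers in sequence.
        ByteBuffer head = ByteBuffer.allocate(4);
        ByteBuffer body = ByteBuffer.allocate(8);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long read = ch.read(new ByteBuffer[] {head, body});
            System.out.println(read);                 // 12 (4 + 8 bytes)
        }
        System.out.println(new String(head.array())); // 0123
        System.out.println(new String(body.array())); // 456789AB
        Files.delete(file);
    }
}
```

Note the difference in shape: the JDK call is a sequential read into multiple buffers, while Hadoop's vectored read takes positioned ranges; the mapping Steve mentions is between those two models.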
> > > > That API can deliver great speed-ups on native storage,
> > > > especially SSD - more from the ability to do parallel block reads
> > > > than anything else. What does that mean? Use the Hadoop raw local
> > > > FS and you get it. It also means that any non-Hadoop Java code
> > > > should use the NIO read API directly.
> > > >
> > > > Anyway: I do plan to get onto that PR as soon as I get a chance.
> > > > - Add range-overlap detection in the Parquet code.
> > > > - Make sure all Hadoop filesystems reject that too. s3a already
> > > >   does AFAIK, but I want consistency, contract tests and coverage
> > > >   in the specification.
> > > >
> > > > On Wed, 21 Feb 2024 at 15:30, Gang Wu <us...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for bringing this up!
> > > > >
> > > > > For the 1.14.0 release, I think it would be good to include
> > > > > some open PRs, e.g. [1].
> > > > >
> > > > > Thanks Gabor for the idea of new APIs! I agree that we need to
> > > > > clean up some misused APIs and remove the Hadoop dependencies.
> > > > > In the meantime, I actually have some concerns. For example, I
> > > > > have recently investigated how Apache Spark and Apache Iceberg
> > > > > support vectorized reading of Parquet. I have seen a lot of
> > > > > code duplication between them, yet they have different
> > > > > high-level APIs. If we aim to support a similar vectorized
> > > > > reader based on Arrow vectors, I am not sure these clients
> > > > > would be willing to migrate, due to differences in type
> > > > > systems, the performance of vector conversion, etc. That said,
> > > > > this is worth doing, and we need to collect sufficient
> > > > > feedback from different communities.
> > > > > [1] https://github.com/apache/parquet-mr/pull/1139
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky
> > > > > <ga...@apache.org> wrote:
> > > > >
> > > > > > Thanks for bringing this up, Fokko.
> > > > > > Unfortunately, I won't be able to join next week. (Hopefully
> > > > > > I will be there at the one after.) So, let me write my
> > > > > > thoughts here.
> > > > > >
> > > > > > I agree it is time to start preparing the next parquet-mr
> > > > > > release. I have some thoughts:
> > > > > > - We should check that parquet-mr implements everything
> > > > > >   introduced by the new parquet-format release.
> > > > > > - We should check on every ongoing PR and Jira that seems to
> > > > > >   be targeting the next parquet-mr release, and decide if we
> > > > > >   want to wait for them or not.
> > > > > > - I am currently doing some work related to direct memory.
> > > > > >   Not all the related Jiras are created. Will try to create
> > > > > >   them and set 1.14.0 as target. Will try to finalize
> > > > > >   everything by the end of next week.
> > > > > >
> > > > > > About parquet-mr 2.0: we need to decide what we expect from
> > > > > > it. The Java upgrade is just one thing, and it could even be
> > > > > > done without a major version (e.g. separate releases for
> > > > > > different Java versions).
> > > > > > My original thought about 2.0 was to provide a new API for
> > > > > > our clients:
> > > > > >
> > > > > > - We've had many issues because different API users started
> > > > > >   using classes/methods that were originally implemented for
> > > > > >   internal use only, like reading the pages directly.
> > > > > >
> > > > > > - We need to have different levels of APIs that support all
> > > > > >   current use-cases,
> > > > > >   e.g.:
> > > > > >   - easy-to-use, high-level row-wise reading/writing;
> > > > > >   - vectorized reading/writing, probably with native support
> > > > > >     of Arrow vectors.
> > > > > >
> > > > > > - We need to get rid of the Hadoop dependencies.
> > > > > >
> > > > > > - The goal is to have a well-defined public API that we
> > > > > >   share with our clients and hide everything else. It is
> > > > > >   much easier to keep backward compatibility for the public
> > > > > >   API only.
> > > > > >
> > > > > > - The new API itself does not need a major release. We can
> > > > > >   start working on it in a separate module. We'll need some
> > > > > >   minor release cycles to build it. (We'll need our clients'
> > > > > >   feedback.) What we need a major release for is (after
> > > > > >   having finalized the new API) moving all current public
> > > > > >   classes to internal modules.
> > > > > >
> > > > > > Cheers,
> > > > > > Gabor
> > > > > >
> > > > > > On Wed, Feb 21, 2024 at 13:04, Fokko Driesprong
> > > > > > <fo...@apache.org> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I'm seeing some great progress on the Parquet side, and it
> > > > > > > was almost one year ago that I ran the last 1.13.1 release
> > > > > > > (May 2023). Are there any considerations for doing a
> > > > > > > 1.14.0 release?
> > > > > > >
> > > > > > > Looking forward, I would like to discuss a parquet-mr 2.0
> > > > > > > release.
> > > > > > >
> > > > > > > - Looking at other projects in the space, more and more
> > > > > > >   are moving to Java 11+, for example Spark 4.0 (June
> > > > > > >   2024) and Iceberg 2.0 (the first release after 1.5.0,
> > > > > > >   which is being voted on right now).
> > > > > > > - We currently have support for Hadoop 2.x, which is
> > > > > > >   compiled against Java 7.
> > > > > > >   I would suggest dropping everything below 3.3, as
> > > > > > >   that's the minimal version supporting Java 11:
> > > > > > >   https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
> > > > > > >   Because some APIs changed, we also have to use
> > > > > > >   reflection, which is not great.
> > > > > > >
> > > > > > > I would also like to thank Xinli for updating the Parquet
> > > > > > > sync invite. I was there on the 30th of January, but all
> > > > > > > by myself. The next sync, next week Tuesday, would be a
> > > > > > > great opportunity to go over this topic.
> > > > > > >
> > > > > > > Looking forward to your thoughts!
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Fokko Driesprong
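[Editor's note: Steve's description of the openFile() builder earlier in the thread might look roughly like the sketch below. This is an illustration, not code from the thread: it assumes hadoop-common 3.3.5+ on the classpath (the standard fs.option.openfile.* option keys landed in 3.3.5; earlier 3.3.x releases use filesystem-specific keys), and the path and footer read are made-up examples.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]); // e.g. s3a://bucket/table/part-0.parquet
        FileSystem fs = path.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(path);
        // Declare the seek policy up front and hand over the known file
        // status, so an object store like s3a can skip its HEAD probe
        // for existence and length.
        try (FSDataInputStream in = fs.openFile(path)
                .opt("fs.option.openfile.read.policy", "random")
                .withFileStatus(status)
                .build()
                .get()) {
            byte[] tail = new byte[8];
            in.readFully(status.getLen() - 8, tail); // e.g. Parquet footer tail
        }
    }
}
```

With the "random" policy declared, s3a does not have to guess the access pattern from the first backwards seek, which is the imperfect fallback Steve mentions.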