Then, it looks like 1.14.0 is very close. That sounds good to me!
On 2024/04/11 14:59:17 Gang Wu wrote:
> On my end, the only PR waiting for the 1.14 release is [1] and it is
> very close to being merged. As the release process is pretty much the
> same for 1.14.0 and 1.13.2, I'd prefer to expedite the process for
> 1.14.0.
>
> [1] https://github.com/apache/parquet-mr/pull/1139
>
> Best,
> Gang
>
> On Thu, Apr 11, 2024 at 6:40 AM Suresh, Adi <ad...@amazon.com.invalid>
> wrote:
>
> > Hi, https://issues.apache.org/jira/browse/PARQUET-2450 is currently
> > affecting many of our customers.
> > https://github.com/apache/parquet-mr/pull/1300#issuecomment-2046590751
> > will fix the issue. Can 1.14.0 be expedited? Or can we do a 1.13.2
> > patch release to get this fix out faster?
> >
> > Let me know if there's anything on my end that I can do to help.
> >
> > On 2024/02/27 14:42:39 Fokko Driesprong wrote:
> > > Hey everyone,
> > >
> > > Thanks for the many responses.
> > >
> > > > We should check that parquet-mr implements everything introduced
> > > > by the new parquet-format release.
> > >
> > > Good call, and I fully agree with that. Let's double-check that
> > > before starting any releases.
> > >
> > > > We should check on every ongoing PR and Jira that seems to be
> > > > targeting the next parquet-mr release, and decide if we want to
> > > > wait for them or not.
> > >
> > > I'm happy to do a first pass on that.
> > >
> > > > I am currently doing some work related to direct memory. Not all
> > > > the related Jiras are created. Will try to create them and set
> > > > 1.14.0 as target. Will try to finalize everything by the end of
> > > > next week.
> > >
> > > Thanks, it is not my main area of expertise, but let me know if you
> > > need a review. I would not want to rush the release if there is
> > > still ongoing work; I just wanted to get the ball rolling and
> > > collect expectations.
> > > For the new API, I feel like we're doing a 1.15 and then jumping to
> > > 2.0, which is also totally fine with me.
> > >
> > > For those who are there, see you at the sync!
> > >
> > > Kind regards,
> > > Fokko Driesprong
> > >
> > > On Thu, 22 Feb 2024 at 13:47, Steve Loughran
> > > <st...@cloudera.com.invalid> wrote:
> > >
> > > > Apologies for not making any progress - I've been too busy with
> > > > releases.
> > > >
> > > > This week I am helping Hadoop 3.4.0 out the door. Hopefully we
> > > > will only need one more iteration to get the packaging right
> > > > (essentially strip out as many transitive JARs as we can). My
> > > > release module does actually build Parquet as one stage in the
> > > > validation, so I'm happy we aren't breaking your build.
> > > >
> > > > Moving to 3.3+ would be absolutely wonderful; it has been out for
> > > > years and we have fixed many issues as well as done our best to
> > > > move to less insecure transitive dependencies - that is still
> > > > ongoing. It will be ongoing forever, I suspect.
> > > >
> > > > Unless you use a release with vectored IO (3.3.5+) you'll still
> > > > need to use reflection there.
> > > >
> > > > What you will get as soon as you move to 3.3.0 is the openFile()
> > > > API, which lets you:
> > > > - explicitly declare the read/seek policy of a file. For Parquet,
> > > >   "random" is what you want;
> > > > - pass in the FileStatus or file length when opening a file. For
> > > >   object stores, that can save the overhead of an HTTP HEAD
> > > >   request, as we can skip the probe for the existence and length
> > > >   of the file.
> > > >
> > > > Random IO is the biggest saving here; the s3a FS tries to guess
> > > > your read policy and switches to random on the first backwards
> > > > seek, but it isn't perfect.
> > > >
> > > > Regarding vectored read APIs, the Hadoop one maps trivially to
> > > > the java.nio scatter/gather read API.
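[Editor's note: Steve's java.nio point can be shown without Hadoop at all. `FileChannel` implements `ScatteringByteChannel`, so a single `read()` call can fill several buffers in order. A minimal, self-contained sketch; the temp-file contents and buffer sizes are arbitrary choices for illustration.]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ScatterReadExample {
    public static void main(String[] args) throws IOException {
        // Create a small sample file to read back.
        Path file = Files.createTempFile("scatter", ".bin");
        Files.write(file, "0123456789ABCDEF".getBytes());

        // Scatter read: one read() call fills the buffers in sequence.
        ByteBuffer head = ByteBuffer.allocate(4);
        ByteBuffer body = ByteBuffer.allocate(8);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long read = ch.read(new ByteBuffer[] {head, body});
            System.out.println(read);                 // 12 (4 + 8 bytes)
        }
        System.out.println(new String(head.array())); // 0123
        System.out.println(new String(body.array())); // 456789AB
        Files.delete(file);
    }
}
```

Note the difference in shape: the JDK call is a sequential read into multiple buffers, while Hadoop's vectored read takes positioned ranges; the mapping Steve mentions is between those two models.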
> > > > That API can deliver great speed-ups on native storage,
> > > > especially SSD - more from the ability to do parallel block reads
> > > > than anything else. What does that mean? Use the Hadoop raw local
> > > > FS and you get it. It also means that any non-Hadoop Java code
> > > > should use the NIO read API directly.
> > > >
> > > > Anyway: I do plan to get onto that PR as soon as I get a chance.
> > > > - Add range-overlap detection in the Parquet code.
> > > > - Make sure all Hadoop filesystems reject that too. s3a already
> > > >   does AFAIK, but I want consistency, contract tests and coverage
> > > >   in the specification.
> > > >
> > > > On Wed, 21 Feb 2024 at 15:30, Gang Wu <us...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for bringing this up!
> > > > >
> > > > > For the 1.14.0 release, I think it would be good to include
> > > > > some open PRs, e.g. [1].
> > > > >
> > > > > Thanks Gabor for the idea of new APIs! I agree that we need to
> > > > > clean up some misused APIs and remove the Hadoop dependencies.
> > > > > In the meantime, I actually have some concerns. For example, I
> > > > > have recently investigated how Apache Spark and Apache Iceberg
> > > > > support vectorized reading of Parquet. I have seen a lot of
> > > > > code duplication between them, yet they have different
> > > > > high-level APIs. If we aim to support a similar vectorized
> > > > > reader based on Arrow vectors, I am not sure these clients
> > > > > would be willing to migrate, due to differences in type
> > > > > systems, the performance of vector conversion, etc. That said,
> > > > > this is worth doing, and we need to collect sufficient
> > > > > feedback from different communities.
> > > > > [1] https://github.com/apache/parquet-mr/pull/1139
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky
> > > > > <ga...@apache.org> wrote:
> > > > >
> > > > > > Thanks for bringing this up, Fokko.
> > > > > > Unfortunately, I won't be able to join next week. (Hopefully
> > > > > > I will be there at the one after.) So, let me write my
> > > > > > thoughts here.
> > > > > >
> > > > > > I agree it is time to start preparing the next parquet-mr
> > > > > > release. I have some thoughts:
> > > > > > - We should check that parquet-mr implements everything
> > > > > >   introduced by the new parquet-format release.
> > > > > > - We should check on every ongoing PR and Jira that seems to
> > > > > >   be targeting the next parquet-mr release, and decide if we
> > > > > >   want to wait for them or not.
> > > > > > - I am currently doing some work related to direct memory.
> > > > > >   Not all the related Jiras are created. Will try to create
> > > > > >   them and set 1.14.0 as target. Will try to finalize
> > > > > >   everything by the end of next week.
> > > > > >
> > > > > > About parquet-mr 2.0: we need to decide what we expect from
> > > > > > it. The Java upgrade is just one thing, and it could even be
> > > > > > done without a major version (e.g. separate releases for
> > > > > > different Java versions).
> > > > > > My original thought about 2.0 was to provide a new API for
> > > > > > our clients:
> > > > > >
> > > > > > - We've had many issues because different API users started
> > > > > >   using classes/methods that were originally implemented for
> > > > > >   internal use only, like reading the pages directly.
> > > > > >
> > > > > > - We need to have different levels of APIs that support all
> > > > > >   current use-cases,
> > > > > >   e.g.:
> > > > > >   - easy-to-use, high-level row-wise reading/writing;
> > > > > >   - vectorized reading/writing, probably with native support
> > > > > >     of Arrow vectors.
> > > > > >
> > > > > > - We need to get rid of the Hadoop dependencies.
> > > > > >
> > > > > > - The goal is to have a well-defined public API that we
> > > > > >   share with our clients and hide everything else. It is
> > > > > >   much easier to keep backward compatibility for the public
> > > > > >   API only.
> > > > > >
> > > > > > - The new API itself does not need a major release. We can
> > > > > >   start working on it in a separate module. We'll need some
> > > > > >   minor release cycles to build it. (We'll need our clients'
> > > > > >   feedback.) What we need a major release for is (after
> > > > > >   having finalized the new API) moving all current public
> > > > > >   classes to internal modules.
> > > > > >
> > > > > > Cheers,
> > > > > > Gabor
> > > > > >
> > > > > > On Wed, Feb 21, 2024 at 13:04, Fokko Driesprong
> > > > > > <fo...@apache.org> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I'm seeing some great progress on the Parquet side, and it
> > > > > > > was almost one year ago that I ran the last 1.13.1 release
> > > > > > > (May 2023). Are there any considerations for doing a
> > > > > > > 1.14.0 release?
> > > > > > >
> > > > > > > Looking forward, I would like to discuss a parquet-mr 2.0
> > > > > > > release.
> > > > > > >
> > > > > > > - Looking at other projects in the space, more and more
> > > > > > >   are moving to Java 11+, for example Spark 4.0 (June
> > > > > > >   2024) and Iceberg 2.0 (the first release after 1.5.0,
> > > > > > >   which is being voted on right now).
> > > > > > > - We currently have support for Hadoop 2.x, which is
> > > > > > >   compiled against Java 7.
> > > > > > >   I would suggest dropping everything below 3.3, as
> > > > > > >   that's the minimal version supporting Java 11:
> > > > > > >   https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
> > > > > > >   Because some APIs changed, we also have to use
> > > > > > >   reflection, which is not great.
> > > > > > >
> > > > > > > I would also like to thank Xinli for updating the Parquet
> > > > > > > sync invite. I was there on the 30th of January, but all
> > > > > > > by myself. The next sync, next week Tuesday, would be a
> > > > > > > great opportunity to go over this topic.
> > > > > > >
> > > > > > > Looking forward to your thoughts!
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Fokko Driesprong
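[Editor's note: Steve's description of the openFile() builder earlier in the thread might look roughly like the sketch below. This is an illustration, not code from the thread: it assumes hadoop-common 3.3.5+ on the classpath (the standard fs.option.openfile.* option keys landed in 3.3.5; earlier 3.3.x releases use filesystem-specific keys), and the path and footer read are made-up examples.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]); // e.g. s3a://bucket/table/part-0.parquet
        FileSystem fs = path.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(path);
        // Declare the seek policy up front and hand over the known file
        // status, so an object store like s3a can skip its HEAD probe
        // for existence and length.
        try (FSDataInputStream in = fs.openFile(path)
                .opt("fs.option.openfile.read.policy", "random")
                .withFileStatus(status)
                .build()
                .get()) {
            byte[] tail = new byte[8];
            in.readFully(status.getLen() - 8, tail); // e.g. Parquet footer tail
        }
    }
}
```

With the "random" policy declared, s3a does not have to guess the access pattern from the first backwards seek, which is the imperfect fallback Steve mentions.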