Hi David,

As Gang mentioned, there is an ongoing effort to remove as much of the Hadoop dependency as possible without breaking backward compatibility. This means that you will hopefully be able to drop the hadoop-client-runtime dependency when using the read/write API once that is done. Changes that allow dropping hadoop-client-api would sadly be breaking backward compatibility for now.

The master branch currently includes a patch[1] that allows you to avoid loading Hadoop's Path class. This means you will not have to worry about the compatibility issues Hadoop faces on Windows systems (meaning you will not need winutils.exe) in the future. AFAIK this change will be part of the next minor release, though in the meantime you can build from master or copy the implementations yourself as well.

Given the current level of activity I do not think Parquet MR 2.0 is feasible anytime soon, but the issues you mentioned have been recognised and we are trying to mitigate their effects as much as possible without breaking backward compatibility within the current Parquet MR 1.X.X framework.

[1] https://github.com/apache/parquet-mr/pull/1111

All the best,
Atour

________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Monday, September 25, 2023 4:12 AM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: Re: Parquet-MR 2.0?

Hi David,

There is already a mailing list discussion [1] and a JIRA issue [2]. Please take a look and let me know what you think. There is also an open PR [3] which may interest you.

[1] https://lists.apache.org/thread/d33757j99xqn63hrfz415sq60v3x9hmy
[2] https://issues.apache.org/jira/browse/PARQUET-1822
[3] https://github.com/apache/parquet-mr/pull/1141

Best,
Gang

On Mon, Sep 25, 2023 at 9:49 AM David <dam6...@gmail.com> wrote:

> Hello Folks,
>
> Probably a repeat, so my apologies in advance.
>
> Is there any appetite for a Parquet 2.0?
>
> In my mind, the greatest need is to cut the dependency on Hadoop and
> simply allow the Parquet file format to exist on its own.
>
> I was recently considering a project by which a light-weight stand-alone
> application can exist that reads Iceberg Tables (Parquet) data. My use
> case includes a lot of readers on slow-moving data. Essentially a mini
> HBase-like client that can read data either from S3 or a local file system.
>
> Anyway, I started putting together a quick PoC and forgot that I needed to
> carry with me so very many Hadoop JARs (and their dependencies). I also
> hit a snag trying to test on a Windows work laptop because the Hadoop file
> IO libraries require some sort of specialized binary support shims.
>
> So, the main goal of version 2 would be to develop the Parquet library as a
> stand-alone pure Java framework, and the other packages (e.g., hadoop,
> protobuf, etc.) would be offered as additional extensions.
>
> So the package structure would be something like:
>
> - parquet-api (InputSource, ParquetReader, ParquetWriter, etc.)
> - parquet-core (the actual parquet framework)
> - parquet-hadoop (e.g., simple InputSource implementation, Splitters, etc.)
>
> Thanks.
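
To illustrate the parquet-api / parquet-core split proposed in the quoted message, here is a minimal Java sketch. It is not Parquet-MR code: the names InputSourceSketch, InputSource, LocalInputSource and hasParquetMagic are hypothetical and chosen only for this example. It assumes a local file and relies solely on java.nio, so no Hadoop classes (and no winutils.exe) are involved. The only Parquet-specific fact it uses is that a file with a plaintext footer starts and ends with the 4-byte magic "PAR1".

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class InputSourceSketch {

    // Hypothetical "parquet-api" abstraction: what a reader needs from storage.
    interface InputSource {
        long length() throws IOException;
        SeekableByteChannel open() throws IOException;
    }

    // Hypothetical pure-Java implementation for the local file system.
    // An S3 or Hadoop-backed variant would live in a separate extension module.
    static final class LocalInputSource implements InputSource {
        private final Path path;

        LocalInputSource(Path path) {
            this.path = path;
        }

        @Override
        public long length() throws IOException {
            return Files.size(path);
        }

        @Override
        public SeekableByteChannel open() throws IOException {
            return Files.newByteChannel(path, StandardOpenOption.READ);
        }
    }

    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Checks only the leading and trailing magic bytes; it does not parse the footer.
    static boolean hasParquetMagic(InputSource source) throws IOException {
        long len = source.length();
        if (len < 2L * MAGIC.length) {
            return false;
        }
        try (SeekableByteChannel ch = source.open()) {
            return Arrays.equals(readAt(ch, 0), MAGIC)
                && Arrays.equals(readAt(ch, len - MAGIC.length), MAGIC);
        }
    }

    // Reads exactly MAGIC.length bytes starting at the given position.
    private static byte[] readAt(SeekableByteChannel ch, long pos) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(MAGIC.length);
        ch.position(pos);
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new EOFException("Unexpected end of file at position " + ch.position());
            }
        }
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in file: magic bytes around a dummy body, not a real Parquet file.
        Path tmp = Files.createTempFile("sketch", ".parquet");
        try {
            Files.write(tmp, "PAR1....PAR1".getBytes(StandardCharsets.US_ASCII));
            System.out.println(hasParquetMagic(new LocalInputSource(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the design is that code written against the interface never sees a storage-specific type, so a Hadoop-based implementation could be shipped as an optional extension rather than a mandatory dependency.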