Hi David,

There is already a mailing list discussion [1] and a JIRA issue [2]. Please take a look and let me know what you think. There is also an open PR [3] which may interest you.

[1] https://lists.apache.org/thread/d33757j99xqn63hrfz415sq60v3x9hmy
[2] https://issues.apache.org/jira/browse/PARQUET-1822
[3] https://github.com/apache/parquet-mr/pull/1141

Best,
Gang

On Mon, Sep 25, 2023 at 9:49 AM David <[email protected]> wrote:

> Hello Folks,
>
> Probably a repeat, so my apologies in advance.
>
> Is there any appetite for a Parquet 2.0?
>
> In my mind, the greatest need is to cut the dependency on Hadoop and
> simply allow the Parquet file format to exist on its own.
>
> I was recently considering a project in which a light-weight stand-alone
> application reads Iceberg Tables (Parquet) data. My use
> case includes a lot of readers on slow-moving data. Essentially a mini
> HBase-like client that can read data either from S3 or a local file system.
>
> Anyway, I started putting together a quick PoC and forgot that I needed to
> carry with me so very many Hadoop JARs (and their dependencies). I also
> hit a snag trying to test on a Windows work laptop because the Hadoop file
> IO libraries require some sort of specialized binary support shims.
>
> So, the main goal of version 2 would be to develop the Parquet library as a
> stand-alone pure Java framework, and the other packages (e.g., hadoop,
> protobuf, etc.) would be offered as additional extensions.
>
> So the package structure would be something like:
>
> - parquet-api (InputSource, ParquetReader, ParquetWriter, etc.)
> - parquet-core (the actual parquet framework)
> - parquet-hadoop (e.g., simple InputSource implementation, Splitters, etc.)
>
> Thanks.
>
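To make the proposed split concrete, here is a minimal sketch of what a Hadoop-free parquet-api abstraction might look like. The InputSource name comes from the proposal above, but its methods, the LocalFileInputSource class, and the ParquetMagic helper are all hypothetical names invented for illustration; none of this is the parquet-mr API. The only format fact relied on is that a Parquet file begins and ends with the 4-byte magic "PAR1".

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical parquet-api abstraction: random-access bytes, no Hadoop types. */
interface InputSource {
    /** Total size of the underlying data in bytes. */
    long length() throws IOException;

    /** Reads exactly len bytes starting at the given absolute position. */
    byte[] read(long position, int len) throws IOException;
}

/** Hypothetical local-file implementation that depends only on java.nio. */
final class LocalFileInputSource implements InputSource {
    private final Path path;

    LocalFileInputSource(Path path) {
        this.path = path;
    }

    @Override
    public long length() throws IOException {
        return Files.size(path);
    }

    @Override
    public byte[] read(long position, int len) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(len);
            long pos = position;
            // A positional read may return fewer bytes than requested, so loop.
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, pos);
                if (n < 0) {
                    throw new EOFException("Unexpected end of file: " + path);
                }
                pos += n;
            }
            return buffer.array();
        }
    }
}

/** Hypothetical helper showing a format-level check written against InputSource only. */
final class ParquetMagic {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    private ParquetMagic() {}

    /** Returns true if the source both starts and ends with the "PAR1" magic. */
    static boolean hasParquetMagic(InputSource in) throws IOException {
        long len = in.length();
        if (len < 2L * MAGIC.length) {
            return false;
        }
        return Arrays.equals(in.read(0, MAGIC.length), MAGIC)
            && Arrays.equals(in.read(len - MAGIC.length, MAGIC.length), MAGIC);
    }
}
```

Under this layout, an S3-backed or Hadoop-backed source would be just another InputSource implementation shipped in its own extension module, so a core reader written against the interface would not pull in Hadoop JARs.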
