Hello Folks,

Probably a repeat, so my apologies in advance.

Is there any appetite for a Parquet 2.0?

In my mind, the greatest need is to cut the dependency on Hadoop and allow
the Parquet file format to exist on its own.

I was recently considering a project in which a light-weight, stand-alone
application reads Iceberg table (Parquet) data.  My use case includes a
lot of readers on slow-moving data.  Essentially a mini HBase-like client
that can read data either from S3 or a local file system.

Anyway, I started putting together a quick PoC and had forgotten that I
needed to carry so very many Hadoop JARs (and their dependencies) with me.
I also hit a snag trying to test on a Windows work laptop, because the
Hadoop file IO libraries require some sort of specialized binary support
shims (winutils).

So, the main goal of version 2 would be to develop the Parquet library as
a stand-alone, pure-Java framework, with the other packages (e.g., Hadoop,
Protobuf) offered as additional extensions.

So the package structure would be something like:

- parquet-api (InputSource, ParquetReader, ParquetWriter, etc)
- parquet-core (the actual parquet framework)
- parquet-hadoop (e.g., Simple InputSource Implementation, Splitters, etc.)
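To make the idea concrete, here is a rough sketch of what the parquet-api
abstraction might look like. The names InputSource and ParquetReader come
from the list above, but every method signature here is an assumption on my
part, not an existing API; the point is only that core I/O can be expressed
with plain java.io/java.nio types so that Hadoop's FileSystem becomes one
optional implementation among several.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a Hadoop-free parquet-api. Signatures are
// illustrative assumptions, not a real or proposed final interface.
public class ParquetApiSketch {

    /** Abstracts byte access so parquet-core never touches Hadoop classes. */
    interface InputSource {
        long length() throws IOException;
        InputStream open() throws IOException;
    }

    /** Local-filesystem implementation; an S3 or HDFS variant would live
        in a separate extension module (e.g., parquet-hadoop). */
    static final class LocalFileInputSource implements InputSource {
        private final Path path;

        LocalFileInputSource(Path path) {
            this.path = path;
        }

        @Override
        public long length() throws IOException {
            return Files.size(path);
        }

        @Override
        public InputStream open() throws IOException {
            return Files.newInputStream(path);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny stand-in file (just the Parquet magic bytes).
        Path tmp = Files.createTempFile("demo", ".parquet");
        Files.write(tmp, new byte[] {'P', 'A', 'R', '1'});

        InputSource src = new LocalFileInputSource(tmp);
        System.out.println("length=" + src.length());
    }
}
```

A ParquetReader in parquet-api would then accept any InputSource, and only
the Hadoop extension would know about Configuration, Path, and FileSystem.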

Thanks.
