Re: Parquet-MR 2.0?

Atour Mousavi Gourabi Mon, 25 Sep 2023 01:15:10 -0700

Hi David,

As Gang mentioned, there is an ongoing effort to remove as much of the Hadoop 
dependency as possible without breaking backward compatibility. This means that 
you will hopefully be able to drop the hadoop-client-runtime dependency when 
using the read/write API once that is done. Changes that allow dropping 
hadoop-client-api would sadly be breaking backward compatibility for now.
The master branch currently includes a patch[1] that allows you to avoid 
loading Hadoop's Path class. This means you will not have to worry about the 
compatibility issues Hadoop faces on Windows systems (meaning you will not need 
winutils.exe) in the future. AFAIK this change will be part of the next minor 
release, though in the meantime you can build from master or copy the 
implementations yourself as well.
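
For illustration only, here is a minimal sketch of Hadoop-free local file access using nothing but java.nio. The class and method names (LocalInputSketch, readAt, hasParquetMagic) are hypothetical and are not the API introduced by the patch. The sketch only checks the "PAR1" magic bytes that unencrypted Parquet files carry at both ends, and it does not parse any Parquet metadata.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical sketch: none of these names are Parquet-MR APIs.
final class LocalInputSketch {

    // Unencrypted Parquet files start and end with the 4-byte magic "PAR1".
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Reads exactly {@code length} bytes starting at {@code position}, using java.nio only. */
    static byte[] readAt(Path file, long position, int length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            channel.position(position);
            ByteBuffer buffer = ByteBuffer.allocate(length);
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) {
                    throw new IOException("Unexpected end of file: " + file);
                }
            }
            return buffer.array();
        }
    }

    /** Returns true if the file has the "PAR1" magic at both its head and its tail. */
    static boolean hasParquetMagic(Path file) throws IOException {
        long size = Files.size(file);
        if (size < 2L * MAGIC.length) {
            return false; // too small to hold both magic markers
        }
        return Arrays.equals(readAt(file, 0, MAGIC.length), MAGIC)
            && Arrays.equals(readAt(file, size - MAGIC.length, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LocalInputSketch.java <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + " has Parquet magic bytes: " + hasParquetMagic(file));
    }
}
```

Because only java.nio.file.Path is involved, nothing here touches Hadoop's Path class or needs winutils.exe, which is the property the patch is aiming for on the read/write API.
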
Given the current level of activity I do not think Parquet MR 2.0 is feasible 
anytime soon, but the issues you mentioned have been recognised and we are 
trying to mitigate their effects as much as possible without breaking backward 
compatibility within the current Parquet MR 1.X.X framework.


[1] https://github.com/apache/parquet-mr/pull/1111

All the best,
Atour
________________________________
From: Gang Wu <[email protected]>
Sent: Monday, September 25, 2023 4:12 AM
To: [email protected] <[email protected]>
Subject: Re: Parquet-MR 2.0?

Hi David,

There is already a mailing list discussion [1] and a JIRA issue [2]. Please
take a look and let me know what you think. There is also an open PR [3]
which may interest you.

[1] https://lists.apache.org/thread/d33757j99xqn63hrfz415sq60v3x9hmy
[2] https://issues.apache.org/jira/browse/PARQUET-1822
[3] https://github.com/apache/parquet-mr/pull/1141

Best,
Gang

On Mon, Sep 25, 2023 at 9:49 AM David <[email protected]> wrote:

> Hello Folks,
>
> Probably a repeat, so my apologies in advance.
>
> Is there any appetite for a Parquet 2.0?
>
> In my mind, the greatest need is to cut the dependency on Hadoop and allow
> the Parquet file format to exist on its own.
>
> I was recently considering a project by which a light-weight stand-alone
> application can exist that reads Iceberg Tables (Parquet) data.  My use
> case includes a lot of readers on slow-moving data.  Essentially a mini
> HBase-like client that can read data either from S3 or a local file system.
>
> Anyway, I started putting together a quick PoC and forgot that I needed to
> carry with me so very many Hadoop JARs (and their dependencies).  I also
> hit a snag trying to test on a Windows work laptop because the Hadoop file
> IO libraries require some sort of specialized binary support shims.
>
> So, the main goal of version 2 would be to develop the Parquet library as a
> stand-alone pure Java framework, and the other packages (e.g., hadoop,
> protobuf, etc.) would be offered as additional extensions.
>
> So the package structure would be something like:
>
> - parquet-api (InputSource, ParquetReader, ParquetWriter, etc)
> - parquet-core (the actual parquet framework)
> - parquet-hadoop (e.g., Simple InputSource Implementation, Splitters, etc.)
>
> Thanks.
>
