Hi Gang,

The breaking changes are a valid concern, so I agree we should consult with 
downstream communities before releasing any.
Right now, we already make limited use of the interfaces you describe (for 
the filesystem). These let users read and write Parquet without installing 
Hadoop on their system, though only in a slightly convoluted way, and they 
still need to package the Hadoop dependencies. I think we should support 
this out of the box by providing the implementations users would need. I can 
have a PR open for this quickly if you agree we should support it.
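To make the idea concrete, here is a minimal sketch of the kind of implementations I mean: local-filesystem versions of parquet's file abstractions built on java.nio, with no Hadoop FileSystem involved. The interfaces below are simplified stand-ins for org.apache.parquet.io.InputFile and OutputFile (the real ones return SeekableInputStream/PositionOutputStream and take block-size hints), and class names like LocalInputFile are hypothetical, not an agreed API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Simplified stand-ins for org.apache.parquet.io.InputFile / OutputFile.
interface InputFile {
  long getLength() throws IOException;
  InputStream newStream() throws IOException; // real API returns SeekableInputStream
}

interface OutputFile {
  OutputStream create() throws IOException; // real API returns PositionOutputStream
}

// Hypothetical java.nio-backed implementations; no Hadoop on the classpath.
class LocalInputFile implements InputFile {
  private final Path path;
  LocalInputFile(Path path) { this.path = path; }
  public long getLength() throws IOException { return Files.size(path); }
  public InputStream newStream() throws IOException {
    SeekableByteChannel ch = Files.newByteChannel(path, StandardOpenOption.READ);
    return Channels.newInputStream(ch);
  }
}

class LocalOutputFile implements OutputFile {
  private final Path path;
  LocalOutputFile(Path path) { this.path = path; }
  public OutputStream create() throws IOException {
    return Files.newOutputStream(path,
        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
  }
}

public class LocalFileDemo {
  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("demo", ".bin");
    Files.delete(tmp); // CREATE_NEW requires the file not to exist yet
    try (OutputStream os = new LocalOutputFile(tmp).create()) {
      os.write(new byte[] {1, 2, 3});
    }
    System.out.println(new LocalInputFile(tmp).getLength()); // prints 3
    Files.delete(tmp);
  }
}
```

A reader or writer that accepts these interfaces instead of a Hadoop Path could then work against the local filesystem directly.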
As for not packaging hadoop-client-runtime, we would first need to include 
the implementations described above, and then introduce some abstraction over 
at least the Hadoop Configuration. I think this should be feasible to 
implement in a non-breaking way, though I cannot give you a timeline.
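For the Configuration part, I am picturing something along these lines: a small read/write interface that parquet code would program against, with one plain Hadoop-free implementation and, separately, one wrapping a real org.apache.hadoop.conf.Configuration for users who still package hadoop-client-runtime. All names here (ReadWriteConfig, PlainConfig) are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over org.apache.hadoop.conf.Configuration;
// the interface and class names are illustrative, not an agreed API.
interface ReadWriteConfig {
  String get(String key);
  String get(String key, String defaultValue);
  void set(String key, String value);
}

// A plain, Hadoop-free implementation backed by a Map. A second
// implementation could delegate to a real Hadoop Configuration.
class PlainConfig implements ReadWriteConfig {
  private final Map<String, String> values = new HashMap<>();
  public String get(String key) { return values.get(key); }
  public String get(String key, String defaultValue) {
    return values.getOrDefault(key, defaultValue);
  }
  public void set(String key, String value) { values.put(key, value); }
}

public class ConfigDemo {
  public static void main(String[] args) {
    ReadWriteConfig conf = new PlainConfig();
    conf.set("parquet.compression", "SNAPPY");
    System.out.println(conf.get("parquet.compression")); // prints SNAPPY
    System.out.println(conf.get("parquet.page.size", "1048576")); // prints the default
  }
}
```

Existing APIs that take a Hadoop Configuration could keep working unchanged by wrapping it in such an adapter internally, which is why I believe this can be done in a non-breaking way.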

Best regards,
Atour
________________________________
From: Gang Wu <[email protected]>
Sent: Saturday, June 10, 2023 12:20 PM
To: [email protected] <[email protected]>
Subject: Re: Parquet without Hadoop dependencies

My main concern about breaking changes is the effort it would take for
downstream projects to adopt the new parquet release. We need to hear more
voices from those communities to reach a consensus on whether breaking
changes are acceptable.

I just took a glance at the hadoop dependencies; it seems the major ones are
used for configuration, filesystem, and codec support. Could we introduce a
layer of interfaces for them and make those hadoop classes concrete
implementations? I think this is the first step to split the core features
of parquet from hadoop.

Back to the hadoop-client-api proposal, my intention is to support basic
parquet features with only hadoop-client-api pulled into the dependencies,
and full features with hadoop-client-runtime pulled in as well. Is that
possible?

On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]>
wrote:

> Hi Gang,
>
> I don't think it's feasible to make a new module for it this way as a lot
> of the support for this part of the code (codecs, etc.) resides in
> parquet-hadoop. This means the module would likely require a dependency on
> parquet-hadoop, making it pretty useless. This could be avoided by porting
> the supporting classes over to this new core module, but that could cause
> similar issues.
> As for replacing the Hadoop dependencies with hadoop-client-api and
> hadoop-client-runtime, this could indeed be nice for some use cases. It
> could avoid a big chunk of the Hadoop-related issues, though we would
> still require users to package parts of it. There are some convoluted ways
> this can be achieved now, which we could support out of the box, at least
> for writing to disk. I would like to think of this as more of a temporary
> solution though, as we would still be forcing pretty big dependencies on
> users that oftentimes do not need them.
> It seems to me that properly decoupling the reader/writer code from this
> dependency will likely require breaking changes in the future as it is
> hardwired in a large part of the logic. Maybe something to consider for the
> next major release?
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Friday, June 9, 2023 4:32 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> That may break many downstream projects. At least we cannot break
> parquet-hadoop (or any existing module). If you can add a new module
> like parquet-core that provides limited reader/writer features without
> hadoop support, and then make parquet-hadoop depend on parquet-core, that
> would be acceptable.
>
> One possible workaround is to replace the various Hadoop dependencies
> with hadoop-client-api and hadoop-client-runtime in parquet-mr. This
> would make it much easier for users to add the Hadoop dependency. But
> they are only available from Hadoop 3.0.0 onwards.
>
> On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > Backward compatibility does indeed seem challenging here, especially as
> > I'd rather see the writers/readers moved out of parquet-hadoop after
> > they've been decoupled. What are your thoughts on this?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 3:32 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > Hi Atour,
> >
> > Thanks for bringing this up!
> >
> > From what I observed in PARQUET-1822, I think it is a valid use
> > case to support parquet reading/writing without hadoop installed.
> > The challenge is backward compatibility. It would be great if you
> > could work on it.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Dear all,
> > >
> > > The Java implementations of the Parquet readers and writers seem pretty
> > > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this
> > > can cause issues as it's an unnecessary and big dependency when you
> > > might just need to write to disk. Is there any appetite here for
> > > separating the Hadoop code and supporting more convenient ways to write
> > > to disk out of the box? I am willing to work on these changes but would
> > > like some pointers on whether such patches would be reviewed and
> > > accepted, as PARQUET-1822 has been open for over three years now.
> > >
> > > Best regards,
> > > Atour Mousavi Gourabi
> > >
> >
>
