My main concern with breaking changes is the effort it would take for downstream projects to adopt the new parquet release. We need to hear more voices from those communities to reach a consensus on whether breaking changes are acceptable.
I just took a glance at the hadoop dependencies, and it seems the major ones
are used for configuration, filesystem, and codec. Could we introduce a layer
of interfaces for them and make those hadoop classes concrete implementations?
I think this is the first step to split the core features of parquet from
hadoop. (Rough sketches of the kind of interfaces I mean are at the end of
this mail, below the quoted thread.)

Back to the hadoop-client-api proposal, my intention is to support basic
parquet features with only hadoop-client-api pulled into the dependencies,
and the full feature set with hadoop-client-runtime pulled in as well. Is
that possible?

On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]> wrote:

> Hi Gang,
>
> I don't think it's feasible to make a new module for it this way, as a lot
> of the support for this part of the code (codecs, etc.) resides in
> parquet-hadoop. This means the module would likely require a dependency on
> parquet-hadoop, making it pretty useless. This could be avoided by porting
> the supporting classes over to this new core module, but that could cause
> similar issues.
> As for replacing the Hadoop dependencies with hadoop-client-api and
> hadoop-client-runtime, this could indeed be nice for some use cases. It
> could avoid a big chunk of the Hadoop-related issues, though we would still
> require users to package parts of it. There are some convoluted ways this
> can be achieved now, which we could support out of the box, at least for
> writing to disk. I would like to think of this as more of a temporary
> solution though, as we would still be forcing pretty big dependencies on
> users that oftentimes do not need them.
> It seems to me that properly decoupling the reader/writer code from this
> dependency will likely require breaking changes in the future, as it is
> hardwired into a large part of the logic. Maybe something to consider for
> the next major release?
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Friday, June 9, 2023 4:32 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> That may break many downstream projects. At least we cannot break
> parquet-hadoop (or any existing module). If you can add a new module like
> parquet-core that provides limited reader/writer features without hadoop
> support, and then make parquet-hadoop depend on parquet-core, that would
> be acceptable.
>
> One possible workaround is to replace the various Hadoop dependencies with
> hadoop-client-api and hadoop-client-runtime in parquet-mr. This would make
> it much easier for users to add the Hadoop dependency, but they are only
> available from Hadoop 3.0.0.
>
> On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > Backward compatibility does indeed seem challenging here, especially as
> > I'd rather see the writers/readers moved out of parquet-hadoop after
> > they've been decoupled. What are your thoughts on this?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 3:32 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > Hi Atour,
> >
> > Thanks for bringing this up!
> >
> > From what I observed in PARQUET-1822, I think it is a valid use case to
> > support parquet reading/writing without hadoop installed. The challenge
> > is backward compatibility. It would be great if you can work on it.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Dear all,
> > >
> > > The Java implementations of the Parquet readers and writers seem pretty
> > > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this
> > > can cause issues as it's an unnecessary and big dependency when you
> > > might just need to write to disk. Is there any appetite here for
> > > separating the Hadoop code and supporting more convenient ways to write
> > > to disk out of the box? I am willing to work on these changes but would
> > > like some pointers on whether such patches would be reviewed and
> > > accepted, as PARQUET-1822 has been open for over three years now.
> > >
> > > Best regards,
> > > Atour Mousavi Gourabi
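
To make the interface idea more concrete, here is a rough sketch of what the
configuration piece could look like. The names ParquetConfiguration and
HadoopParquetConfiguration are made up for illustration and nothing like this
exists in parquet-mr today; the point is only that parquet-core would program
against the interface while parquet-hadoop ships the Hadoop-backed
implementation.

// Hypothetical interface, for illustration only. parquet-core would depend
// solely on this and never see a Hadoop class.
public interface ParquetConfiguration {
  String get(String name);
  String get(String name, String defaultValue);
  void set(String name, String value);
  boolean getBoolean(String name, boolean defaultValue);
  int getInt(String name, int defaultValue);
}

// Hypothetical Hadoop-backed implementation that would live in parquet-hadoop,
// delegating straight to org.apache.hadoop.conf.Configuration.
class HadoopParquetConfiguration implements ParquetConfiguration {
  private final org.apache.hadoop.conf.Configuration conf;

  HadoopParquetConfiguration(org.apache.hadoop.conf.Configuration conf) {
    this.conf = conf;
  }

  @Override public String get(String name) { return conf.get(name); }
  @Override public String get(String name, String defaultValue) { return conf.get(name, defaultValue); }
  @Override public void set(String name, String value) { conf.set(name, value); }
  @Override public boolean getBoolean(String name, boolean defaultValue) { return conf.getBoolean(name, defaultValue); }
  @Override public int getInt(String name, int defaultValue) { return conf.getInt(name, defaultValue); }
}

The codec abstraction could follow the same pattern, with the existing Hadoop
CompressionCodec wrappers becoming one concrete implementation.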

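And related to the ask earlier in the thread about writing to disk out of the
box: for the filesystem part we already have the org.apache.parquet.io.InputFile
and OutputFile interfaces, with HadoopInputFile/HadoopOutputFile as the
Hadoop-backed implementations, so the missing piece is mostly a local
implementation that ships by default. Below is a rough, untested sketch of a
java.nio based OutputFile (the class name NioOutputFile is made up for
illustration); the reader side could mirror it on top of FileChannel.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Illustrative sketch only: a local-filesystem OutputFile built on java.nio,
// so a writer targeting the OutputFile abstraction needs no org.apache.hadoop.fs.Path.
public class NioOutputFile implements OutputFile {
  private final Path path;

  public NioOutputFile(Path path) {
    this.path = path;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(path, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE));
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(
        path, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE));
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // a local file has no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  private static PositionOutputStream stream(FileChannel channel) {
    return new PositionOutputStream() {
      @Override
      public long getPos() throws IOException {
        return channel.position();
      }

      @Override
      public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(b, off, len);
        while (buf.hasRemaining()) {
          channel.write(buf); // FileChannel may write partially, so loop
        }
      }

      @Override
      public void close() throws IOException {
        channel.close();
      }
    };
  }
}

Even with something like this, the writer and reader builders still pull in
Hadoop Configuration and the Hadoop codec classes today, which is exactly the
coupling this thread is about.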