Re: Parquet without Hadoop dependencies

Gang Wu Sun, 11 Jun 2023 19:11:12 -0700

Yes, a PR would be welcome!

On Sat, Jun 10, 2023 at 7:50 PM Atour Mousavi Gourabi <[email protected]>
wrote:


> Hi Gang,
>
> The breaking changes are a valid concern, so I agree we should consult
> with downstream communities before releasing any.
> Right now, we do already make limited use of the interfaces you describe
> (for the filesystem). These enable users to read and write Parquet without
> installing Hadoop on their system in a slightly convoluted way. They will
> also still need to package the Hadoop dependencies, but it's something I
> think we should support by providing the implementations they'd need to
> make this work out of the box. I can have a PR open for this quickly if you
> agree we should support it.
> As for not packaging hadoop-client-runtime, we would need to first include
> the implementations described above, and then introduce some abstraction
> over at least the Hadoop Configuration. I think this should be feasible to
> implement in a non-breaking way, though I could not give you a timeline.
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Saturday, June 10, 2023 12:20 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> My main concern about breaking changes is the effort it takes for downstream
> projects to adopt the new Parquet release. We need to hear more voices
> from those communities to reach a consensus on whether breaking changes are
> acceptable.
>
> I just took a glance at the Hadoop dependencies; it seems the major ones are
> used for configuration, filesystem, and codecs. Could we introduce a layer
> of interfaces for them and make those Hadoop classes concrete
> implementations of them? I think this is the first step toward splitting the
> core features of Parquet from Hadoop.
>
> Back to the hadoop-client-api proposal: my intention is to support basic
> Parquet features with only hadoop-client-api pulled into the dependencies,
> and the full feature set with hadoop-client-runtime pulled in as well. Is
> that possible?
>
> On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > I don't think it's feasible to make a new module for it this way as a lot
> > of the support for this part of the code (codecs, etc.) resides in
> > parquet-hadoop. This means the module would likely require a dependency on
> > parquet-hadoop, making it pretty useless. This could be avoided by porting
> > the supporting classes over to this new core module, but that could cause
> > similar issues.
> > As for replacing the Hadoop dependencies by hadoop-client-api and
> > hadoop-client-runtime, this could indeed be nice for some use-cases. It
> > could avoid a big chunk of the Hadoop related issues, though we still
> > require users to package parts of it. There are some convoluted ways this
> > can be achieved now, which we could support out of the box, at least for
> > writing to disk. I would like to think of this as more of a temporary
> > solution though, as we would still be forcing pretty big dependencies on
> > users that oftentimes do not need them.
> > It seems to me that properly decoupling the reader/writer code from this
> > dependency will likely require breaking changes in the future as it is
> > hardwired in a large part of the logic. Maybe something to consider for the
> > next major release?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 4:32 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > That may break many downstream projects. At least we cannot break
> > parquet-hadoop (and any existing module). If you can add a new module
> > like parquet-core and provide limited reader/writer features without
> > Hadoop support, and then make parquet-hadoop depend on parquet-core, that
> > would be acceptable.
> >
> > One possible workaround is to replace the various Hadoop dependencies
> > in parquet-mr with hadoop-client-api and hadoop-client-runtime. This
> > may make it much easier for users to add the Hadoop dependency. But these
> > artifacts are only available from Hadoop 3.0.0 onward.
> >
> > On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Hi Gang,
> > >
> > > Backward compatibility does indeed seem challenging here. Especially as
> > > I'd rather see the writers/readers moved out of parquet-hadoop after
> > > they've been decoupled. What are your thoughts on this?
> > >
> > > Best regards,
> > > Atour
> > > ________________________________
> > > From: Gang Wu <[email protected]>
> > > Sent: Friday, June 9, 2023 3:32 AM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: Parquet without Hadoop dependencies
> > >
> > > Hi Atour,
> > >
> > > Thanks for bringing this up!
> > >
> > > From what I observed from PARQUET-1822, I think it is a valid use
> > > case to support parquet reading/writing without hadoop installed.
> > > The challenge is backward compatibility. It would be great if you can
> > > work on it.
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > > wrote:
> > >
> > > > Dear all,
> > > >
> > > > The Java implementations of the Parquet readers and writers seem pretty
> > > > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this can
> > > > cause issues as it's an unnecessary and big dependency when you might just
> > > > need to write to disk. Is there any appetite here for separating the Hadoop
> > > > code and supporting more convenient ways to write to disk out of the box? I
> > > > am willing to work on these changes but would like some pointers on whether
> > > > such patches would be reviewed and accepted as PARQUET-1822 has been open
> > > > for over three years now.
> > > >
> > > > Best regards,
> > > > Atour Mousavi Gourabi
> > > >
> > >
> >
>
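
For readers following the thread, the "layer of interfaces" idea discussed in the quoted messages (abstractions for configuration and filesystem, with the Hadoop classes as one set of concrete implementations) could look roughly like the sketch below. This is only an illustration under stated assumptions: the names Conf, Output, MapConf, LocalOutput and the key "sketch.header" are hypothetical, they are not parquet-mr or Hadoop APIs, and the write method emits a few placeholder bytes rather than a real Parquet file. A Hadoop-backed Conf or Output would simply be another implementation of the same interfaces.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class DecouplingSketch {

    /** Hypothetical stand-in for a configuration abstraction. */
    public interface Conf {
        String get(String key, String defaultValue);
    }

    /** Hypothetical stand-in for a filesystem-neutral output target. */
    public interface Output {
        OutputStream create() throws IOException;
    }

    /** Plain-Java configuration backed by a map; no Hadoop classes involved. */
    public static final class MapConf implements Conf {
        private final Map<String, String> values;

        public MapConf(Map<String, String> values) {
            this.values = values;
        }

        @Override
        public String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    /** Local-disk output built on java.nio only. */
    public static final class LocalOutput implements Output {
        private final Path path;

        public LocalOutput(Path path) {
            this.path = path;
        }

        @Override
        public OutputStream create() throws IOException {
            return Files.newOutputStream(path);
        }
    }

    /**
     * Placeholder "writer": it sees only the two interfaces, so it compiles
     * and runs without any Hadoop dependency. Returns the bytes written.
     */
    public static long write(Output out, Conf conf, byte[] payload) throws IOException {
        byte[] header = conf.get("sketch.header", "HDR").getBytes(StandardCharsets.US_ASCII);
        try (OutputStream os = out.create()) {
            os.write(header);
            os.write(payload);
        }
        return (long) header.length + payload.length;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".bin");
        try {
            long written = write(new LocalOutput(tmp), new MapConf(Map.of()), new byte[] {1, 2, 3});
            System.out.println(written + " bytes written; file size is " + Files.size(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the design is that the writer code never names a concrete class, so existing Hadoop-based callers could keep working through adapter implementations while new callers use the plain-Java ones.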
