My main concern about breaking changes is the effort required for downstream
projects to adopt the new Parquet release. We need to hear from more of those
communities to reach a consensus on whether breaking changes are
acceptable.

I just took a glance at the Hadoop dependencies; the major ones seem to be
used for configuration, filesystem, and codecs. Could we introduce a layer
of interfaces for them and make the Hadoop classes concrete
implementations of those interfaces? I think this is the first step toward
splitting the core features of Parquet from Hadoop.
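
To make the idea concrete, here is a minimal sketch in plain Java of what such
an interface layer could look like. All type names (ConfView, OutputTarget,
BlockCodec, and their implementations) are hypothetical and are not existing
parquet-mr APIs. The assumption is that Hadoop-backed implementations wrapping
Configuration, FileSystem, and CompressionCodec would stay in parquet-hadoop,
while the core write path only sees the interfaces.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// All type names in this file are hypothetical, for illustration only.
class DecouplingSketch {
    // Core write path: depends only on the interfaces, never on Hadoop classes.
    static void writeBlock(ConfView conf, OutputTarget target, byte[] data) throws IOException {
        BlockCodec codec = codecFor(conf.get("codec", "none"));
        try (OutputStream out = target.create()) {
            out.write(codec.compress(data));
        }
    }

    // Picks a codec by name; anything other than "gzip" passes bytes through.
    static BlockCodec codecFor(String name) {
        if ("gzip".equals(name)) {
            return new GzipCodec();
        }
        return raw -> raw;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sketch", ".bin");
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        writeBlock(new MapConf(Map.of("codec", "gzip")), new LocalTarget(file), data);
        System.out.println("wrote " + Files.size(file) + " bytes using only JDK classes");
    }
}

/** Read-only settings view; a Hadoop Configuration could be wrapped behind it. */
interface ConfView {
    String get(String key, String defaultValue);
}

/** Destination for bytes; a Hadoop FileSystem/Path pair could be wrapped behind it. */
interface OutputTarget {
    OutputStream create() throws IOException;
}

/** Compression hook; a Hadoop CompressionCodec could be wrapped behind it. */
interface BlockCodec {
    byte[] compress(byte[] raw) throws IOException;
}

/** Hadoop-free configuration backed by a plain map. */
final class MapConf implements ConfView {
    private final Map<String, String> values;
    MapConf(Map<String, String> values) { this.values = values; }
    public String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

/** Hadoop-free output backed by java.nio; overwrites the file if it exists. */
final class LocalTarget implements OutputTarget {
    private final Path path;
    LocalTarget(Path path) { this.path = path; }
    public OutputStream create() throws IOException {
        return Files.newOutputStream(path);
    }
}

/** Hadoop-free gzip codec backed by java.util.zip. */
final class GzipCodec implements BlockCodec {
    public byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        return buf.toByteArray();
    }
}
```

With this shape, existing callers of parquet-hadoop would keep passing Hadoop
objects (through adapters), and users who only need to write to local disk
could use the JDK-backed implementations.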

Back to the hadoop-client-api proposal: my intention is to support basic
Parquet features with only hadoop-client-api pulled in as a dependency, and
to enable the full feature set when hadoop-client-runtime is pulled in as
well. Is that possible?

On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]>
wrote:

> Hi Gang,
>
> I don't think it's feasible to make a new module for it this way as a lot
> of the support for this part of the code (codecs, etc.) resides in
> parquet-hadoop. This means the module would likely require a dependency on
> parquet-hadoop, making it pretty useless. This could be avoided by porting
> the supporting classes over to this new core module, but that could cause
> similar issues.
> As for replacing the Hadoop dependencies by hadoop-client-api and
> hadoop-client-runtime, this could indeed be nice for some use-cases. It
> could avoid a big chunk of the Hadoop related issues, though we still
> require users to package parts of it. There are some convoluted ways this
> can be achieved now, which we could support out of the box, at least for
> writing to disk. I would like to think of this as more of a temporary
> solution though, as we would still be forcing pretty big dependencies on
> users that oftentimes do not need them.
> It seems to me that properly decoupling the reader/writer code from this
> dependency will likely require breaking changes in the future as it is
> hardwired in a large part of the logic. Maybe something to consider for the
> next major release?
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Friday, June 9, 2023 4:32 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> That may break many downstream projects. At least we cannot break
> parquet-hadoop (and any existing module). If you can add a new module
> like parquet-core and provide limited reader/writer features without hadoop
> support, and then make parquet-hadoop depend on parquet-core, that
> would be acceptable.
>
> One possible workaround is to replace the various Hadoop dependencies
> in parquet-mr with hadoop-client-api and hadoop-client-runtime. This
> may make it much easier for users to add the Hadoop dependency. But these
> artifacts are only available from Hadoop 3.0.0.
>
> On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > Backward compatibility does indeed seem challenging here. Especially as
> > I'd rather see the writers/readers moved out of parquet-hadoop after
> > they've been decoupled. What are your thoughts on this?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 3:32 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > Hi Atour,
> >
> > Thanks for bringing this up!
> >
> > From what I observed from PARQUET-1822, I think it is a valid use
> > case to support parquet reading/writing without hadoop installed.
> > The challenge is backward compatibility. It would be great if you can
> > work on it.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Dear all,
> > >
> > > > The Java implementations of the Parquet readers and writers seem pretty
> > > > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this can
> > > > cause issues as it's an unnecessary and big dependency when you might just
> > > > need to write to disk. Is there any appetite here for separating the Hadoop
> > > > code and supporting more convenient ways to write to disk out of the box? I
> > > > am willing to work on these changes but would like some pointers on whether
> > > > such patches would be reviewed and accepted as PARQUET-1822 has been open
> > > > for over three years now.
> > >
> > > Best regards,
> > > Atour Mousavi Gourabi
> > >
> >
>
