My main concern with breaking changes is the effort it would take for downstream projects to adopt the new parquet release. We need to hear more voices from those communities to reach a consensus on whether breaking changes are acceptable.
I just took a glance at the hadoop dependencies, and it seems the major ones
are used for configuration, filesystem, and codec. Could we introduce a layer
of interfaces for them and make those hadoop classes concrete implementations?
I think this is the first step to split the core features of parquet from
hadoop. (Rough sketches of the kind of interfaces I mean are at the end of
this mail, below the quoted thread.)

Back to the hadoop-client-api proposal, my intention is to support basic
parquet features with only hadoop-client-api pulled into the dependencies,
and the full feature set with hadoop-client-runtime pulled in as well. Is
that possible?

On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]> wrote:

> Hi Gang,
>
> I don't think it's feasible to make a new module for it this way, as a lot
> of the support for this part of the code (codecs, etc.) resides in
> parquet-hadoop. This means the module would likely require a dependency on
> parquet-hadoop, making it pretty useless. This could be avoided by porting
> the supporting classes over to this new core module, but that could cause
> similar issues.
> As for replacing the Hadoop dependencies with hadoop-client-api and
> hadoop-client-runtime, this could indeed be nice for some use cases. It
> could avoid a big chunk of the Hadoop-related issues, though we would still
> require users to package parts of it. There are some convoluted ways this
> can be achieved now, which we could support out of the box, at least for
> writing to disk. I would like to think of this as more of a temporary
> solution though, as we would still be forcing pretty big dependencies on
> users that oftentimes do not need them.
> It seems to me that properly decoupling the reader/writer code from this
> dependency will likely require breaking changes in the future, as it is
> hardwired into a large part of the logic. Maybe something to consider for
> the next major release?
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Friday, June 9, 2023 4:32 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> That may break many downstream projects. At least we cannot break
> parquet-hadoop (or any existing module). If you can add a new module like
> parquet-core that provides limited reader/writer features without hadoop
> support, and then make parquet-hadoop depend on parquet-core, that would
> be acceptable.
>
> One possible workaround is to replace the various Hadoop dependencies with
> hadoop-client-api and hadoop-client-runtime in parquet-mr. This would make
> it much easier for users to add the Hadoop dependency, but they are only
> available from Hadoop 3.0.0.
>
> On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > Backward compatibility does indeed seem challenging here, especially as
> > I'd rather see the writers/readers moved out of parquet-hadoop after
> > they've been decoupled. What are your thoughts on this?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 3:32 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > Hi Atour,
> >
> > Thanks for bringing this up!
> >
> > From what I observed in PARQUET-1822, I think it is a valid use case to
> > support parquet reading/writing without hadoop installed. The challenge
> > is backward compatibility. It would be great if you can work on it.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Dear all,
> > >
> > > The Java implementations of the Parquet readers and writers seem pretty
> > > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this
> > > can cause issues as it's an unnecessary and big dependency when you
> > > might just need to write to disk. Is there any appetite here for
> > > separating the Hadoop code and supporting more convenient ways to write
> > > to disk out of the box? I am willing to work on these changes but would
> > > like some pointers on whether such patches would be reviewed and
> > > accepted, as PARQUET-1822 has been open for over three years now.
> > >
> > > Best regards,
> > > Atour Mousavi Gourabi
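
To make the interface idea more concrete, here is a rough sketch of what the
configuration piece could look like. The names ParquetConfiguration and
HadoopParquetConfiguration are made up for illustration and nothing like this
exists in parquet-mr today; the point is only that parquet-core would program
against the interface while parquet-hadoop ships the Hadoop-backed
implementation.

// Hypothetical interface, for illustration only. parquet-core would depend
// solely on this and never see a Hadoop class.
public interface ParquetConfiguration {
  String get(String name);
  String get(String name, String defaultValue);
  void set(String name, String value);
  boolean getBoolean(String name, boolean defaultValue);
  int getInt(String name, int defaultValue);
}

// Hypothetical Hadoop-backed implementation that would live in parquet-hadoop,
// delegating straight to org.apache.hadoop.conf.Configuration.
class HadoopParquetConfiguration implements ParquetConfiguration {
  private final org.apache.hadoop.conf.Configuration conf;

  HadoopParquetConfiguration(org.apache.hadoop.conf.Configuration conf) {
    this.conf = conf;
  }

  @Override public String get(String name) { return conf.get(name); }
  @Override public String get(String name, String defaultValue) { return conf.get(name, defaultValue); }
  @Override public void set(String name, String value) { conf.set(name, value); }
  @Override public boolean getBoolean(String name, boolean defaultValue) { return conf.getBoolean(name, defaultValue); }
  @Override public int getInt(String name, int defaultValue) { return conf.getInt(name, defaultValue); }
}

The codec abstraction could follow the same pattern, with the existing Hadoop
CompressionCodec wrappers becoming one concrete implementation.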

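And related to the ask earlier in the thread about writing to disk out of the
box: for the filesystem part we already have the org.apache.parquet.io.InputFile
and OutputFile interfaces, with HadoopInputFile/HadoopOutputFile as the
Hadoop-backed implementations, so the missing piece is mostly a local
implementation that ships by default. Below is a rough, untested sketch of a
java.nio based OutputFile (the class name NioOutputFile is made up for
illustration); the reader side could mirror it on top of FileChannel.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Illustrative sketch only: a local-filesystem OutputFile built on java.nio,
// so a writer targeting the OutputFile abstraction needs no org.apache.hadoop.fs.Path.
public class NioOutputFile implements OutputFile {
  private final Path path;

  public NioOutputFile(Path path) {
    this.path = path;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(path, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE));
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(
        path, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE));
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // a local file has no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  private static PositionOutputStream stream(FileChannel channel) {
    return new PositionOutputStream() {
      @Override
      public long getPos() throws IOException {
        return channel.position();
      }

      @Override
      public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(b, off, len);
        while (buf.hasRemaining()) {
          channel.write(buf); // FileChannel may write partially, so loop
        }
      }

      @Override
      public void close() throws IOException {
        channel.close();
      }
    };
  }
}

Even with something like this, the writer and reader builders still pull in
Hadoop Configuration and the Hadoop codec classes today, which is exactly the
coupling this thread is about.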