Yes, a PR would be welcome!

On Sat, Jun 10, 2023 at 7:50 PM Atour Mousavi Gourabi <[email protected]> wrote:
> Hi Gang,
>
> The breaking changes are a valid concern, so I agree we should consult
> with downstream communities before releasing any.
> Right now, we do already make limited use of the interfaces you describe
> (for the filesystem). These enable users to read and write Parquet without
> installing Hadoop on their system in a slightly convoluted way. They will
> also still need to package the Hadoop dependencies, but it's something I
> think we should support by providing the implementations they'd need to
> make this work out of the box. I can have a PR open for this quickly if you
> agree we should support it.
> As for not packaging hadoop-client-runtime, we would need to first include
> the implementations described above, and then introduce some abstraction
> over at least the Hadoop Configuration. I think this should be feasible to
> implement in a non-breaking way, though I could not give you a timeline.
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Saturday, June 10, 2023 12:20 PM
> To: [email protected] <[email protected]>
> Subject: Re: Parquet without Hadoop dependencies
>
> My main concern of breaking change is the effort to take for downstream
> projects to adopt the new parquet release. We need to hear more voices
> from those communities to make a consensus if breaking changes are
> acceptable.
>
> I just took a glance at hadoop dependencies, it seems the major ones are
> used for configuration, filesystem and codec. Could we introduce a layer
> of interfaces for them and make those hadoop classes as concrete
> implementations of them? I think this is the first step to split the core
> features of parquet from hadoop.
>
> Back to the hadoop-client-api proposal, my intention is to support basic
> parquet features with only hadoop-client-api pulled in the dependencies.
> And use full feature with hadoop-client-runtime pulled. Is that possible?
>
> On Sat, Jun 10, 2023 at 4:27 AM Atour Mousavi Gourabi <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > I don't think it's feasible to make a new module for it this way as a lot
> > of the support for this part of the code (codecs, etc.) resides in
> > parquet-hadoop. This means the module would likely require a dependency
> > on parquet-hadoop, making it pretty useless. This could be avoided by
> > porting the supporting classes over to this new core module, but that
> > could cause similar issues.
> > As for replacing the Hadoop dependencies by hadoop-client-api and
> > hadoop-client-runtime, this could indeed be nice for some use-cases. It
> > could avoid a big chunk of the Hadoop related issues, though we still
> > require users to package parts of it. There are some convoluted ways this
> > can be achieved now, which we could support out of the box, at least for
> > writing to disk. I would like to think of this as more of a temporary
> > solution though, as we would still be forcing pretty big dependencies on
> > users that oftentimes do not need them.
> > It seems to me that properly decoupling the reader/writer code from this
> > dependency will likely require breaking changes in the future as it is
> > hardwired in a large part of the logic. Maybe something to consider for
> > the next major release?
> >
> > Best regards,
> > Atour
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Friday, June 9, 2023 4:32 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Parquet without Hadoop dependencies
> >
> > That may break many downstream projects. At least we cannot break
> > parquet-hadoop (and any existing module). If you can add a new module
> > like parquet-core and provide limited reader/writer features without
> > hadoop support, and then make parquet-hadoop depend on parquet-core,
> > that would be acceptable.
> >
> > One possible workaround is to replace various Hadoop dependencies
> > by hadoop-client-api and hadoop-client-runtime in the parquet-mr. This
> > may be much easier for users to add Hadoop dependency. But they are
> > only available from Hadoop 3.0.0.
> >
> > On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <[email protected]>
> > wrote:
> >
> > > Hi Gang,
> > >
> > > Backward compatibility does indeed seem challenging here. Especially as
> > > I'd rather see the writers/readers moved out of parquet-hadoop after
> > > they've been decoupled. What are your thoughts on this?
> > >
> > > Best regards,
> > > Atour
> > > ________________________________
> > > From: Gang Wu <[email protected]>
> > > Sent: Friday, June 9, 2023 3:32 AM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: Parquet without Hadoop dependencies
> > >
> > > Hi Atour,
> > >
> > > Thanks for bringing this up!
> > >
> > > From what I observed from PARQUET-1822, I think it is a valid use
> > > case to support parquet reading/writing without hadoop installed.
> > > The challenge is backward compatibility. It would be great if you can
> > > work on it.
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <[email protected]>
> > > wrote:
> > >
> > > > Dear all,
> > > >
> > > > The Java implementations of the Parquet readers and writers seem
> > > > pretty tightly coupled to Hadoop (see: PARQUET-1822). For some
> > > > projects, this can cause issues as it's an unnecessary and big
> > > > dependency when you might just need to write to disk. Is there any
> > > > appetite here for separating the Hadoop code and supporting more
> > > > convenient ways to write to disk out of the box? I am willing to work
> > > > on these changes but would like some pointers on whether such patches
> > > > would be reviewed and accepted as PARQUET-1822 has been open for over
> > > > three years now.
> > > >
> > > > Best regards,
> > > > Atour Mousavi Gourabi
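
As a rough illustration of the "layer of interfaces" idea raised in the thread (interfaces for configuration, filesystem and codec, with the Hadoop classes as concrete implementations), the Java sketch below shows core code depending on a small configuration interface instead of on a Hadoop class. All type names here (ConfigView, MapConfigView, SketchWriter) are hypothetical and are not part of parquet-mr, and the property key is used only as an example. A Hadoop-backed implementation would presumably delegate to org.apache.hadoop.conf.Configuration; it is left out so the example runs with no Hadoop jars on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Entry point declared first so the file can be run directly with `java NoHadoopSketch.java`.
class NoHadoopSketch {
    public static void main(String[] args) {
        ConfigView conf = new MapConfigView();
        conf.set("parquet.compression", "SNAPPY");
        // prints "compression=SNAPPY"
        System.out.println(SketchWriter.describe(conf));
    }
}

// Hypothetical abstraction over the few configuration calls the core code needs.
// A second implementation could wrap a Hadoop Configuration (assumption, not shown).
interface ConfigView {
    String get(String key, String defaultValue);

    void set(String key, String value);
}

// Hadoop-free implementation backed by a plain in-memory map.
final class MapConfigView implements ConfigView {
    private final Map<String, String> values = new HashMap<>();

    @Override
    public String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }

    @Override
    public void set(String key, String value) {
        values.put(key, value);
    }
}

// Stand-in for reader/writer code: it sees only the interface, never Hadoop.
final class SketchWriter {
    static String describe(ConfigView conf) {
        return "compression=" + conf.get("parquet.compression", "UNCOMPRESSED");
    }
}
```

Under this assumed design, existing Hadoop users would keep passing their Configuration through a wrapper, which is one way the change could stay non-breaking, while users who only write to local disk would pick the map-backed implementation.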
