Hello, I am answering since I have been leading the refactor into hadoop-common. My criterion for moving a class into hadoop-common is that it is used by more than one other module or IO; this is why the module is not big, but it can grow if needed.
+1 for option #1 because of the visibility reasons you mention. For the concrete PR I have the following remarks:

From looking at the PR, I think you can already do the basic refactors that depend on hadoop-common today (to avoid adding repeated code):
- Remove NullWritableCoder from hadoop/inputformat and refactor to use WritableCoder from hadoop-common.
- Remove WritableCoder from hadoop/inputformat and refactor to use WritableCoder from hadoop-common.

I have other comments, but since those are not directly related to the refactoring I will address them in the PR.

Thanks for bringing this issue back to the mailing list, Stephen.

Ismaël

On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> By the way Stephen, when BEAM-59 is done, hadoop IO will only contain
> the hadoop format support (no HdfsFileSource or HdfsSink required, as
> it will use the "regular" FileIO).
>
> Agree?
>
> Regards
> JB
>
> On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
>> Hi Stephen,
>>
>> I agree to use the following structure (it's basically what I
>> proposed in a comment on the PR):
>>
>> io/hadoop
>> io/hadoop-common
>> io/hbase
>>
>> I would be more than happy to help on the "merge" of HdfsIO and
>> HadoopFormat.
>>
>> Regards
>> JB
>>
>> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
>>> I wanted to follow up on this thread since I see some potential
>>> blocking questions arising, and I'm trying to help Dipti along with
>>> her PR.
>>>
>>> Dipti's PR [1] is currently written to put files into:
>>> io/hadoop/inputformat
>>>
>>> The recent changes to create hadoop-common created:
>>> io/hadoop-common
>>>
>>> This means that the overall structure, if we take the HIFIO PR
>>> as-is, would be:
>>> io/hadoop/inputformat - the HIFIO (copies of some code in
>>> hadoop-common and hdfs, but no dependency on hadoop-common)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>> io/hdfs - FileInputFormat IO transforms - much shared code with
>>> hadoop/inputformat
>>>
>>> I don't think that's great, because there's a common dir that only
>>> some directories use, and there's lots of similar-but-slightly-different
>>> code in hadoop/inputformat and hdfsio. I don't believe anyone intends
>>> this to be the final result.
>>>
>>> After looking at the comments in this thread, I'd like to recommend
>>> the following end result (#1):
>>> io/hadoop - the HIFIO (dependency on hadoop-common) - contains both
>>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so the
>>> contents of hdfs and hadoop/inputformat)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>>
>>> To get there I propose the following steps:
>>> 1. finish the current PR [1], only renaming the containing module
>>> from hadoop/inputformat to hadoop and taking a dependency on
>>> hadoop-common
>>> 2. someone does cleanup to reconcile the hdfs and hadoop
>>> directories, including renaming the files so they make sense
>>>
>>> I would also be fine with (#2):
>>> io/hadoop - container dir only
>>> io/hadoop/common
>>> io/hadoop/hbase
>>> io/hadoop/inputformat
>>>
>>> I think the downside of #2 is that it hides hbase, which I think
>>> deserves to be top level.
>>>
>>> Other comments:
>>> It should be noted that when we have all modules use hadoop-common,
>>> we'll be forcing all hadoop modules to have the same hadoop
>>> dependencies. I think this makes sense, but it's worth noting, since
>>> independent versioning was the one advantage of the "every hadoop IO
>>> transform has its own hadoop dependency" approach.
>>>
>>> On the naming discussion: I personally prefer "inputformat" as the
>>> name of the directory, but I defer to the folks who know the hadoop
>>> community better.
>>>
>>> S
>>>
>>> [1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
>>> [2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
>>> dipti_dkulka...@persistent.com> wrote:
>>>
>>>> Thank you all for your inputs!
>>>>
>>>> -----Original Message-----
>>>> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
>>>> Sent: Friday, February 17, 2017 12:17 PM
>>>> To: dev@beam.apache.org
>>>> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
>>>>
>>>> Raghu, Amit -- +1 to your expertise :)
>>>>
>>>> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com>
>>>> wrote:
>>>>
>>>>> I agree with Dan on everything regarding HdfsFileSystem - it's super
>>>>> convenient for users to use TextIO with HdfsFileSystem rather than
>>>>> replacing the IO and also specifying the InputFormat type.
>>>>>
>>>>> I disagree on "HadoopIO" - I think that people who work with Hadoop
>>>>> would find this name intuitive, and that's what's important.
>>>>> Even more, and joining Raghu's comment, it is also recognized as
>>>>> "compatible with Hadoop", so for example someone running a Beam
>>>>> pipeline using the Spark runner on Amazon's S3 who wants to
>>>>> read/write Hadoop sequence files would simply use HadoopIO and
>>>>> provide the appropriate runtime dependencies (actually true for GS
>>>>> as well).
>>>>>
>>>>> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi
>>>>> <rang...@google.com.invalid> wrote:
>>>>>
>>>>>> FileInputFormat is extremely widely used; pretty much all the
>>>>>> file-based input formats extend it. All of them call into it to
>>>>>> list the input files and compute splits (with some tweaks on top
>>>>>> of that). The special API
>>>>>> ( *FileInputFormat.setMinInputSplitSize(job,
>>>>>> desiredBundleSizeBytes)* ) is how the split size is normally
>>>>>> communicated. New IO can use the API directly.
>>>>>>
>>>>>> HdfsIO as implemented in Beam is not HDFS-specific at all. There
>>>>>> are no hdfs imports, and the HDFS name does not appear anywhere
>>>>>> other than in HdfsIO's own class and method names.
>>>>>> AvroHdfsFileSource etc. would work just as well with the new IO.
>>>>>>
>>>>>> On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
>>>>>> <dhalp...@google.com.invalid> wrote:
>>>>>>
>>>>>>> (And I think renaming to HadoopIO doesn't make sense.
>>>>>>> "InputFormat" is the key component of the name -- it reads things
>>>>>>> that implement the InputFormat interface. "Hadoop" means a lot
>>>>>>> more than that.)
>>>>>>
>>>>>> Often 'IO' in Beam implies both sources and sinks. It might not be
>>>>>> long before we support Hadoop OutputFormat as well. In addition,
>>>>>> HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a
>>>>>> lot of things depending on the context; in the 'IO' context it
>>>>>> might not be too broad. Normally it implies 'any FileSystem
>>>>>> supported in Hadoop, e.g. S3'.
>>>>>>
>>>>>> Either way, I am quite confident that once HadoopInputFormatIO is
>>>>>> written, it can easily replace HdfsIO. That decision could be made
>>>>>> later.
>>>>>>
>>>>>> Raghu.
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
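The split-size API Raghu mentions upthread can be sketched as follows. This is an illustrative fragment, not code from the thread: it assumes the stock org.apache.hadoop.mapreduce classes, and the class name, input path, and bundle size are made up for the example.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // In a Beam source, the desired bundle size would be handed to the
    // source by the runner; 64 MB here is a made-up value.
    long desiredBundleSizeBytes = 64L * 1024 * 1024;

    // The "special API" from the thread: FileInputFormat (and therefore
    // the many file-based InputFormats that extend it) reads this
    // setting when computing input splits.
    FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);

    // Illustrative input path and format; any FileInputFormat subclass
    // would pick up the configured split size the same way.
    FileInputFormat.setInputPaths(job, "/path/to/input");
    job.setInputFormatClass(TextInputFormat.class);
  }
}
```

Because the setting travels through the job Configuration, a connector can communicate a bundle size to any file-based InputFormat without knowing its concrete type.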