Thanks for your time thinking about this!

Sounds like we all like #1. That's great. I see that Ismael commented on
the PR to suggest the specific changes, so I think we should be good to go.

To answer JB's question about the later merging:
> when BEAM-59 is done, hadoop IO will only contain the hadoop format
> support (no HdfsFileSource or HdfsSink required as it will use the
> "regular" FileIO).
I think there are still some improvements to be had by having at least some
code unique to HdfsFileSource/Sink. Dan called out above "the
FileInputFormat reader gets to call some special APIs that the
generic InputFormat reader cannot -- so they are not completely
redundant. Specifically,
FileInputFormat reader can do size-based splitting."

Dan recommended: "See if we can "inline" the FileInputFormat specific parts
of HdfsIO inside of HadoopInputFormatIO via reflection. If so, we can get
the best of both worlds with shared code." This seems reasonable to me.
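
For the curious, here's a rough sketch of what that reflection approach could look like. The helper class and method names below are made up for illustration; only the Hadoop class name and the FileInputFormat.setMinInputSplitSize(Job, long) API are real. The idea is that HadoopInputFormatIO could probe for FileInputFormat at runtime and enable size-based splitting when it's present, without a compile-time dependency:

```java
import java.lang.reflect.Method;

// Sketch of the "inline via reflection" idea: probe for FileInputFormat and
// its setMinInputSplitSize API dynamically, so this compiles with no Hadoop
// dependency. SplitSizeHelper/trySetMinSplitSize are hypothetical names.
public class SplitSizeHelper {
  public static boolean trySetMinSplitSize(
      Object inputFormat, Object job, long desiredBundleSizeBytes) {
    try {
      Class<?> fif =
          Class.forName("org.apache.hadoop.mapreduce.lib.input.FileInputFormat");
      if (!fif.isInstance(inputFormat)) {
        // A generic InputFormat: no size-based splitting available.
        return false;
      }
      Class<?> jobCls = Class.forName("org.apache.hadoop.mapreduce.Job");
      Method m = fif.getMethod("setMinInputSplitSize", jobCls, long.class);
      m.invoke(null, job, desiredBundleSizeBytes); // static method, null receiver
      return true;
    } catch (ReflectiveOperationException e) {
      // Hadoop not on the classpath, or the API moved.
      return false;
    }
  }
}
```

If the probe returns false, the IO would just fall back to whatever generic splitting it already does for arbitrary InputFormats, which is exactly the "best of both worlds" Dan described.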

I created BEAM-1592 so we can further discuss there if need be.

S

On Thu, Mar 2, 2017 at 6:45 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> ​Hello,
>
> I am answering since I have been leading the refactor to hadoop-common. My
> criterion for moving a class into hadoop-common is that it is used by more
> than one other module or IO; this is why the module is not big, but it can
> grow if needed.
>
> +1 for option #1 because of the visibility reasons you mention.
> For the concrete PR I have the following remarks:
>
> From looking at the PR I think today you can already do the basic refactors
> that depend on hadoop-common (to avoid adding repeated code):
> - Remove NullWritableCoder from hadoop/inputformat and refactor to use
> WritableCoder from hadoop-common.
> - Remove WritableCoder from hadoop/inputformat and refactor to use
> WritableCoder from hadoop-common.
>
> I have other comments but since those are not directly related to the
> refactoring I will address those in the PR.
>
> Thanks for bringing this issue back to the mailing-list Stephen.
> Ismaël
> ​
>
>
> On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > By the way Stephen, when BEAM-59 is done, hadoop IO will only contain
> > the hadoop format support (no HdfsFileSource or HdfsSink required, as
> > it will use the "regular" FileIO).
> >
> > Agree ?
> >
> > Regards
> > JB
> >
> >
> > On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
> >
> >> Hi Stephen,
> >>
> >> I agree to use the following structure (and it's basically what I
> >> proposed in a comment of the PR):
> >>
> >> io/hadoop
> >> io/hadoop-common
> >> io/hbase
> >>
> >> I would be more than happy to help on the "merge" of HdfsIO and
> >> HadoopFormat.
> >>
> >> Regards
> >> JB
> >>
> >> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
> >>
> >>> I wanted to follow up on this thread since I see some potential
> >>> blocking questions arising, and I'm trying to help Dipti along with
> >>> her PR.
> >>>
> >>> Dipti's PR[1] is currently written to put files into:
> >>> io/hadoop/inputformat
> >>>
> >>> The recent changes to create hadoop-common created:
> >>> io/hadoop-common
> >>>
> >>> This means that the overall structure, if we take the HIFIO PR as-is,
> >>> would be:
> >>> io/hadoop/inputformat - the HIFIO (copies of some code in
> >>> hadoop-common and hdfs, but no dependency on hadoop-common)
> >>> io/hadoop-common - module with some shared code
> >>> io/hbase - hbase IO transforms
> >>> io/hdfs - FileInputFormat IO transforms - much shared code with
> >>> hadoop/inputformat.
> >>>
> >>> Which I don't think is great b/c there's a common dir, but only some
> >>> directories use it, and there's lots of similar-but-slightly different
> >>> code
> >>> in hadoop/inputformat and hdfsio. I don't believe anyone intends this
> >>> to be
> >>> the final result.
> >>>
> >>> After looking at the comments in this thread, I'd like to recommend the
> >>> following end-result: (#1)
> >>> io/hadoop - the HIFIO (dependency on hadoop-common) - contains both
> >>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so the
> >>> contents of hdfs and hadoop/inputformat)
> >>> io/hadoop-common - module with some shared code
> >>> io/hbase - hbase IO transforms
> >>>
> >>> To get there I propose the following steps:
> >>> 1. finish the current PR [1], only renaming the containing module from
> >>> hadoop/inputformat -> hadoop and taking a dependency on hadoop-common
> >>> 2. someone does cleanup to reconcile the hdfs and hadoop directories,
> >>> including renaming the files so they make sense
> >>>
> >>> I would also be fine with: (#2)
> >>> io/hadoop - container dir only
> >>> io/hadoop/common
> >>> io/hadoop/hbase
> >>> io/hadoop/inputformat
> >>>
> >>> I think the downside of #2 is that it hides hbase, which I think
> >>> deserves to be top level.
> >>>
> >>> Other comments:
> >>> It should be noted that when we have all modules use hadoop-common,
> >>> we'll be forcing all hadoop modules to have the same dependencies on
> >>> hadoop - I think this makes sense, but it's worth noting, since
> >>> separate dependencies were the one advantage of the "every hadoop IO
> >>> transform has its own hadoop dependency" approach.
> >>>
> >>> On the naming discussion: I personally prefer "inputformat" as the
> >>> name of
> >>> the directory, but I defer to the folks who know the hadoop community
> >>> more.
> >>>
> >>> S
> >>>
> >>> [1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
> >>> [2] HdfsIO dependency change PR -
> >>> https://github.com/apache/beam/pull/2087
> >>>
> >>>
> >>> On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
> >>> dipti_dkulka...@persistent.com> wrote:
> >>>
Thank you all for your inputs!
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
> >>>> Sent: Friday, February 17, 2017 12:17 PM
> >>>> To: dev@beam.apache.org
> >>>> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
> >>>>
> >>>> Raghu, Amit -- +1 to your expertise :)
> >>>>
> >>>> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> I agree with Dan on everything regarding HdfsFileSystem - it's super
> >>>>> convenient for users to use TextIO with HdfsFileSystem rather than
> >>>>> replacing the IO and also specifying the InputFormat type.
> >>>>>
> >>>>> I disagree on "HadoopIO" - I think that people who work with Hadoop
> >>>>> would find this name intuitive, and that's what's important.
> >>>>> Even more, and joining Raghu's comment, it is also recognized as
> >>>>> "compatible with Hadoop", so for example someone running a Beam
> >>>>> pipeline using the Spark runner on Amazon's S3 and wants to
> >>>>> read/write Hadoop sequence files would simply use HadoopIO and
> >>>>> provide the
> >>>>> appropriate runtime dependencies (actually true for GS as well).
> >>>>>
> >>>>> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi
> >>>>> <rang...@google.com.invalid>
> >>>>> wrote:
> >>>>>
> >>>>>> FileInputFormat is extremely widely used; pretty much all the
> >>>>>> file-based input formats extend it. All of them call into it to
> >>>>>> list the input files and split (with some tweaks on top of that).
> >>>>>> The special API ( *FileInputFormat.setMinInputSplitSize(job,
> >>>>>> desiredBundleSizeBytes)* ) is how the split size is normally
> >>>>>> communicated. A new IO can use the API directly.
> >>>>>>
> >>>>>> HdfsIO as implemented in Beam is not HDFS-specific at all. There
> >>>>>> are no hdfs imports, and the HDFS name does not appear anywhere
> >>>>>> other than in HdfsIO's own class and method names.
> >>>>>> AvroHdfsFileSource etc. would work just as well with the new IO.
> >>>>>>
> >>>>>> On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
> >>>>>> <dhalp...@google.com.invalid> wrote:
> >>>>>>
> >>>>>>> (And I think renaming to HadoopIO doesn't make sense.
> >>>>>>> "InputFormat" is the key component of the name -- it reads things
> >>>>>>> that implement the InputFormat interface. "Hadoop" means a lot
> >>>>>>> more than that.)
> >>>>>>>
> >>>>>>>
> >>>>>> Often 'IO' in Beam implies both sources and sinks. It might not be
> >>>>>> long before we support Hadoop OutputFormat as well. In addition,
> >>>>>> HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a
> >>>>>> lot of things depending on the context. In the 'IO' context it
> >>>>>> might not be too broad. Normally it implies 'any FileSystem
> >>>>>> supported in Hadoop, e.g. S3'.
> >>>>>>
> >>>>>> Either way, I am quite confident once HadoopInputFormatIO is
> >>>>>> written, it can easily replace HdfsIO. That decision could be made
> >>>>>> later.
> >>>>>>
> >>>>>> Raghu.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
