Hello, I am answering since I have been leading the refactor into hadoop-common. My criterion for moving a class into hadoop-common is that it is used by more than one other module or IO; this is why the module is not big, but it can grow if needed.
+1 for option #1 because of the visibility reasons you mention. For the concrete PR I have the following remarks:

From looking at the PR, I think you can already do the basic refactors that depend on hadoop-common today (to avoid adding repeated code):
- Remove NullWritableCoder from hadoop/inputformat and refactor to use WritableCoder from hadoop-common.
- Remove WritableCoder from hadoop/inputformat and refactor to use WritableCoder from hadoop-common.

I have other comments, but since those are not directly related to the refactoring I will address them in the PR.

Thanks for bringing this issue back to the mailing list, Stephen.

Ismaël

On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> By the way Stephen, when BEAM-59 is done, hadoop IO will only contain
> the hadoop format support (no HdfsFileSource or HdfsSink required, as
> it will use the "regular" FileIO).
>
> Agree?
>
> Regards
> JB
>
> On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
>> Hi Stephen,
>>
>> I agree to use the following structure (it's basically what I
>> proposed in a comment on the PR):
>>
>> io/hadoop
>> io/hadoop-common
>> io/hbase
>>
>> I would be more than happy to help on the "merge" of HdfsIO and
>> HadoopFormat.
>>
>> Regards
>> JB
>>
>> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
>>> I wanted to follow up on this thread since I see some potential
>>> blocking questions arising, and I'm trying to help Dipti along with
>>> her PR.
>>>
>>> Dipti's PR [1] is currently written to put files into:
>>> io/hadoop/inputformat
>>>
>>> The recent changes to create hadoop-common created:
>>> io/hadoop-common
>>>
>>> This means that the overall structure, if we take the HIFIO PR
>>> as-is, would be:
>>> io/hadoop/inputformat - the HIFIO (copies of some code in
>>> hadoop-common and hdfs, but no dependency on hadoop-common)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>> io/hdfs - FileInputFormat IO transforms - much shared code with
>>> hadoop/inputformat
>>>
>>> I don't think that's great, because there's a common dir that only
>>> some directories use, and there's lots of similar-but-slightly-different
>>> code in hadoop/inputformat and hdfsio. I don't believe anyone intends
>>> this to be the final result.
>>>
>>> After looking at the comments in this thread, I'd like to recommend
>>> the following end result (#1):
>>> io/hadoop - the HIFIO (dependency on hadoop-common) - contains both
>>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so the
>>> contents of hdfs and hadoop/inputformat)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>>
>>> To get there I propose the following steps:
>>> 1. finish the current PR [1], only renaming the containing module
>>> from hadoop/inputformat to hadoop and taking a dependency on
>>> hadoop-common
>>> 2. someone does cleanup to reconcile the hdfs and hadoop
>>> directories, including renaming the files so they make sense
>>>
>>> I would also be fine with (#2):
>>> io/hadoop - container dir only
>>> io/hadoop/common
>>> io/hadoop/hbase
>>> io/hadoop/inputformat
>>>
>>> I think the downside of #2 is that it hides hbase, which I think
>>> deserves to be top level.
>>>
>>> Other comments:
>>> It should be noted that when we have all modules use hadoop-common,
>>> we'll be forcing all hadoop modules to have the same hadoop
>>> dependencies. I think this makes sense, but it's worth noting, since
>>> independent versioning was the one advantage of the "every hadoop IO
>>> transform has its own hadoop dependency" approach.
>>>
>>> On the naming discussion: I personally prefer "inputformat" as the
>>> name of the directory, but I defer to the folks who know the hadoop
>>> community better.
>>>
>>> S
>>>
>>> [1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
>>> [2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
>>> dipti_dkulka...@persistent.com> wrote:
>>>
>>>> Thank you all for your inputs!
>>>>
>>>> -----Original Message-----
>>>> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
>>>> Sent: Friday, February 17, 2017 12:17 PM
>>>> To: dev@beam.apache.org
>>>> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
>>>>
>>>> Raghu, Amit -- +1 to your expertise :)
>>>>
>>>> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com>
>>>> wrote:
>>>>
>>>>> I agree with Dan on everything regarding HdfsFileSystem - it's super
>>>>> convenient for users to use TextIO with HdfsFileSystem rather than
>>>>> replacing the IO and also specifying the InputFormat type.
>>>>>
>>>>> I disagree on "HadoopIO" - I think that people who work with Hadoop
>>>>> would find this name intuitive, and that's what's important.
>>>>> Even more, and joining Raghu's comment, it is also recognized as
>>>>> "compatible with Hadoop", so for example someone running a Beam
>>>>> pipeline using the Spark runner on Amazon's S3 who wants to
>>>>> read/write Hadoop sequence files would simply use HadoopIO and
>>>>> provide the appropriate runtime dependencies (actually true for GS
>>>>> as well).
>>>>>
>>>>> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi
>>>>> <rang...@google.com.invalid> wrote:
>>>>>
>>>>>> FileInputFormat is extremely widely used; pretty much all the
>>>>>> file-based input formats extend it. All of them call into it to
>>>>>> list the input files and compute splits (with some tweaks on top
>>>>>> of that). The special API
>>>>>> ( *FileInputFormat.setMinInputSplitSize(job,
>>>>>> desiredBundleSizeBytes)* ) is how the split size is normally
>>>>>> communicated. New IO can use the API directly.
>>>>>>
>>>>>> HdfsIO as implemented in Beam is not HDFS-specific at all. There
>>>>>> are no hdfs imports, and the HDFS name does not appear anywhere
>>>>>> other than in HdfsIO's own class and method names.
>>>>>> AvroHdfsFileSource etc. would work just as well with the new IO.
>>>>>>
>>>>>> On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
>>>>>> <dhalp...@google.com.invalid> wrote:
>>>>>>
>>>>>>> (And I think renaming to HadoopIO doesn't make sense.
>>>>>>> "InputFormat" is the key component of the name -- it reads things
>>>>>>> that implement the InputFormat interface. "Hadoop" means a lot
>>>>>>> more than that.)
>>>>>>
>>>>>> Often 'IO' in Beam implies both sources and sinks. It might not be
>>>>>> long before we support Hadoop OutputFormat as well. In addition,
>>>>>> HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a
>>>>>> lot of things depending on the context; in the 'IO' context it
>>>>>> might not be too broad. Normally it implies 'any FileSystem
>>>>>> supported in Hadoop, e.g. S3'.
>>>>>>
>>>>>> Either way, I am quite confident that once HadoopInputFormatIO is
>>>>>> written, it can easily replace HdfsIO. That decision could be made
>>>>>> later.
>>>>>>
>>>>>> Raghu.
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
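The split-size API Raghu mentions upthread can be sketched as follows. This is an illustrative fragment, not code from the thread: it assumes the stock org.apache.hadoop.mapreduce classes, and the class name, input path, and bundle size are made up for the example.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // In a Beam source, the desired bundle size would be handed to the
    // source by the runner; 64 MB here is a made-up value.
    long desiredBundleSizeBytes = 64L * 1024 * 1024;

    // The "special API" from the thread: FileInputFormat (and therefore
    // the many file-based InputFormats that extend it) reads this
    // setting when computing input splits.
    FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);

    // Illustrative input path and format; any FileInputFormat subclass
    // would pick up the configured split size the same way.
    FileInputFormat.setInputPaths(job, "/path/to/input");
    job.setInputFormatClass(TextInputFormat.class);
  }
}
```

Because the setting travels through the job Configuration, a connector can communicate a bundle size to any file-based InputFormat without knowing its concrete type.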