Hi Stephen,

I agree with using the following structure (it's basically what I proposed in a comment on the PR):

io/hadoop
io/hadoop-common
io/hbase

I would be more than happy to help with the "merge" of HdfsIO and HadoopFormat.

Regards
JB

On 03/01/2017 08:00 PM, Stephen Sisk wrote:
I wanted to follow up on this thread since I see some potentially blocking
questions arising, and I'm trying to help Dipti along with her PR.

Dipti's PR[1] is currently written to put files into:
io/hadoop/inputformat

The recent changes to create hadoop-common created:
io/hadoop-common

This means that if we take the HIFIO PR as-is, the overall structure would be:
io/hadoop/inputformat - the HIFIO (copies some code from hadoop-common and
hdfs, but has no dependency on hadoop-common)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms
io/hdfs - FileInputFormat IO transforms - shares much code with
hadoop/inputformat

I don't think that's great, because there's a common dir that only some
directories use, and there's lots of similar-but-slightly-different code
in hadoop/inputformat and hdfs. I don't believe anyone intends this to be
the final result.

After looking at the comments in this thread, I'd like to recommend the
following end result: (#1)
io/hadoop - the HIFIO (depends on hadoop-common) - contains both
HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so the contents of
hdfs and hadoop/inputformat)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms

To get there I propose the following steps:
1. finish the current PR [1], only renaming the containing module from
hadoop/inputformat -> hadoop and taking a dependency on hadoop-common
2. someone does cleanup to reconcile the hdfs and hadoop directories,
including renaming the files so they make sense

I would also be fine with: (#2)
io/hadoop - container dir only
io/hadoop/common
io/hadoop/hbase
io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves
to be top level.

Other comments:
It should be noted that once all modules use hadoop-common, we'll be forcing
all hadoop modules to share the same dependencies on hadoop. I think this
makes sense, but it's worth noting, since independent dependencies are the one
advantage of the "every hadoop IO transform has its own hadoop dependency"
approach.

On the naming discussion: I personally prefer "inputformat" as the name of
the directory, but I defer to the folks who know the Hadoop community better.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087


On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:

Thank you all for your inputs!


-----Original Message-----
From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
Sent: Friday, February 17, 2017 12:17 PM
To: dev@beam.apache.org
Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:

I agree with Dan on everything regarding HdfsFileSystem - it's super
convenient for users to use TextIO with HdfsFileSystem rather than
replacing the IO and also specifying the InputFormat type.
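
To make that concrete, reading HDFS-hosted text through plain TextIO might look roughly like this. This is a sketch only: the namenode address and paths are made up, and the exact TextIO/filesystem-registration API depends on the Beam version.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class HdfsTextReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // With a Hadoop filesystem registered, TextIO treats hdfs:// like any
    // other path -- no InputFormat type or dedicated HdfsIO transform needed.
    PCollection<String> lines =
        p.apply(TextIO.read().from("hdfs://namenode:8020/logs/*.txt"));

    p.run().waitUntilFinish();
  }
}
```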

I disagree on "HadoopIO" - I think that people who work with Hadoop
would find this name intuitive, and that's what's important.
Even more, joining Raghu's comment, it is also recognized as
"compatible with Hadoop"; so, for example, someone running a Beam
pipeline using the Spark runner on Amazon's S3 who wants to read/write
Hadoop sequence files would simply use HadoopIO and provide the
appropriate runtime dependencies (actually true for GS as well).
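
A sketch of that sequence-file scenario: the class and method names follow the HIFIO PR under discussion, and the s3a path and key/value types are hypothetical, so treat the exact Beam API as an assumption.

```java
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// ...inside pipeline construction:
Configuration conf = new Configuration();
// Point the InputFormat at S3 via the s3a filesystem (a runtime dependency).
conf.set("mapreduce.input.fileinputformat.inputdir", "s3a://my-bucket/seqfiles/");
conf.setClass("mapreduce.job.inputformat.class",
    SequenceFileInputFormat.class, InputFormat.class);
conf.setClass("key.class", Text.class, Object.class);
conf.setClass("value.class", LongWritable.class, Object.class);

PCollection<KV<Text, LongWritable>> records =
    p.apply(HadoopInputFormatIO.<Text, LongWritable>read()
        .withConfiguration(conf));
```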

On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi
<rang...@google.com.invalid>
wrote:

FileInputFormat is extremely widely used; pretty much all file-based
input formats extend it. All of them call into it to list the input
files and split them (with some tweaks on top of that). The special
API ( *FileInputFormat.setMinInputSplitSize(job,
desiredBundleSizeBytes)* ) is how the split size is normally
communicated. A new IO can use the API directly.
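
For reference, setMinInputSplitSize is the standard mapreduce-API call; a source wrapping a FileInputFormat might pass the desired bundle size through like this (illustrative Job setup with a made-up path; the fragment assumes it runs where Job creation is permitted and IOException/InterruptedException are handled):

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

Job job = Job.getInstance(new Configuration());
FileInputFormat.addInputPath(job, new Path("hdfs://namenode/data"));

// Tell the InputFormat not to produce splits smaller than the desired bundle.
long desiredBundleSizeBytes = 64L * 1024 * 1024; // e.g. 64 MB
FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);

// getSplits() then honors the minimum when computing splits.
List<InputSplit> splits = new TextInputFormat().getSplits(job);
```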

HdfsIO as implemented in Beam is not HDFS-specific at all. There are
no hdfs imports, and the HDFS name does not appear anywhere other than
in HdfsIO's own class and method names. AvroHdfsFileSource etc. would
work just as well with the new IO.

On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
<dhalp...@google.com.invalid> wrote:

(And I think renaming to HadoopIO doesn't make sense. "InputFormat" is
the key component of the name -- it reads things that implement the
InputFormat interface. "Hadoop" means a lot more than that.)


Often 'IO' in Beam implies both sources and sinks, and it might not be
long before we support Hadoop OutputFormat as well. In addition,
HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a lot
of things depending on the context, but in the 'IO' context it might
not be too broad; normally it implies 'any FileSystem supported in
Hadoop, e.g. S3'.

Either way, I am quite confident that once HadoopInputFormatIO is
written, it can easily replace HdfsIO. That decision can be made
later.

Raghu.







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
