RE: Merge HadoopInputFormatIO and HDFSIO in a single module

Dipti Kulkarni Fri, 17 Feb 2017 09:39:37 -0800

Thank you all for your input!


-----Original Message-----
From: Dan Halperin [mailto:[email protected]] 
Sent: Friday, February 17, 2017 12:17 PM
To: [email protected]
Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <[email protected]> wrote:

> I agree with Dan on everything regarding HdfsFileSystem - it's super 
> convenient for users to use TextIO with HdfsFileSystem rather than 
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop 
> would find this name intuitive, and that's what's important.
> Even more, and joining Raghu's comment, it is also recognized as 
> "compatible with Hadoop", so for example someone who runs a Beam 
> pipeline using the Spark runner on Amazon's S3 and wants to read/write 
> Hadoop sequence files would simply use HadoopIO and provide the 
> appropriate runtime dependencies (actually true for GS as well).
>
> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <[email protected]> wrote:
>
> > FileInputFormat is extremely widely used; pretty much all the 
> > file-based input formats extend it. All of them call into it to list 
> > the input files and split them (with some tweaks on top of that). 
> > The special API 
> > ( *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ) 
> > is how the split size is normally communicated.
> > The new IO can use the API directly.
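
To make the split-size point above concrete, here is a minimal standalone sketch (not Beam or Hadoop code) of how a minimum split size feeds into the final split size. It assumes the formula in Hadoop's FileInputFormat.computeSplitSize is max(minSize, min(maxSize, blockSize)); the class name SplitSizeSketch and the 128 MiB / 256 MiB values are hypothetical, chosen only for illustration.

```java
// Standalone sketch; it deliberately has no Hadoop dependency.
final class SplitSizeSketch {

    // Assumed to mirror FileInputFormat.computeSplitSize:
    // the minimum split size wins over the block size when it is larger.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;              // hypothetical 128 MiB block
        long desiredBundleSizeBytes = 256L * 1024 * 1024; // hypothetical bundle size
        long maxSize = Long.MAX_VALUE;                    // no upper bound configured

        // Passing the desired bundle size as the minimum split size
        // raises the split size above the block size.
        long split = computeSplitSize(blockSize, desiredBundleSizeBytes, maxSize);
        System.out.println(split); // prints 268435456
    }
}
```

With a real InputFormat, the same effect would presumably come from calling FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes) before asking for the splits.
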
> >
> > HdfsIO as implemented in Beam is not HDFS specific at all. There are 
> > no hdfs imports and the HDFS name does not appear anywhere other than 
> > in HdfsIO's own class and method names. AvroHdfsFileSource etc. would 
> > work just as well with the new IO.
> >
> > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]> wrote:
> >
> > > (And I think renaming to HadoopIO doesn't make sense. 
> > > "InputFormat" is the key component of the name -- it reads things 
> > > that implement the InputFormat interface. "Hadoop" means a lot more 
> > > than that.)
> > >
> >
> > Often 'IO' in Beam implies both sources and sinks. It might not be 
> > long before we support Hadoop OutputFormat as well. In addition, 
> > HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a 
> > lot of things depending on the context. In the 'IO' context it might 
> > not be too broad.
> > Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
> >
> > Either way, I am quite confident once HadoopInputFormatIO is 
> > written, it can easily replace HdfsIO. That decision could be made later.
> >
> > Raghu.
> >
>

