Chiming in a bit late, but here's my 2 cents.

HdfsFileSystem vs Hadoop*InputFormatIO is a red herring:
  * HdfsFileSystem is for file-format-specific, Beam-native, parsers of
files. It will make TextIO, AvroIO, etc., work for files that happen to be
located at hdfs:// URIs.
  * This is complementary to the ability to read things that implement
InputFormat, like Cassandra, Parquet, etc -- many of which aren't even
files. Not redundant.
  * That said, HdfsFileSystem would let users use TextIO or AvroIO
directly instead of HdfsIO with a TextInputFormat reader, which is better
for users on runners that do balanced splitting, dynamic work rebalancing,
etc. (see the sketch after this list).
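
To make that last point concrete, here is a minimal sketch. It assumes
HdfsFileSystem ends up registered for the hdfs:// scheme; the path, the
class name, and the exact TextIO method names are placeholders and may
differ by Beam version:

  // Minimal sketch: assumes HdfsFileSystem is registered for hdfs://
  // URIs; the path and class name are made up for illustration.
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class HdfsTextExample {
    public static void main(String[] args) {
      Pipeline p =
          Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Plain TextIO on a file that merely happens to live on HDFS.
      // Runners that do balanced splitting / dynamic work rebalancing
      // can exploit it here; an InputFormat wrapper can't offer that.
      p.apply(TextIO.read().from("hdfs://namenode:8020/logs/*.txt"));

      p.run().waitUntilFinish();
    }
  }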

HdfsIO is really a Hadoop *FileInputFormat* reader; Dipti's
HadoopInputFormatIO is indeed a Hadoop *InputFormat* reader. IMO the key
component of the name is "InputFormat" -- these IOs are specific to the
InputFormat API and are intended to work with anything that implements it.
  * Yes, these two have a lot in common as they both operate on
InputFormats.
  * The FileInputFormat reader gets to call some special APIs that the
generic InputFormat reader cannot -- so they are not completely redundant.
Specifically, the FileInputFormat reader can do size-based splitting (see
the sketch after this list).
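
To spell out the size-based splitting point, here's a rough Hadoop-side
sketch; the input path and desired bundle size are placeholders:

  // Sketch of the knob that only FileInputFormat exposes; values are
  // placeholders.
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
      long desiredBundleSizeBytes = 64L * 1024 * 1024;
      Job job = Job.getInstance(new Configuration());
      FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/logs"));

      // FileInputFormat-only: cap split sizes so they line up with the
      // bundle size a runner asks for.
      FileInputFormat.setMaxInputSplitSize(job, desiredBundleSizeBytes);

      List<InputSplit> splits = new TextInputFormat().getSplits(job);
      System.out.println("splits: " + splits.size());

      // A generic InputFormat (Cassandra, HBase, ...) only has getSplits()
      // with no size hint, so the reader takes whatever splits it gets.
    }
  }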

I'd recommend one of:
   * Move HdfsIO and HadoopInputFormatIO into a common module (and rename
HdfsIO to reflect that it's really about FileInputFormat).
   * See if we can "inline" the FileInputFormat-specific parts of HdfsIO
inside HadoopInputFormatIO via reflection (rough sketch after this list).
If so, we can get the best of both worlds with shared code.
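
For that second option, something like the fragment below inside
HadoopInputFormatIO's splitting path might be enough. The method and
parameter names are stand-ins, not the IO's real fields, and a plain
instanceof check may make reflection unnecessary if FileInputFormat is
already on the classpath:

  // Hypothetical helper inside HadoopInputFormatIO -- names are stand-ins.
  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.mapreduce.InputFormat;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  class SplitHelper {
    static List<InputSplit> computeSplits(
        InputFormat<?, ?> inputFormat, Job job, long desiredBundleSizeBytes)
        throws IOException, InterruptedException {
      if (inputFormat instanceof FileInputFormat) {
        // Reuse the size-based splitting HdfsIO gets today, without a
        // separate FileInputFormat-only transform.
        FileInputFormat.setMaxInputSplitSize(job, desiredBundleSizeBytes);
      }
      return inputFormat.getSplits(job);
    }
  }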

(And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the
key component of the name -- it reads things that implement the InputFormat
interface. "Hadoop" means a lot more than that.)

On Wed, Feb 15, 2017 at 1:31 PM, Raghu Angadi <rang...@google.com.invalid>
wrote:

> Dipti,
>
> Also how about calling it just HadoopIO?
>
> On Wed, Feb 15, 2017 at 11:13 AM, Raghu Angadi <rang...@google.com> wrote:
>
> > I skimmed through HdfsIO and I think it is essentially
> > HadoopInputFormatIO with FileInputFormat. I would pretty much move most
> > of the code to HadoopInputFormatIO (just make HdfsIO a specific
> > instance of HIF_IO).
> >
> > On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni <
> > dipti_dkulka...@persistent.com> wrote:
> >
> >> Hello there!
> >> I am working on writing a Read IO for Hadoop InputFormat. This will
> >> enable reading from any data source which supports Hadoop InputFormat,
> >> i.e. it provides a source that reads from an InputFormat for
> >> integration with Hadoop.
> >> It makes sense for the HadoopInputFormatIO to share some code with the
> >> HdfsIO - WritableCoder in particular, but also some helper classes like
> >> SerializableSplit etc. I was wondering if we could move HDFS and
> >> HadoopInputFormat into a shared module for Hadoop IO in general instead
> >> of maintaining them separately.
> >> Do let me know what you think, and please let me know if you can think
> >> of any other ideas too.
> >>
> >> Thanks,
> >> Dipti
> >>
> >
>
