Chiming in a bit late, but here's my 2 cents.

HdfsFileSystem vs Hadoop*InputFormatIO is a red herring:

* HdfsFileSystem is for file-format-specific, Beam-native parsers of files. It will make TextIO, AvroIO, etc. work for files that happen to be located at hdfs:// URIs.
* This is complementary to the ability to read things that implement InputFormat, like Cassandra, Parquet, etc. -- many of which aren't even files. Not redundant.
* Except, HdfsFileSystem would empower users to use TextIO or AvroIO directly instead of HdfsIO with a TextInputFormat reader. This would be better for users on runners that do balanced splitting, dynamic work rebalancing, etc.
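To make "balanced splitting" concrete: a size-based splitter carves a file into byte ranges of roughly equal size, which a runner can then hand to workers independently. Here's a minimal, self-contained sketch of that idea in plain Java -- the class and method names are mine for illustration, not Beam's or Hadoop's actual APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows what "size-based splitting" means for a
// file-backed source. Real readers (Beam's BoundedSource.split, Hadoop's
// FileInputFormat.getSplits) also account for block locations, compression,
// and record boundaries.
public class SizeBasedSplitter {

    /** A half-open [start, end) byte range within a file. */
    public static final class ByteRange {
        public final long start;
        public final long end;
        ByteRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits a file of totalSize bytes into ranges of at most
     * desiredBundleSize bytes each; the last range may be shorter.
     */
    public static List<ByteRange> split(long totalSize, long desiredBundleSize) {
        List<ByteRange> ranges = new ArrayList<>();
        for (long start = 0; start < totalSize; start += desiredBundleSize) {
            ranges.add(new ByteRange(start, Math.min(start + desiredBundleSize, totalSize)));
        }
        return ranges;
    }
}
```

A generic InputFormat reader can't do this, because it only sees opaque InputSplits; a FileInputFormat reader knows file sizes and offsets, which is exactly the special API access the next paragraph is about.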
HdfsIO is really a Hadoop *FileInputFormat* reader; Dipti's HadoopInputFormatIO is indeed a Hadoop *InputFormat* reader. IMO the key component of the name is "InputFormat" -- these IOs are specific to the InputFormat API and are intended to work with anything that implements it.

* Yes, these two have a lot in common, as they both operate on InputFormats.
* The FileInputFormat reader gets to call some special APIs that the generic InputFormat reader cannot -- so they are not completely redundant. Specifically, the FileInputFormat reader can do size-based splitting.

I'd recommend one of:

* Move HdfsIO and HadoopInputFormatIO into a common module (and rename HdfsIO to reflect that it's really about FileInputFormat).
* See if we can "inline" the FileInputFormat-specific parts of HdfsIO inside HadoopInputFormatIO via reflection. If so, we can get the best of both worlds with shared code.

(And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the key component of the name -- it reads things that implement the InputFormat interface. "Hadoop" means a lot more than that.)

On Wed, Feb 15, 2017 at 1:31 PM, Raghu Angadi <rang...@google.com.invalid> wrote:

> Dipti,
>
> Also how about calling it just HadoopIO?
>
> On Wed, Feb 15, 2017 at 11:13 AM, Raghu Angadi <rang...@google.com> wrote:
>
> > I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO
> > with FileInputFormat. I would pretty much move most of the code to
> > HadoopInputFormatIO (just make HdfsIO a specific instance of HIF_IO).
> >
> > On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni <
> > dipti_dkulka...@persistent.com> wrote:
> >
> >> Hello there!
> >> I am working on writing a Read IO for Hadoop InputFormat. This will
> >> enable reading from any datasource which supports Hadoop InputFormat,
> >> i.e. provides a source to read from InputFormat for integration with Hadoop.
> >> It makes sense for the HadoopInputFormatIO to share some code with the
> >> HdfsIO -- WritableCoder in particular, but also some helper classes like
> >> SerializableSplit etc. I was wondering if we could move HDFS and
> >> HadoopInputFormat into a shared module for Hadoop IO in general instead
> >> of maintaining them separately.
> >> Do let me know what you think, and please let me know if you can think
> >> of any other ideas too.
> >>
> >> Thanks,
> >> Dipti