Ram,

The aim of this concrete operator is to write incoming tuples to HDFS files.
The main use-case is: data is read from some source, processed tuple-by-tuple by some operators, and then handed to this proposed concrete operator for writing to HDFS.

As you pointed out, file copy is another common use-case, but we can work out a separate mechanism that handles the complexities explained in your post. Priyanka has already posted a proposal for an HDFS input module consisting of FileSplitter + BlockReader operators. I will post another proposal for an HDFS file copy module that would integrate seamlessly with the HDFS input module to solve the file copy use-case.

Question: Is it acceptable to have a concrete operator (the current proposal) for tuple-by-tuple writing and a separate module to take care of file copy use-cases?

~ Yogi

On 6 March 2016 at 09:45, Munagala Ramanath <[email protected]> wrote:

> Since the AbstractFileInputOperator provides a concrete implementation
> (FileLineInputOperator in the same file) it seems reasonable to have one
> for the output operator as well.
>
> Another basic and reasonable requirement is that it should be possible to
> connect the input and output operators without any further fussing and get
> a robust and high performance application for copying files from source to
> destination. There are a number of issues that crop up in doing this
> though: The input operator can read and dispatch tuples from multiple files
> in the same window; how does it tell the output operator where the file
> boundaries are? Special control tuples sent inline are one possibility;
> control tuples sent via a separate port are another. Tagging each tuple
> with the file name is a third. Each has additional aspects to consider
> such as impact on performance, time skew between multiple input ports, etc.
>
> Ram
>
> On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra <[email protected]> wrote:
>
> > Any suggestions/ comments on this?
> >
> > ~ Yogi
> >
> > On 3 March 2016 at 17:44, Yogi Devendra <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Currently, for writing to HDFS file we have AbstractFileOutputOperator in
> > > the malhar library.
> > >
> > > It has following abstract methods :
> > > 1. protected abstract String getFileName(INPUT tuple)
> > > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> > >
> > > These methods are kept generic to give flexibility to the app developers.
> > > But, someone who is new to apex; would look for ready-made implementation
> > > instead of extending Abstract implementation.
> > >
> > > Thus, I am proposing to add concrete operator HDFSOutputOperator to
> > > malhar. Aim of this operator would be to serve the purpose of ready to use
> > > operator for most frequent use-cases.
> > >
> > > Here are my key observations on most frequent use-cases:
> > > --------------------------------------------------------
> > >
> > > 1. Writing tuples of type byte[] or String.
> > > 2. All tuples on a particular stream land up in the same output file.
> > > 3. App developer may want to add some custom tuple separator (e.g. newline
> > > character) between tuples.
> > >
> > > Please mention your comments regarding :
> > > ----------------------------------------
> > >
> > > 1. Will it be useful to have such concrete operator?
> > >
> > > 2. Do you think of any other datatype other than byte[], String that
> > > should be supported out of the box by this concrete operator?
> > > Currently, I am planning to include byte[], String, any other type having
> > > valid toString() as input tuples.
> > >
> > > 3. Do you think tuple separator should be configurable?
> > >
> > > 4. Any other feedback?
> > >
> > >
> > > Proposed design:
> > > ----------------
> > >
> > > 1. This concrete implementation will be extending
> > > AbstractFileOutputOperator with default implementation for abstract methods
> > > mentioned above.
> > >
> > > 2. Filename, Tuple separator will be exposed as a operator property.
> > >
> > > 3. All incoming tuples will be written to same file mentioned in the
> > > property.
> > >
> > > 4. This operator will be added to malhar library under package
> > > com.datatorrent.lib.io.fs where AbstractFileOutputOperator resides.
> > >
> > > ~ Yogi
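
To make the proposed design quoted above a bit more concrete, here is a minimal, self-contained Java sketch. It is only an illustration under stated assumptions, not the actual Malhar code: SketchFileOutputOperator is a hypothetical stand-in that models just the two abstract methods named in the thread (getFileName and getBytesForTuple), and it writes to in-memory buffers instead of HDFS. The names SketchHdfsOutputOperator, processTuple, contents, setFileName and setTupleSeparator are likewise made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for AbstractFileOutputOperator: only the two abstract
// methods quoted in the thread are modelled, and "files" are in-memory buffers.
abstract class SketchFileOutputOperator<INPUT> {
  private final Map<String, ByteArrayOutputStream> files = new LinkedHashMap<>();

  protected abstract String getFileName(INPUT tuple);

  protected abstract byte[] getBytesForTuple(INPUT tuple);

  // Append the tuple's bytes to whichever file getFileName() selects.
  public void processTuple(INPUT tuple) {
    byte[] bytes = getBytesForTuple(tuple);
    files.computeIfAbsent(getFileName(tuple), k -> new ByteArrayOutputStream())
        .write(bytes, 0, bytes.length);
  }

  // Return what has been written to the named file so far ("" if nothing).
  public String contents(String fileName) {
    ByteArrayOutputStream out = files.get(fileName);
    return out == null ? "" : new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}

// Sketch of the proposed concrete operator: one configured file name for all
// tuples, a configurable separator, and byte[] / toString() handling.
class SketchHdfsOutputOperator extends SketchFileOutputOperator<Object> {
  private String fileName = "output.txt"; // assumed default, not from the thread
  private String tupleSeparator = "\n";   // assumed default, not from the thread

  public void setFileName(String fileName) {
    this.fileName = fileName;
  }

  public void setTupleSeparator(String tupleSeparator) {
    this.tupleSeparator = tupleSeparator;
  }

  @Override
  protected String getFileName(Object tuple) {
    return fileName; // design point 3: every tuple goes to the same file
  }

  @Override
  protected byte[] getBytesForTuple(Object tuple) {
    // byte[] tuples are written as-is; anything else goes through toString().
    byte[] payload = (tuple instanceof byte[])
        ? (byte[]) tuple
        : tuple.toString().getBytes(StandardCharsets.UTF_8);
    byte[] sep = tupleSeparator.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[payload.length + sep.length];
    System.arraycopy(payload, 0, out, 0, payload.length);
    System.arraycopy(sep, 0, out, payload.length, sep.length);
    return out;
  }
}

class HdfsOutputSketchDemo {
  public static void main(String[] args) {
    SketchHdfsOutputOperator op = new SketchHdfsOutputOperator();
    op.setFileName("events.log");
    op.processTuple("first");                                   // String tuple
    op.processTuple("second".getBytes(StandardCharsets.UTF_8)); // byte[] tuple
    op.processTuple(42);                                        // toString() tuple
    System.out.print(op.contents("events.log"));
  }
}
```

In this sketch the separator is appended after every tuple, including the last one. Whether the real operator should do that, or insert the separator only between tuples, is one of the details the proposal leaves open.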
