Ram,

> Does "from some source" specifically exclude files ? If so, we should
> explicitly state this.


No. Even file based sources will be allowed. But when we are processing
tuple-by-tuple, we assume that each record from the file is a separate
entity and can therefore be processed independently.

For the file copy use-case, this is not true. We need to maintain the
original sequence from the source at the destination. Hence, each tuple
would need additional metadata, such as which file it came from, at what
offset, etc.

Thus, the proposal is to have the following 4 components:

   1. HDFS input per tuple basis
   2. HDFS input for file copy
   3. HDFS output per tuple basis
   4. HDFS output for file copy

Each of these components will have a separate email thread for its proposal.

#2 and #4 can be connected together (with other operators in between which
work on blocks) to solve the file copy use-case. The idea behind keeping
them separate is that the port signatures are different for the tuple based
and file copy use-cases.


From the end-user perspective:

   - Tuple based input/output would process one record/line from a file.
   Each record is processed independently. Here a tuple represents raw data.
   - In the case of file copy, the ports would emit file metadata and block
   metadata in addition to the block data. This makes provision for
   retaining the original sequence at the destination (see the sketch
   below).
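
To give a rough idea of why the signatures differ, here is a minimal sketch
(FileMeta and BlockMeta are placeholders for whatever types the file copy
proposals settle on; the port names are illustrative too):

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Placeholder metadata types, standing in for the real ones to be
// defined in the proposals for components #2 and #4.
class FileMeta { String filePath; long numberOfBlocks; }
class BlockMeta { String filePath; long offset; long length; }

public class FileCopyInputSketch extends BaseOperator
{
  // A tuple based source would expose just this port: raw records.
  public final transient DefaultOutputPort<byte[]> data =
      new DefaultOutputPort<byte[]>();

  // A file copy source additionally emits the metadata needed to
  // retain the original file/block sequence at the destination.
  public final transient DefaultOutputPort<FileMeta> fileMeta =
      new DefaultOutputPort<FileMeta>();
  public final transient DefaultOutputPort<BlockMeta> blockMeta =
      new DefaultOutputPort<BlockMeta>();
}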


> Consider the expected typical scenario, an upstream operator X sends tuples
> to this proposed operator Y.
> 1. How does Y know what the file name is, given a tuple (i.e.
> implementation of *getFileName()*) ?


The proposed operator Y writes all records into the same file. Basically, a
tuple does not care about which file it came from; the operator property
specifies where to write it, and all tuples go to the same output file.
(This is a simplification, because we do not want getFileName() to be
abstract. It is also valid in many use-cases.)
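
As a rough sketch of what this could look like (the class name
HDFSOutputOperator and the outputFileName property are illustrative, not a
final API):

import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

// Minimal sketch: every incoming tuple is appended to one configured file.
public class HDFSOutputOperator extends AbstractFileOutputOperator<byte[]>
{
  // Single output file, exposed as an operator property instead of
  // being derived per tuple.
  private String outputFileName;

  @Override
  protected String getFileName(byte[] tuple)
  {
    // Same file for every tuple on this stream; the tuple itself
    // carries no file information.
    return outputFileName;
  }

  @Override
  protected byte[] getBytesForTuple(byte[] tuple)
  {
    // Raw bytes are written as-is; a configurable separator could be
    // appended here.
    return tuple;
  }

  public void setOutputFileName(String outputFileName)
  {
    this.outputFileName = outputFileName;
  }

  public String getOutputFileName()
  {
    return outputFileName;
  }
}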


> 2. How does Y know when to call *requestFinalize()* for a file (multiple
> files could be in progress) ?


As discussed in 1, only one file will be in progress at a time.
The *requestFinalize()* call will happen based on time or on the size of
the output file, as discussed in earlier emails on this thread.
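
Continuing the sketch above, the size based trigger could look roughly like
this (maxLength and bytesWritten are hypothetical names, and overriding
processTuple() is just one possible hook; a sketch, not the final design):

  // Finalize the output file once it exceeds a size threshold.
  private long maxLength = 128L * 1024 * 1024; // hypothetical threshold
  private long bytesWritten;

  @Override
  protected void processTuple(byte[] tuple)
  {
    super.processTuple(tuple);
    bytesWritten += tuple.length;
    if (bytesWritten >= maxLength) {
      requestFinalize(outputFileName); // hand the finished file over
      bytesWritten = 0;
    }
  }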


> 3. Is it partitionable ? The base class is not for some reason though the
> file input operator is.

Since the base class is not partitionable, this operator Y will not be
partitionable either.


> 4. The directory where files are written is a fixed property in the base
> class annotated with *@NotNull*; what
>     if this path is not known upfront but is dynamically constructed on a
> per-file basis.
>     How does X send this info to Y ?


Since there is only a single file, there is no concept of dynamically
constructing the file name.

> When looking at files, the simplest example a user will think of is file
> copy, so I think we should make
> that work, and work well. To do that, the file input operator may also need
> to be carefully examined
> and changed suitably if necessary.
> I guess addressing it in a module is certainly an option but having file
> input and output operators
> with elaborate features, class hierarchies, and tutorials but where the
> simplest possible use case
> is not easy is doing users a disservice.


Yes. File copy is the simplest example for file sources and destinations.
The aim is to make this file copy easy for the end user. The answer to that
lies in the proposal above to have dedicated components for file copy
(components #2 and #4).

This email thread is for discussion of component #3, i.e. HDFS output on a
per tuple basis.
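
For completeness, here is a hedged sketch of how the proposed operator
might be wired into an application, reusing the illustrative classes from
the sketches above (all names remain hypothetical; setFilePath is the
assumed setter for the base-class directory property):

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import org.apache.hadoop.conf.Configuration;

public class TupleToHdfsApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Upstream operator producing byte[] tuples (sketch from above).
    FileCopyInputSketch reader =
        dag.addOperator("reader", new FileCopyInputSketch());
    HDFSOutputOperator writer =
        dag.addOperator("writer", new HDFSOutputOperator());
    writer.setFilePath("/user/out");         // base-class directory property
    writer.setOutputFileName("records.txt"); // property from the sketch above
    dag.addStream("records", reader.data, writer.input);
  }
}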

~ Yogi

On 6 March 2016 at 22:23, Munagala Ramanath <[email protected]> wrote:

> Yogi, I think I understand the intent. However, in:
>
> "Main use-case being : data is read from some source, processed
> tuple-by-tuple by some operators and then given to this proposed concrete
> operator for writing to HDFS."
>
> Does "from some source" specifically exclude files ? If so, we should
> explicitly state this.
> In my view, we should make the operator as flexible as reasonably possible
> without limiting
> it to particular "use cases".
>
> Consider the expected typical scenario, an upstream operator X sends tuples
> to this proposed operator Y.
> 1. How does Y know what the file name is, given a tuple (i.e.
> implementation of *getFileName()*) ?
> 2. How does Y know when to call *requestFinalize()* for a file (multiple
> files could be in progress) ?
> 3. Is it partitionable ? The base class is not for some reason though the
> file input operator is.
> 4. The directory where files are written is a fixed property in the base
> class annotated with *@NotNull*; what
>     if this path is not known upfront but is dynamically constructed on a
> per-file basis.
>     How does X send this info to Y ?
>
> When looking at files, the simplest example a user will think of is file
> copy, so I think we should make
> that work, and work well. To do that, the file input operator may also need
> to be carefully examined
> and changed suitably if necessary.
>
> I guess addressing it in a module is certainly an option but having file
> input and output operators
> with elaborate features, class hierarchies, and tutorials but where the
> simplest possible use case
> is not easy is doing users a disservice.
>
> Ram
>
>
> On Sun, Mar 6, 2016 at 12:29 AM, Yogi Devendra <[email protected]> wrote:
>
> > Ram,
> >
> > The aim of this concrete operator is to write incoming tuples to HDFS files.
> >
> > Main use-case being : data is read from some source, processed
> > tuple-by-tuple by some operators and then given to this proposed concrete
> > operator for writing to HDFS.
> >
> > As you pointed out, file operation is another common use-case; but we can
> > work out a separate mechanism which handles the complexities explained in
> > your post.
> > Priyanka has already posted a proposal for an HDFS input module having
> > FileSplitter + BlockReader operator.
> > I will post another proposal for HDFS file copy module which would
> > seamlessly integrate with HDFS input module to solve file copy use-case.
> >
> > Question:
> > Is it acceptable if we have a concrete operator (current proposal) for
> > tuple-by-tuple writing and a separate module to take care of file copy
> > use-cases?
> >
> > ~ Yogi
> >
> > On 6 March 2016 at 09:45, Munagala Ramanath <[email protected]> wrote:
> >
> > > Since the AbstractFileInputOperator provides a concrete implementation
> > > (FileLineInputOperator in the same file)
> > > it seems reasonable to have one for the output operator as well.
> > >
> > > Another basic and reasonable requirement is that it should be possible
> > > to connect the input and output operators
> > > without any further fussing and get a robust and high performance
> > > application for copying files from source to
> > > destination. There are a number of issues that crop up in doing this
> > > though: The input operator can read and
> > > dispatch tuples from multiple files in the same window; how does it
> > > tell the output operator where the file
> > > boundaries are ? Special control tuples sent inline are one
> > > possibility; control tuples sent via a separate port
> > > are another. Tagging each tuple with the file name is a third. Each has
> > > additional aspects to consider
> > > such as impact on performance, time skew between multiple input ports,
> > > etc.
> > >
> > > Ram
> > >
> > > On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra <[email protected]> wrote:
> > >
> > > > Any suggestions/ comments on this?
> > > >
> > > > ~ Yogi
> > > >
> > > > On 3 March 2016 at 17:44, Yogi Devendra <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Currently, for writing to HDFS files we have
> > > > > AbstractFileOutputOperator in the malhar library.
> > > > >
> > > > > It has the following abstract methods:
> > > > > 1. protected abstract String getFileName(INPUT tuple)
> > > > > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> > > > >
> > > > > These methods are kept generic to give flexibility to the app
> > > > > developers.
> > > > > But someone who is new to Apex would look for a ready-made
> > > > > implementation instead of extending the abstract implementation.
> > > > >
> > > > > Thus, I am proposing to add a concrete operator HDFSOutputOperator
> > > > > to malhar. The aim of this operator would be to serve as a
> > > > > ready-to-use operator for the most frequent use-cases.
> > > > >
> > > > > Here are my key observations on most frequent use-cases:
> > > > > ------------------------------------------------------------------------------
> > > > >
> > > > > 1. Writing tuples of type byte[] or String.
> > > > > 2. All tuples on a particular stream land up in the same output
> > > > > file.
> > > > > 3. App developer may want to add some custom tuple separator (e.g.
> > > > > newline character) between tuples.
> > > > >
> > > > > Please share your comments regarding:
> > > > > --------------------------------------------------------
> > > > >
> > > > > 1. Will it be useful to have such a concrete operator?
> > > > >
> > > > > 2. Do you think of any other datatype, other than byte[] and
> > > > > String, that should be supported out of the box by this concrete
> > > > > operator? Currently, I am planning to include byte[], String, and
> > > > > any other type having a valid toString() as input tuples.
> > > > >
> > > > > 3. Do you think tuple separator should be configurable?
> > > > >
> > > > > 4. Any other feedback?
> > > > >
> > > > >
> > > > > Proposed design:
> > > > > ----------------------
> > > > >
> > > > > 1. This concrete implementation will extend
> > > > > AbstractFileOutputOperator with default implementations for the
> > > > > abstract methods mentioned above.
> > > > >
> > > > > 2. Filename and tuple separator will be exposed as operator
> > > > > properties.
> > > > >
> > > > > 3. All incoming tuples will be written to the same file mentioned
> > > > > in the property.
> > > > >
> > > > > 4. This operator will be added to the malhar library under package
> > > > > com.datatorrent.lib.io.fs where AbstractFileOutputOperator resides.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > >
> > >
> >
>
