+1. Will certainly be a good addition.

On Fri, Mar 25, 2016 at 9:31 AM, Chandni Singh <[email protected]>
wrote:

> +1 for the idea
> On Mar 24, 2016 8:41 PM, "Thomas Weise" <[email protected]> wrote:
>
> > +1 for the idea in general and extending existing implementation.
> >
> > In case this introduces a MapReduce dependency we will also need to
> > consider a separate module.
> >
> > Thomas
> >
> >
> > On Thu, Mar 24, 2016 at 2:35 AM, Devendra Tagare <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > We are thinking of extending the FileSplitter and BlockReader;
> > > changing the existing code could have side effects.
> > >
> > > Thanks,
> > > Dev
> > > On Mar 24, 2016 1:16 AM, "Tushar Gosavi" <[email protected]>
> > > wrote:
> > >
> > > > My suggestion is to extend from FileSplitter and BlockReader without
> > > > changing them, and add support for InputFormat in derived classes.
> > > > FileSplitter and BlockReader already provide enough hooks to define
> > > > splits and read records.
> > > >
> > > > - Tushar.
> > > >
> > > >
> > > > On Thu, Mar 24, 2016 at 11:17 AM, Yogi Devendra <[email protected]>
> > > > wrote:
> > > >
> > > > > Aligning FileSplitter and BlockReader with their respective
> > > > > counterparts from mapreduce will be an excellent value addition.
> > > > >
> > > > > IMO, it has 2 advantages:
> > > > >
> > > > > 1. It will allow us to plug in more formats for the
> > > > > FileSplitter+BlockReader pattern use cases.
> > > > > 2. It will be easy for end-users coming from a mapreduce background
> > > > > if they get something equivalent in Apex.
> > > > >
> > > > > One question:
> > > > > Are you planning to refactor the existing FileSplitter and
> > > > > BlockReader, or is the plan to have this implementation as fresh
> > > > > classes? If these are fresh classes, are we saying that they will
> > > > > eventually deprecate the existing FileSplitter and BlockReader?
> > > > >
> > > > > We have a few other components dependent on the existing
> > > > > FileSplitter and BlockReader. Hence, we would like to know the
> > > > > future direction for these classes.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > On 24 March 2016 at 10:47, Priyanka Gugale <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > So as I understand, the splitter would be format aware. In that
> > > > > > case, would we still need the different kinds of parsers we have
> > > > > > right now? Or will the format-aware splitter take care of parsing
> > > > > > different file formats, e.g. csv?
> > > > > >
> > > > > > -Priyanka
> > > > > >
> > > > > > On Wed, Mar 23, 2016 at 11:41 PM, Devendra Tagare <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Initiating this thread to get the community's opinion on
> > > > > > > aligning the FileSplitter with InputSplit & the BlockReader
> > > > > > > with the RecordReader from org.apache.hadoop.mapreduce.InputSplit
> > > > > > > & org.apache.hadoop.mapreduce.RecordReader respectively.
> > > > > > >
> > > > > > > Some more details and rationale on the approach:
> > > > > > >
> > > > > > > InputFormat lets MR create input splits, i.e. individual chunks
> > > > > > > of bytes. The ability to correctly create these splits is
> > > > > > > determined by the InputFormat itself, e.g. the SequenceFile
> > > > > > > format or Avro.
> > > > > > >
> > > > > > > Internally these formats are organized as a sequence of blocks.
> > > > > > > Each block can be compressed with a compression codec, and it
> > > > > > > does not matter if this codec in itself is splittable. When one
> > > > > > > of them is set as the input format, the MR framework creates
> > > > > > > input splits based on the block boundaries given by the metadata
> > > > > > > object packed with the file.
> > > > > > >
> > > > > > > Each InputFormat has a specific block definition, e.g. for Avro
> > > > > > > the block definition is as below.
> > > > > > >
> > > > > > > An Avro file data block consists of:
> > > > > > >
> > > > > > > - A long indicating the count of objects in this block.
> > > > > > > - A long indicating the size in bytes of the serialized objects
> > > > > > > in the current block, after any codec is applied.
> > > > > > > - The serialized objects. If a codec is specified, this is
> > > > > > > compressed by that codec.
> > > > > > > - The file's 16-byte sync marker.
> > > > > > > Thus, each block's binary data can be efficiently extracted or
> > > > > > > skipped without deserializing the contents. The combination of
> > > > > > > block size, object counts, and sync markers enables detection of
> > > > > > > corrupt blocks and helps ensure data integrity.
> > > > > > >
> > > > > > > Each map task gets an entire block to read. A RecordReader is
> > > > > > > used to read the individual records of the block and generate
> > > > > > > key/value pairs. The records could be fixed length or use a
> > > > > > > schema, as in the case of Parquet or Avro.
> > > > > > >
> > > > > > > We can extend the BlockReader to work with a RecordReader,
> > > > > > > using the sync markers to correctly identify & parse the
> > > > > > > individual records.
> > > > > > >
> > > > > > > Please send across your thoughts on the same.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dev
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
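For anyone new to the sync-marker mechanism Dev describes: the idea is that a
reader can find block boundaries by scanning raw bytes for the file's 16-byte
sync marker, without deserializing block contents. A minimal illustrative
sketch in Python (assumed names; this is not the actual Avro or Apex code):

```python
# Illustrative sketch (hypothetical helper, not the actual Avro/Apex
# implementation): given the raw bytes of a container file and its 16-byte
# sync marker, locate the offsets at which new blocks start, without
# deserializing any block contents.
def find_block_boundaries(data: bytes, sync_marker: bytes) -> list:
    """Return the offsets just past each occurrence of the sync marker."""
    assert len(sync_marker) == 16, "Avro sync markers are 16 bytes"
    boundaries = []
    pos = data.find(sync_marker)
    while pos != -1:
        boundaries.append(pos + len(sync_marker))
        pos = data.find(sync_marker, pos + 1)
    return boundaries

# Hypothetical stream: two blocks, each terminated by the sync marker.
sync = bytes(range(16))
stream = b"first-block-data" + sync + b"second-block-data" + sync
print(find_block_boundaries(stream, sync))  # -> [32, 65]
```

In the real Avro container format each block also carries the object count and
serialized size ahead of the data, so a reader can either skip by size or
re-synchronize by scanning for the marker as above.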
