Re: HDFS File Reader Module

Priyanka Gugale Wed, 02 Mar 2016 03:26:20 -0800

I am planning to put this module in the malhar-library project, in
package com.datatorrent.lib.io.fs.
Let me know if this is acceptable.


-Priyanka

On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <[email protected]>
wrote:

> I haven't created a branch yet; I will share it with you as soon as I
> add the code for the module.
> I would surely be happy to help :)
>
> -Priyanka
>
> On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <[email protected]>
> wrote:
>
>> Priyanka,
>>
>> Thanks for the update. I will consider these ports during the design phase
>> of my proposal for the HDFS file copy module.
>>
>> I believe you are planning to add this to Apex Malhar. Please post a link
>> to a private branch (if any) where I can have a look at the first cut.
>>
>> I will ask for your help if I come across any questions, uncertainties,
>> etc.
>>
>> ~ Yogi
>>
>> On 23 February 2016 at 17:59, Priyanka Gugale <[email protected]>
>> wrote:
>>
>> > I am planning to have the following ports on this module:
>> >
>> > Ports
>> > Input port: None
>> >
>> > Output ports:
>> >
>> >    1. FileMetadata
>> >    2. BlockMetadata
>> >    3. Block bytes
>> >
>> > -Priyanka
>> >
>> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <[email protected]>
>> > wrote:
>> >
>> > > Priyanka,
>> > >
>> > > Can you please share details about what the output ports from this
>> > > module would be?
>> > >
>> > > I am thinking of an HDFS File Copy Module which can be used in
>> > > conjunction with this module to copy files from HDFS to HDFS.
>> > >
>> > > ~ Yogi
>> > >
>> > > On 18 February 2016 at 10:29, Mohit Jotwani <[email protected]>
>> > > wrote:
>> > >
>> > > > +1 to add this.
>> > > >
>> > > > Regards,
>> > > > Mohit
>> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > +1 to add this module
>> > > > >
>> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > We need partitions for parallel read, but how will a reader
>> > > > > > partition know which offset of the file it should read from?
>> > > > > > Normally FileSplitter creates this metadata (let's call each
>> > > > > > piece a reader task) and forwards it to the next operator, which
>> > > > > > is the block reader. The block reader receives one of the tasks
>> > > > > > and reads from the specified offset in the file. If FileSplitter
>> > > > > > is absent, one reader partition has to consume one file entirely,
>> > > > > > which means we can't have parallel reading over one file. I hope
>> > > > > > this answers your question.
>> > > > > >
>> > > > > > The advantage of this module is a reusable component made up of
>> > > > > > operators which are frequently used together to do file reading.
>> > > > > >
>> > > > > > -Priyanka
>> > > > > >
>> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Let me rephrase Ram's question to make it clear:
>> > > > > > >
>> > > > > > > For an application developer using Malhar:
>> > > > > > > What are the advantages / disadvantages of using the proposed
>> > > > > > > HDFS File Input Module as compared to directly using the
>> > > > > > > FileSplitter and BlockReader operators available in Malhar?
>> > > > > > >
>> > > > > > > ~ Yogi
>> > > > > > >
>> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath <[email protected]>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Can parallel read not be achieved by partitioning?
>> > > > > > > >
>> > > > > > > > Ram
>> > > > > > > >
>> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale <[email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > It is a common use case to read big files on HDFS in
>> > > > > > > > > parallel, i.e. many reader threads are used to read the
>> > > > > > > > > file at the same time. We can achieve this on top of Apex
>> > > > > > > > > using the following Malhar operators:
>> > > > > > > > >
>> > > > > > > > > 1. AbstractFileSplitter
>> > > > > > > > > 2. AbstractBlockReader
>> > > > > > > > >
>> > > > > > > > > where FileSplitter, as per the file metadata, creates small
>> > > > > > > > > reader tasks (to read the file in parts). Those reader
>> > > > > > > > > tasks are run by BlockReaders in parallel to read the file.
>> > > > > > > > >
>> > > > > > > > > As these operators are generally used together to achieve
>> > > > > > > > > the file read operation, I propose we create a module,
>> > > > > > > > > called HDFSFileReader, for this.
>> > > > > > > > >
>> > > > > > > > > Please provide your suggestions on the same.
>> > > > > > > > >
>> > > > > > > > > -Priyanka
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
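
For readers following the thread, the splitting step described above (a
splitter turning one file into offset-based reader tasks that block readers
can consume in parallel) can be sketched roughly as below. This is only an
illustration in plain Java, not the Malhar implementation: the class
ReaderTaskSketch, the nested ReaderTask type and the split method are all
hypothetical names, and the fixed block size is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from Apex Malhar.
public class ReaderTaskSketch {

    /** Hypothetical reader task: which part of which file one reader handles. */
    static final class ReaderTask {
        final String filePath;
        final long offset;   // byte offset at which the reader starts
        final long length;   // number of bytes the reader should consume

        ReaderTask(String filePath, long offset, long length) {
            this.filePath = filePath;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return filePath + " [offset=" + offset + ", length=" + length + "]";
        }
    }

    /**
     * Cuts a file of the given length into consecutive tasks of at most
     * blockSize bytes. Each task can be handed to a different reader
     * partition, so a single file can be read in parallel.
     */
    static List<ReaderTask> split(String filePath, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<ReaderTask> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last task may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            tasks.add(new ReaderTask(filePath, offset, length));
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Assumed example values: a 300-byte file and 128-byte blocks.
        for (ReaderTask task : split("/data/example.log", 300L, 128L)) {
            System.out.println(task);
        }
    }
}
```

With the assumed values in main, a 300-byte file and 128-byte blocks give three
tasks, at offsets 0, 128 and 256, the last one 44 bytes long. Without such
tasks a reader has no offset to start from, which is why one partition would
have to consume the whole file.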
