Re: HDFS File Reader Module

Chinmay Kolhatkar Wed, 02 Mar 2016 21:59:55 -0800

+1 for a separate namespace for modules.

On Thu, Mar 3, 2016 at 10:58 AM, Priyanka Gugale <[email protected]>
wrote:


> That is also an option, but then I have a question: do we want to treat
> modules separately, or is a module just a type of operator, maybe a super
> operator?
> Also, I believe it would be better to have feature-wise packages rather than
> packages named after our custom terms, so that anyone can easily locate the
> classes.
>
>
> -Priyanka
>
> On Thu, Mar 3, 2016 at 12:20 AM, Sandesh Hegde <[email protected]>
> wrote:
>
> > My vote is to have a separate namespace for modules.
> >
> > Is it time to introduce
> > org.apache.apex.module.io.fs ?
> >
> > On Wed, Mar 2, 2016 at 3:25 AM Priyanka Gugale <[email protected]
> >
> > wrote:
> >
> > > I am planning to put this module in the malhar-library project, in
> > > package com.datatorrent.lib.io.fs.
> > > Please let me know if this is acceptable.
> > >
> > > -Priyanka
> > >
> > > On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <
> > [email protected]
> > > >
> > > wrote:
> > >
> > > > > I haven't created a branch yet; I will share it with you as soon as I
> > > > > add the code for the module.
> > > > > I would surely be happy to help :)
> > > >
> > > > -Priyanka
> > > >
> > > > On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <
> > [email protected]>
> > > > wrote:
> > > >
> > > >> Priyanka,
> > > >>
> > > > >> Thanks for the update. I will consider these ports during the design
> > > > >> phase of my proposal for HDFS file copy module.
> > > > >>
> > > > >> I believe you are planning to add this to Apex Malhar. Please post any
> > > > >> link / private branch (if any) where I can have a look at the first cut.
> > > > >>
> > > > >> I will ask for your help if I come across any questions, uncertainties
> > > > >> etc.
> > > >>
> > > >> ~ Yogi
> > > >>
> > > >> On 23 February 2016 at 17:59, Priyanka Gugale <
> > [email protected]
> > > >
> > > >> wrote:
> > > >>
> > > > >> > I am planning to have the following ports on this module:
> > > >> >
> > > >> > Ports
> > > >> > Input port: None
> > > >> >
> > > >> > Output port:
> > > >> >
> > > >> >    1. FileMetadata
> > > >> >    2. BlockMetadata
> > > >> >    3. Block bytes
> > > >> >
> > > >> > -Priyanka
> > > >> >
> > > >> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <
> > > [email protected]
> > > >> >
> > > >> > wrote:
> > > >> >
> > > >> > > Priyanka,
> > > >> > >
> > > > >> > > Can you please share details about what would be the output ports from
> > > > >> > > this module?
> > > > >> > >
> > > > >> > > I am thinking of HDFS File Copy Module which can be used in conjunction
> > > > >> > > with this module to copy files from HDFS to HDFS.
> > > >> > >
> > > >> > > ~ Yogi
> > > >> > >
> > > >> > > On 18 February 2016 at 10:29, Mohit Jotwani <
> > [email protected]>
> > > >> > wrote:
> > > >> > >
> > > >> > > > +1 to add this.
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Mohit
> > > >> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <
> [email protected]
> > >
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > +1 to add this module
> > > >> > > > >
> > > >> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale <
> > > >> > > > [email protected]
> > > >> > > > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > > >> > > > > > We need partitions for parallel reads, but how would a reader
> > > > >> > > > > > partition know which offset of the file to read from? Normally
> > > > >> > > > > > FileSplitter creates this metadata (let's call each piece a
> > > > >> > > > > > reader task) and forwards the tasks to the next operator, the
> > > > >> > > > > > block reader. A block reader receives one of the tasks and reads
> > > > >> > > > > > from the specified offset in the file. If FileSplitter is absent,
> > > > >> > > > > > one reader partition has to consume a file entirely, which means
> > > > >> > > > > > we can't read a single file in parallel. I hope this answers your
> > > > >> > > > > > question.
> > > > >> > > > > >
> > > > >> > > > > > The advantage of this module is a reusable component made up of
> > > > >> > > > > > operators that are frequently used together for file reading.
> > > >> > > > > >
> > > >> > > > > > -Priyanka
> > > >> > > > > >
> > > >> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <
> > > >> > > > [email protected]
> > > >> > > > > >
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Let me rephrase Ram's question to make it clear:
> > > >> > > > > > >
> > > >> > > > > > > For an application developer using Malhar:
> > > > >> > > > > > > What are the advantages / disadvantages of using the proposed
> > > > >> > > > > > > HDFS File input Module as compared to directly using
> > > > >> > > > > > > FileSplitter, BlockReader Operators available in Malhar?
> > > >> > > > > > >
> > > >> > > > > > > ~ Yogi
> > > >> > > > > > >
> > > >> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath <
> > > >> > > [email protected]
> > > >> > > > >
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > > >> > > > > > > > Can parallel read not be achieved by partitioning?
> > > >> > > > > > > >
> > > >> > > > > > > > Ram
> > > >> > > > > > > >
> > > >> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale <
> > > >> > > > > > > [email protected]
> > > >> > > > > > > > >
> > > >> > > > > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi,
> > > >> > > > > > > > >
> > > > >> > > > > > > > > It is a common use case to read big files on HDFS in
> > > > >> > > > > > > > > parallel, i.e. many reader threads are used to read the
> > > > >> > > > > > > > > file in parallel. We can achieve this on top of Apex using
> > > > >> > > > > > > > > the following Malhar operators:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 1. AbstractFileSplitter
> > > > >> > > > > > > > > 2. AbstractBlockReader
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > where FileSplitter, based on the file metadata, creates
> > > > >> > > > > > > > > small reader tasks (each reading one part of the file).
> > > > >> > > > > > > > > Those reader tasks are run by BlockReaders in parallel to
> > > > >> > > > > > > > > read the file.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > As these operators are generally used together to achieve
> > > > >> > > > > > > > > a file read operation, I propose we create a module, called
> > > > >> > > > > > > > > HDFSFileReader, for this.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Please provide your suggestions on the same.
> > > >> > > > > > > > >
> > > >> > > > > > > > > -Priyanka
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
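
For readers following the thread, the splitter / block reader division discussed above can be illustrated with a small, language-neutral sketch. This is not Malhar code and does not use the Apex API: `BlockTask`, `split_file`, `read_block` and `parallel_read` are hypothetical names, the "file" is an ordinary local file rather than HDFS, and a thread pool stands in for reader partitions. It only shows why per-block offset metadata is what lets several readers work on a single file at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockTask:
    """Hypothetical reader task: which byte range of which file to read."""
    path: str
    block_id: int
    offset: int
    length: int


def split_file(path, block_size):
    """Splitter role: turn the file's size into block-sized reader tasks."""
    size = os.path.getsize(path)
    return [
        BlockTask(path, block_id, offset, min(block_size, size - offset))
        for block_id, offset in enumerate(range(0, size, block_size))
    ]


def read_block(task):
    """Block reader role: read only the byte range named by the task."""
    with open(task.path, "rb") as f:
        f.seek(task.offset)
        return task.block_id, f.read(task.length)


def parallel_read(path, block_size, readers=4):
    """Run the reader tasks concurrently, then reassemble blocks in order."""
    tasks = split_file(path, block_size)
    with ThreadPoolExecutor(max_workers=readers) as pool:
        blocks = dict(pool.map(read_block, tasks))
    return b"".join(blocks[i] for i in range(len(tasks)))


if __name__ == "__main__":
    # Small demo on a temporary local file (an assumption for illustration).
    fd, demo_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as demo:
        demo.write(b"0123456789" * 100)
    for task in split_file(demo_path, 256):
        print(task.block_id, task.offset, task.length)
    print(len(parallel_read(demo_path, 256)))
    os.remove(demo_path)
```

Without the splitter step, each reader would only know the file name and would have to consume the whole file, which is the limitation described in the thread. The three values a task and its result carry here (file path, block offset and length, block bytes) loosely mirror the FileMetadata, BlockMetadata and block bytes output ports proposed for the module.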
