+1 for a separate namespace for modules.

On Thu, Mar 3, 2016 at 10:58 AM, Priyanka Gugale <[email protected]> wrote:
> That is also an option, but then I have a question: do we want to treat
> modules separately, or is a module just a type of operator, maybe a super
> operator?
> Also, I believe it would be good to have feature-wise packages rather than
> using our custom terms to create packages, so anyone can easily locate the
> classes.
>
> -Priyanka
>
> On Thu, Mar 3, 2016 at 12:20 AM, Sandesh Hegde <[email protected]>
> wrote:
>
> > My vote is to have a separate namespace for modules.
> >
> > Is it time to introduce
> > org.apache.apex.module.io.fs ?
> >
> > On Wed, Mar 2, 2016 at 3:25 AM Priyanka Gugale <[email protected]>
> > wrote:
> >
> > > I am planning to put this module in the malhar-library project, in
> > > package: com.datatorrent.lib.io.fs
> > > Let me know if this is acceptable.
> > >
> > > -Priyanka
> > >
> > > On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <[email protected]>
> > > wrote:
> > >
> > > > I haven't created any branch yet; I will share it with you as soon as
> > > > I add the code for the module.
> > > > Surely would be happy to help :)
> > > >
> > > > -Priyanka
> > > >
> > > > On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <[email protected]>
> > > > wrote:
> > > >
> > > > > Priyanka,
> > > > >
> > > > > Thanks for the update. I will consider these ports during the design
> > > > > phase of my proposal for the HDFS file copy module.
> > > > >
> > > > > I believe you are planning to add this to Apex Malhar. Please post
> > > > > any link / private branch (if any) where I can have a look at the
> > > > > first cut.
> > > > >
> > > > > I will ask for your help if I come across any questions,
> > > > > uncertainties etc.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > On 23 February 2016 at 17:59, Priyanka Gugale <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I am planning to have the following ports on this module:
> > > > > >
> > > > > > Ports
> > > > > > Input port: None
> > > > > >
> > > > > > Output port:
> > > > > >
> > > > > > 1. FileMetadata
> > > > > > 2. BlockMetadata
> > > > > > 3. Block bytes
> > > > > >
> > > > > > -Priyanka
> > > > > >
> > > > > > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Priyanka,
> > > > > > >
> > > > > > > Can you please share details about what the output ports from
> > > > > > > this module would be?
> > > > > > >
> > > > > > > I am thinking of an HDFS File Copy Module which can be used in
> > > > > > > conjunction with this module to copy files from HDFS to HDFS.
> > > > > > >
> > > > > > > ~ Yogi
> > > > > > >
> > > > > > > On 18 February 2016 at 10:29, Mohit Jotwani <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 to add this.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Mohit
> > > > > > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 to add this module
> > > > > > > > >
> > > > > > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > We need partitions for parallel reads, but how will a reader
> > > > > > > > > > partition know which offset of the file it should read from?
> > > > > > > > > > Normally FileSplitter creates this metadata (let's call each
> > > > > > > > > > piece a reader task) and forwards the tasks to the next
> > > > > > > > > > operator, which is the block reader. A block reader receives
> > > > > > > > > > one of the tasks and reads from the specified offset in the
> > > > > > > > > > file. If FileSplitter is absent, one reader partition has to
> > > > > > > > > > consume one file entirely, which means we can't have parallel
> > > > > > > > > > reading over one file. I hope this answers your question.
> > > > > > > > > >
> > > > > > > > > > The advantage of having this module is a reusable component
> > > > > > > > > > made up of operators that are frequently used together to do
> > > > > > > > > > file reading.
> > > > > > > > > >
> > > > > > > > > > -Priyanka
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Let me rephrase Ram's question to make it clear:
> > > > > > > > > > >
> > > > > > > > > > > For an application developer using Malhar:
> > > > > > > > > > > What are the advantages / disadvantages of using the proposed
> > > > > > > > > > > HDFS File input Module as compared to directly using the
> > > > > > > > > > > FileSplitter and BlockReader operators available in Malhar?
> > > > > > > > > > >
> > > > > > > > > > > ~ Yogi
> > > > > > > > > > >
> > > > > > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath <[email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Can parallel read not be achieved by partitioning?
> > > > > > > > > > > >
> > > > > > > > > > > > Ram
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale <[email protected]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > It is a common use case to read big files on HDFS in parallel,
> > > > > > > > > > > > > i.e. many reader threads are used to read the file in parallel.
> > > > > > > > > > > > > We can achieve this on top of Apex using the following Malhar
> > > > > > > > > > > > > operators:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. AbstractFileSplitter
> > > > > > > > > > > > > 2. AbstractBlockReader
> > > > > > > > > > > > >
> > > > > > > > > > > > > where FileSplitter, as per file metadata, creates small reader
> > > > > > > > > > > > > tasks (to read the file in parts). Those reader tasks are run by
> > > > > > > > > > > > > BlockReaders in parallel to read the file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As these operators are generally used together to achieve the
> > > > > > > > > > > > > file read operation, I propose we create a module, called
> > > > > > > > > > > > > HDFSFileReader, for this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please provide your suggestions on the same.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Priyanka
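
For anyone following the thread, the split-then-read flow discussed above (a splitter turns file metadata into reader tasks, and each block reader reads from the offset named in its task) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses java.nio on a local temp file instead of HDFS, and BlockSplitSketch, BlockTask, split and readBlock are hypothetical names, not the Malhar AbstractFileSplitter / AbstractBlockReader API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the Malhar operators themselves.
final class BlockSplitSketch {

  /** Hypothetical "reader task": which byte range of which file to read. */
  static final class BlockTask {
    final String path;
    final long offset;
    final long length;

    BlockTask(String path, long offset, long length) {
      this.path = path;
      this.offset = offset;
      this.length = length;
    }
  }

  /** Splitter role: turn the file size into fixed-size reader tasks. */
  static List<BlockTask> split(String path, long fileSize, long blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<BlockTask> tasks = new ArrayList<>();
    for (long off = 0; off < fileSize; off += blockSize) {
      // The last block may be shorter than blockSize.
      tasks.add(new BlockTask(path, off, Math.min(blockSize, fileSize - off)));
    }
    return tasks;
  }

  /** Reader role: read exactly the byte range named by one task. */
  static byte[] readBlock(BlockTask task) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate((int) task.length);
    try (FileChannel ch = FileChannel.open(Paths.get(task.path), StandardOpenOption.READ)) {
      long pos = task.offset;
      while (buf.hasRemaining()) {
        int n = ch.read(buf, pos); // positional read, independent of other readers
        if (n < 0) {
          throw new IOException("unexpected end of file at " + pos);
        }
        pos += n;
      }
    }
    return buf.array();
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("blocks", ".txt");
    Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
    List<BlockTask> tasks = split(file.toString(), Files.size(file), 4);

    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
      // Each task is read by its own worker, so one file is read in parallel.
      List<Future<byte[]>> results = new ArrayList<>();
      for (BlockTask t : tasks) {
        results.add(pool.submit(() -> readBlock(t)));
      }
      for (int i = 0; i < tasks.size(); i++) {
        String text = new String(results.get(i).get(), StandardCharsets.US_ASCII);
        System.out.println("offset " + tasks.get(i).offset + ": " + text);
      }
    } finally {
      pool.shutdown();
      Files.delete(file);
    }
  }
}
```

Without the split step, a single reader would have to consume the whole file, which is the point made in the thread about why partitioning the reader alone is not enough.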
