I am planning to put this module in the malhar-library project, in package com.datatorrent.lib.io.fs. Let me know if this is acceptable.
-Priyanka

On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <[email protected]> wrote:

> I haven't created any branch yet; I will share it with you as soon as I
> add the code for the module.
> Surely would be happy to help :)
>
> -Priyanka
>
> On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <[email protected]>
> wrote:
>
>> Priyanka,
>>
>> Thanks for the update. I will consider these ports during the design
>> phase of my proposal for the HDFS file copy module.
>>
>> I believe you are planning to add this to Apex Malhar. Please post any
>> link / private branch (if any) where I can have a look at the first cut.
>>
>> I will ask for your help if I come across any questions, uncertainties
>> etc.
>>
>> ~ Yogi
>>
>> On 23 February 2016 at 17:59, Priyanka Gugale <[email protected]>
>> wrote:
>>
>> > I am planning to have the following ports on this module:
>> >
>> > Ports
>> > Input port: None
>> >
>> > Output ports:
>> >
>> > 1. FileMetadata
>> > 2. BlockMetadata
>> > 3. Block bytes
>> >
>> > -Priyanka
>> >
>> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <[email protected]>
>> > wrote:
>> >
>> > > Priyanka,
>> > >
>> > > Can you please share details about what the output ports from this
>> > > module would be?
>> > >
>> > > I am thinking of an HDFS File Copy Module which can be used in
>> > > conjunction with this module to copy files from HDFS to HDFS.
>> > >
>> > > ~ Yogi
>> > >
>> > > On 18 February 2016 at 10:29, Mohit Jotwani <[email protected]>
>> > > wrote:
>> > >
>> > > > +1 to add this.
>> > > >
>> > > > Regards,
>> > > > Mohit
>> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > +1 to add this module
>> > > > >
>> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > We need partitions for parallel read, but how will the
>> > > > > > reader partition know which offset of the file it should
>> > > > > > read from? Normally FileSplitter creates this metadata,
>> > > > > > let's call them reader tasks, and forwards them to the next
>> > > > > > operator, which is the block reader. The block reader will
>> > > > > > receive one of the tasks and read from the specified offset
>> > > > > > in the file. If FileSplitter is absent, one reader partition
>> > > > > > will have to consume one file entirely, which means we can't
>> > > > > > have parallel reading over one file. I hope this answers
>> > > > > > your question.
>> > > > > >
>> > > > > > The advantage of having this module is having a reusable
>> > > > > > component made up of operators which are frequently used
>> > > > > > together to do file reading.
>> > > > > >
>> > > > > > -Priyanka
>> > > > > >
>> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra
>> > > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > Let me rephrase Ram's question to make it clear:
>> > > > > > >
>> > > > > > > For an application developer using Malhar:
>> > > > > > > What are the advantages / disadvantages of using the
>> > > > > > > proposed HDFS File input Module as compared to directly
>> > > > > > > using the FileSplitter and BlockReader operators
>> > > > > > > available in Malhar?
>> > > > > > >
>> > > > > > > ~ Yogi
>> > > > > > >
>> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath
>> > > > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Can parallel read not be achieved by partitioning?
>> > > > > > > >
>> > > > > > > > Ram
>> > > > > > > >
>> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale
>> > > > > > > > <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > It is a common use case to read big files on HDFS in
>> > > > > > > > > a parallel fashion, i.e. many reader threads are used
>> > > > > > > > > to read the file in parallel. We can achieve this on
>> > > > > > > > > top of Apex using the following Malhar operators:
>> > > > > > > > >
>> > > > > > > > > 1. AbstractFileSplitter
>> > > > > > > > > 2. AbstractBlockReader
>> > > > > > > > >
>> > > > > > > > > where FileSplitter, as per file metadata, creates
>> > > > > > > > > small reader tasks (to read the file in parts). Those
>> > > > > > > > > reader tasks are run by BlockReaders in parallel to
>> > > > > > > > > read the file.
>> > > > > > > > >
>> > > > > > > > > As these operators are generally used together to
>> > > > > > > > > achieve the file read operation, I propose we create
>> > > > > > > > > a module, called HDFSFileReader, for this.
>> > > > > > > > >
>> > > > > > > > > Please provide your suggestions on the same.
>> > > > > > > > >
>> > > > > > > > > -Priyanka
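
For readers following the thread, the splitting idea discussed above (a splitter turns one file into reader tasks, each carrying an offset, so that several readers can work on the same file in parallel) can be sketched roughly as below. This is only an illustration under stated assumptions: `BlockSplitSketch`, `ReaderTask`, and `split` are hypothetical names invented for this sketch and are not the Malhar `AbstractFileSplitter` / `AbstractBlockReader` API, and the file path and sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Malhar operator API.
class BlockSplitSketch {

  /** Hypothetical stand-in for the per-block metadata a splitter would emit. */
  static final class ReaderTask {
    final String filePath;
    final long offset;  // byte position where this task starts reading
    final long length;  // number of bytes this task should read

    ReaderTask(String filePath, long offset, long length) {
      this.filePath = filePath;
      this.offset = offset;
      this.length = length;
    }

    @Override
    public String toString() {
      return filePath + " [offset=" + offset + ", length=" + length + "]";
    }
  }

  /** Cuts a file of fileLength bytes into consecutive tasks of at most blockSize bytes. */
  static List<ReaderTask> split(String filePath, long fileLength, long blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    if (fileLength < 0) {
      throw new IllegalArgumentException("fileLength must not be negative");
    }
    List<ReaderTask> tasks = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      // The last task may be shorter than blockSize.
      long length = Math.min(blockSize, fileLength - offset);
      tasks.add(new ReaderTask(filePath, offset, length));
    }
    return tasks;
  }

  public static void main(String[] args) {
    // Each printed task could be handed to a different reader partition.
    for (ReaderTask task : split("/data/big.log", 250L, 100L)) {
      System.out.println(task);
    }
  }
}
```

With a 250-byte file and 100-byte blocks, this yields three tasks at offsets 0, 100, and 200, the last one 50 bytes long. Without such a splitting step, a reader has no offset to start from and must consume a whole file by itself, which is the limitation described in the thread.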
