pig-user  

RE: How to access filenames after loading a directory to an Alias [pig scripting]

Olga Natkovich
Mon, 06 Oct 2008 14:50:31 -0700

That is true. Pig currently does not support that.

Olga 

> -----Original Message-----
> From: Latha [EMAIL PROTECTED] 
> Sent: Monday, October 06, 2008 11:52 AM
> To: pig-user@incubator.apache.org
> Subject: Re: How to access filenames after loading a directory to an
> Alias [pig scripting]
> 
> Hi Olga,
> 
> How can I load individual files from a directory structure at the
> grunt shell?
> 
> "bin/hadoop dfs -lsr" lists all the files in HDFS, irrespective of
> how deep each file sits in the directory tree.
> [It also lists directories :( ]
> 
> However, the Pig grunt shell supports the dfs "ls" command, and not
> the "lsr" command. So it is not possible to get all the filenames
> there: "ls" lists only the top-level directories or files available
> in HDFS.
> Please correct me if I am wrong.
> 
> Rgds,
> Srilatha
> 
> 
> On Mon, Oct 6, 2008 at 9:11 PM, Olga Natkovich 
> <[EMAIL PROTECTED]> wrote:
> 
> > Metadata like filename is not preserved when the data is loaded.
> > You can load individual files and then use the union command, but
> > that will run slower because of extra processing steps.
> >
> > Olga
> >
> > > -----Original Message-----
> > > From: Latha [EMAIL PROTECTED]
> > > Sent: Sunday, October 05, 2008 10:35 AM
> > > To: pig-user@incubator.apache.org
> > > Subject: How to access filenames after loading a directory to an 
> > > Alias [pig scripting]
> > >
> > > Greetings!
> > > Hi, when I load a directory (from HDFS) into an alias and try to
> > > dump it, I find all the lines of the various files in that
> > > directory appearing one after another.
> > > However, I am not able to figure out how to access the filenames
> > > from the alias. I tried to understand script1-hadoop.pig, but I
> > > still cannot work it out.
> > >
> > > A = load 'inputDir' using PigStorage(); dump A;
> > > Output:
> > > ------------------------------------------------
> > > (line 1 from inputDir/insideDir/file1.txt)
> > > (line 2 from inputDir/insideDir/file1.txt)
> > > ...
> > > (line 1 from inputDir/insideDir/innermost/fileone.txt)
> > > ...
> > > etc.
> > > ------------------------------------------------
> > >
> > > I am interested in file-wise results, where I can retain the
> > > filename and get the results grouped per file:
> > >
> > > filename1
> > > ( line1 )
> > > ( line2 )
> > >
> > > filename2
> > > (line 1)
> > > (line 2)
> > > etc.,
> > >
> > > Is there any way I can access filenames from an alias into which a
> > > directory is loaded? The requirement is to iterate through all the
> > > files and, in each file, process every line. Please point me to
> > > the right approach.
> > >
> > > Regards,
> > > Srilatha
> > >
> >
>
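
The workaround suggested in the thread (load individual files, then
combine them with union) could look roughly like the sketch below. It is
an illustration, not something posted by the participants. It assumes a
Pig release that accepts typed schemas in the AS clause, uses
TextLoader() rather than PigStorage() so that each line stays a single
field, and reuses the two example paths from the sample output. The
alias names (F1, T1, ALL_LINES, BY_FILE) are made up for the example.

```pig
-- Sketch only: one LOAD per file, using the example paths from the thread.
-- TextLoader() reads each input line as a single chararray field.
F1 = LOAD 'inputDir/insideDir/file1.txt' USING TextLoader() AS (line:chararray);
F2 = LOAD 'inputDir/insideDir/innermost/fileone.txt' USING TextLoader() AS (line:chararray);

-- Attach the source path to every record as a constant column,
-- since Pig does not carry the filename along by itself.
T1 = FOREACH F1 GENERATE 'inputDir/insideDir/file1.txt' AS filename, line;
T2 = FOREACH F2 GENERATE 'inputDir/insideDir/innermost/fileone.txt' AS filename, line;

-- Combine the tagged relations, then collect the lines of each file together.
ALL_LINES = UNION T1, T2;
BY_FILE = GROUP ALL_LINES BY filename;
DUMP BY_FILE;
```

Each tuple in BY_FILE should pair one filename with a bag of that
file's lines. The file list has to be written out by hand (or generated
outside Pig), and as noted in the thread, the extra load and union steps
make this slower than loading the whole directory at once.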