pig-user  

Re: How to access filenames after loading a directory to an Alias [pig scripting]

Latha
Mon, 06 Oct 2008 11:52:42 -0700

Hi Olga,

How can I load individual files from a directory structure at the
grunt shell?

"bin/hadoop dfs -lsr" lists every file in HDFS, regardless of how deeply
the file is nested in directories.
[It also lists the directories themselves :( ]

However, the Pig grunt shell supports the dfs "ls" command but not the "lsr"
command. So from grunt it is not possible to get all the filenames: "ls" lists
only the top-level directories and files in HDFS.
Please correct me if I am wrong.

Rgds,
Srilatha


On Mon, Oct 6, 2008 at 9:11 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:

> Metadata like filename is not preserved when the data is loaded. You can
> load individual files and then use union command but that will run
> slower because of extra processing steps.
>
> Olga
>
> > -----Original Message-----
> > From: Latha [EMAIL PROTECTED]
> > Sent: Sunday, October 05, 2008 10:35 AM
> > To: pig-user@incubator.apache.org
> > Subject: How to access filenames after loading a directory to
> > an Alias [pig scripting]
> >
> > Greetings!
> > Hi, when I load a directory (from HDFS) into an alias and
> > try to dump it, I find all the lines of the various files in that
> > directory appearing one after another.
> > However, I am not able to figure out how to access the filenames
> > from the alias. I tried to understand script1-hadoop.pig, but I
> > still cannot find out how.
> >
> > A = load 'inputDir' using PigStorage();
> > dump A;
> > Output:
> > ------------------------------------------------
> > (line 1 from inputDir/insideDir/file1.txt)
> > (line 2 from inputDir/insideDir/file1.txt)
> > .
> > (line 1 from inputDir/insideDir/innermost/fileone.txt)
> > ...
> > etc.,
> > ------------------------------------------------
> >
> > I am interested in per-file results, where I can retain the
> > filename and get the results grouped by file.
> >
> > filename1
> > ( line1 )
> > ( line2 )
> >
> > filename2
> > (line 1)
> > (line 2)
> > etc.,
> >
> > Is there any way I can access the filenames from an alias into
> > which a directory is loaded? The requirement is to iterate through
> > all the files and, in each file, process every line. Please point
> > me to the right approach.
> >
> > Regards,
> > Srilatha
> >
>
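
Olga's suggestion above (load each file on its own, then union) could be
scripted along the following lines. This is only a sketch under some
assumptions: the list of file paths has already been collected outside Pig
(for example from "bin/hadoop dfs -lsr" output with the directory entries
removed), each input line is a single field, and the Pig version in use
accepts a string constant inside FOREACH ... GENERATE. The function name
build_pig_script and the aliases raw0, tagged0, result are hypothetical names
made up for this illustration. Python is used only because Pig Latin itself
has no loop construct for iterating over a list of files.

```python
def build_pig_script(paths):
    """Return Pig Latin text that loads each file separately, tags every
    row with the path it came from, and unions the tagged relations."""
    if not paths:
        raise ValueError("need at least one file path")

    statements = []
    tagged_aliases = []
    for i, path in enumerate(paths):
        # A single quote in the path would break the Pig string literal.
        if "'" in path:
            raise ValueError("path contains a single quote: " + path)
        raw = f"raw{i}"        # hypothetical alias for the loaded file
        tagged = f"tagged{i}"  # hypothetical alias for the tagged rows
        statements.append(f"{raw} = LOAD '{path}' USING PigStorage();")
        # Assumption: a string constant is allowed in GENERATE, and $0
        # holds the whole line because the line contains no delimiter.
        statements.append(f"{tagged} = FOREACH {raw} GENERATE '{path}', $0;")
        tagged_aliases.append(tagged)

    if len(tagged_aliases) == 1:
        final = tagged_aliases[0]  # nothing to union with a single file
    else:
        final = "result"
        statements.append("result = UNION " + ", ".join(tagged_aliases) + ";")
    statements.append(f"DUMP {final};")
    return "\n".join(statements)


if __name__ == "__main__":
    # The two example paths from the original question.
    print(build_pig_script([
        "inputDir/insideDir/file1.txt",
        "inputDir/insideDir/innermost/fileone.txt",
    ]))
```

The generated text can be saved to a file and run as a Pig script. Each row of
the final relation then carries its source path as the first field, so the
results can be grouped per file. As Olga notes, this runs slower than loading
the whole directory at once because of the extra processing steps.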