jiang licht
Mon, 15 Mar 2010 17:58:59 -0700
Thanks, Alan.
That is what we are doing right now. But sometimes we only want to include
some of the files in a folder, and a regular expression on the file names
cannot always separate the files we want from the ones we don't. That's why
we want a generic solution.
This is helpful because it is often reasonable to keep only one copy of a big
data set. Then, when someone needs to analyze a subset, they only need to list
the files in the subset and use the load function to load them from that list
(a symlink might do the job, but symlinks are not available in the fs shell).
From the previous post, this seems to be simple, but I haven't found time to
actually look at how to write such a function. Is there any sample code out
there, or any hints for doing this?
Thanks!
Michael
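
A minimal sketch of one possible workaround, not a custom load function and not
something confirmed in this thread: a small script reads the list file and
generates a single LOAD statement whose location is a comma-separated list of
paths. It assumes the Pig version in use passes the location string to Hadoop
in a way that accepts comma-separated paths, which should be checked before
relying on it. The helper names read_path_list and build_load_statement, the
alias "subset", and the tab delimiter are hypothetical choices for illustration.

```python
from pathlib import Path


def read_path_list(list_file):
    """Return the non-empty, non-comment lines of list_file as a list of paths."""
    paths = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            paths.append(line)
    return paths


def build_load_statement(paths, alias="subset", delimiter="\\t"):
    """Build a Pig LOAD statement with a comma-separated location string.

    Assumption: the target Pig version accepts several comma-separated
    paths in one LOAD location. Verify this on your cluster first.
    """
    if not paths:
        raise ValueError("the path list is empty")
    for p in paths:
        # A comma or single quote inside a path would corrupt the statement.
        if "," in p or "'" in p:
            raise ValueError(f"unsupported character in path: {p!r}")
    location = ",".join(paths)
    return f"{alias} = LOAD '{location}' USING PigStorage('{delimiter}');"
```

The generated statement can be pasted into a Pig script or written out by a
wrapper that launches Pig. The list file stays the single place where a subset
is defined, which is the property asked for above.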
--- On Mon, 3/15/10, Alan Gates <ga...@yahoo-inc.com> wrote:
From: Alan Gates <ga...@yahoo-inc.com>
Subject: Re: Custom load function?
To: pig-user@hadoop.apache.org
Date: Monday, March 15, 2010, 1:45 PM
PigStorage (the default load function) takes Hadoop file globs (glob patterns,
not full regular expressions). So as long as you can express these files in a
valid Hadoop glob it should work fine.
Alan.
On Mar 9, 2010, at 7:56 PM, jiang licht wrote:
> Before I read the example, here's a simple thing that I want to implement
> but am not sure how to: I have a list of files that are scattered across
> different folders in a Hadoop cluster. Instead of issuing multiple "load"
> statements to read each file, I want to put the full path names of these
> files in a list and have a load function that takes the file name of the
> list as an argument and then loads these files ...
>
> Thanks,
>
> Michael
>