pig-user  

Re: Custom load function?

jiang licht
Mon, 15 Mar 2010 17:58:59 -0700

Thanks, Alan. 

That is what we are doing right now. But sometimes we only want to include 
some of the files in a folder, and you cannot always use a regular expression 
on the file names to separate what you want from what you don't want. That's 
why we want a generic solution.

This is helpful since it's often reasonable to keep only one copy of a big data 
set. Then when someone needs to analyze a subset, they only need to list the 
files in that subset and use the load function to load them from the list 
(symlinks might do the job, but they are not available in the fs shell). From 
the previous post, this seems to be simple, but I haven't found time to actually 
look at how to write such a function. Is there any sample code out there, or 
any hints for doing this? 
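
Until such a load function exists, one possible workaround is to generate the 
LOAD statement from the list instead. The Python sketch below is only an 
illustration under stated assumptions: the function names and the alias are 
made up for this example, and it assumes that the Pig version in use accepts a 
comma-separated list of locations in a single LOAD, which should be verified 
before relying on it.

```python
def read_path_list(lines):
    """Return the paths found in an iterable of text lines.

    Blank lines and lines starting with '#' are skipped.
    """
    paths = []
    for line in lines:
        path = line.strip()
        if path and not path.startswith("#"):
            paths.append(path)
    return paths


def build_load_statement(alias, paths):
    """Build one Pig LOAD statement covering every path in the list.

    ASSUMPTION: Pig accepts a comma-separated list of locations in LOAD.
    Paths containing a comma or a single quote are rejected because they
    could not be embedded safely in the quoted location string.
    """
    if not paths:
        raise ValueError("the path list is empty")
    for path in paths:
        if "," in path or "'" in path:
            raise ValueError("cannot embed path in LOAD: %r" % path)
    location = ",".join(paths)
    return "%s = LOAD '%s' USING PigStorage();" % (alias, location)
```

For example, passing the lines of a list file to read_path_list and then 
calling build_load_statement("raw", paths) gives a statement that can be 
pasted into a Pig script. This keeps the subset definition in one small text 
file, but it is not a replacement for a real load function that reads the list 
on the cluster.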

Thanks!

Michael

--- On Mon, 3/15/10, Alan Gates <ga...@yahoo-inc.com> wrote:

From: Alan Gates <ga...@yahoo-inc.com>
Subject: Re: Custom load function?
To: pig-user@hadoop.apache.org
Date: Monday, March 15, 2010, 1:45 PM

PigStorage (the default load function) takes Hadoop regular expressions.  So as 
long as you can express these files in a valid Hadoop regular expression it 
should work fine.

Alan.

On Mar 9, 2010, at 7:56 PM, jiang licht wrote:

> Before I read the example, here's a simple thing that I'm not sure how to 
> implement: I have a list of files that are scattered across different 
> folders in a Hadoop cluster. Instead of issuing multiple "load" statements to 
> read each file, I want to put the full path names of these files in a list 
> and then have a load function that takes the file name of the list as an 
> argument and loads these files ...
> 
> Thanks,
> 
> Michael
>