pig-user  

Re: Custom load function?

jiang licht
Mon, 15 Mar 2010 18:25:46 -0700

Thanks, Zaki. Simply put, I want to put file names (some written as regular 
expressions, some as directory globs, for example) on a list and pass the 
file name of that list to the Pig script; the custom load function then reads 
the list and takes care of the actual loading. This way, although different 
people will use different pattern-matching strings or directory globs to name 
their data, the interface to each Pig script is always the same, since it only 
sees the file list. No regular expressions or directory globs go into the Pig 
script, so there is no need to modify the script when you decide to use a 
different data set from time to time.

Thanks,

Michael

--- On Mon, 3/15/10, zaki rahaman <zaki.raha...@gmail.com> wrote:

From: zaki rahaman <zaki.raha...@gmail.com>
Subject: Re: Custom load function?
To: pig-user@hadoop.apache.org
Date: Monday, March 15, 2010, 8:08 PM

I'm not sure I understand. Of course you can specify partial filename
matches or patterns as well as directory globs. I'm not sure why you need
to reinvent the wheel here, so to speak.

On Mon, Mar 15, 2010 at 8:58 PM, jiang licht <licht_ji...@yahoo.com> wrote:

> Thanks, Alan.
>
> That is what we are doing right now. But sometimes we only want to include
> some of the files in a folder, and you cannot simply use a regular expression
> on file names to separate what you want from what you don't want. That's why
> we want a generic solution.
>
> This is helpful since it's often reasonable to keep only one copy of a big
> data set. Then when someone needs to do some analysis on a subset, he only
> needs to fill out a list of the files in the subset and use the load function
> to load them from the list (a symlink might do the job, but symlinks are not
> available in the fs shell). From a previous post, this seems to be simple, but
> I haven't found time to actually look at how to write such a function. Is
> there some sample code out there, or any hints for doing this?
>
> Thanks!
>
> Michael
>
> --- On Mon, 3/15/10, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> From: Alan Gates <ga...@yahoo-inc.com>
> Subject: Re: Custom load function?
> To: pig-user@hadoop.apache.org
> Date: Monday, March 15, 2010, 1:45 PM
>
> PigStorage (the default load function) takes Hadoop regular expressions.
> So as long as you can express these files in a valid Hadoop regular
> expression it should work fine.
>
> Alan.
>
> On Mar 9, 2010, at 7:56 PM, jiang licht wrote:
>
> > Before I read the example, here's a simple thing that I want to know how
> > to implement but am not sure about: I have a list of files that are
> > scattered across different folders in a Hadoop cluster. Instead of firing
> > multiple "load" statements to read each file, I want to put the full path
> > names of these files on a list and then have a load function that takes
> > the file name of the list as an argument and loads these files ...
> >
> > Thanks,
> >
> > Michael
> >



-- 
Zaki Rahaman