A similar need is being expressed by the zebra folks here: https://issues.apache.org/jira/browse/PIG-1337. You might want to comment on or vote for it, as it is scheduled for the 0.8 release.
Loading data in prepareToRead() is fine. As a workaround, I think it should be OK to read the data directly from HDFS in each of the mappers, provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem when thousands of tasks execute them concurrently.

Regards
-...@nkur

On 6/2/10 10:36 PM, "Scott Carey" <[email protected]> wrote:

On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

> Scott,
> You can set hadoop properties at the time of running your pig script
> with the -D option. So
> pig -Dhadoop.property.name=something myscript essentially sets the property
> in the job configuration.
>

So no programmatic configuration of hadoop properties is allowed (where it is easier to control), but it is allowable to set it at the script level? I guess I can do that, but it complicates things. It is also a very poor way to do this: my script has 600 lines of Pig and ~45 M/R jobs, and only three of the jobs need the distributed cache, not all 45.

> Speaking specifically of utilizing the distributed cache feature, you can
> just set the filename in the LoadFunc constructor and then load the data into
> memory in the getNext() method if not already loaded.
>

That is what the original idea was.

> Here is the pig command to set up the distributed cache:
>
> pig
> -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name"
> ---> this name needs to be passed to the UDF constructor so that it is available
> in the mapper/reducer's working dir on the compute node.
> -Dmapred.create.symlink=yes
> script.pig

If that property is set, then the constructor only needs file-name (the symlink), right? Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need access to the full path.

>
> Implement something like a loadData() method that loads the data only once
> and invoke it from the getNext() method. The script will work even in local
> mode if the file distributed via the distributed cache resides in the CWD from
> where the script is invoked.
>

I'm loading the data in prepareToRead(), which seems most appropriate. Do you see any problem with that?

> Hope that's helpful.

I think the command line property hack is insufficient. I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache. Job setup time is already 1/4 of my processing time.

Is there a feature request for Load/Store access to Hadoop job configuration properties? Ideally, this would be a method on LoadFunc that passes a modifiable Configuration object in on the front-end, or a callback through which a user can optionally provide a Configuration object containing the few properties to alter, which Pig then applies to the real configuration before it sets its own properties.

Thanks for the info Ankur,
-Scott
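(To make the side-file pattern above concrete, here is a minimal sketch, assuming the Pig 0.7 LoadFunc API and the two -D properties from the pig command quoted above. The class name LookupLoader, the choice of extending PigStorage, and the tab-separated key/value format of the side file are illustrative assumptions rather than details from the thread; the lazy loadData() call could just as well live in prepareToRead(), as Scott does.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

// Hypothetical loader: ordinary PigStorage input plus an in-memory lookup
// table read from a side file that the distributed cache symlinked into the
// task's working directory (-Dmapred.cache.files=...#file-name together with
// -Dmapred.create.symlink=yes on the pig command line).
public class LookupLoader extends PigStorage {

    private final String symlinkName;          // e.g. "file-name", the cache symlink
    private Map<String, String> lookup = null; // loaded once per task

    public LookupLoader(String symlinkName) {
        super();
        this.symlinkName = symlinkName;
    }

    @Override
    public Tuple getNext() throws IOException {
        if (lookup == null) {
            loadData(); // lazy, once per task; could also be called from prepareToRead()
        }
        Tuple t = super.getNext();
        if (t == null) {
            return null; // end of input
        }
        // ... enrich or filter 't' using 'lookup' here ...
        return t;
    }

    // Reads the symlink from the current working directory. In local mode the
    // same code works if the file sits in the directory the script is run from.
    private void loadData() throws IOException {
        lookup = new HashMap<String, String>();
        BufferedReader br = new BufferedReader(new FileReader(symlinkName));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            br.close();
        }
    }
}

(In a script this would then be used along the lines of A = LOAD '/some/input' USING LookupLoader('file-name'); where '/some/input' is a placeholder path.)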
>
> -...@nkur
>
> On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
>
> So, here are some things I'm struggling with now:
>
> In a LoadFunc, I want to load something into the DistributedCache; the path
> is passed into the LoadFunc constructor as an argument.
> The documentation on getSchema() and all the other metadata methods states that
> you can't modify the job or its configuration passed in. I've verified that
> changes to the Configuration are ignored if set here.
>
> It appears that I could set these properties in setLocation(), but that is
> called a lot on the back-end too, and the documentation does not state whether
> setLocation() is called at all on the front-end. Based on my experimental
> results, it doesn't seem to be.
> Is there no way to modify Hadoop properties on the front-end to utilize
> hadoop features? UDFContext seems completely useless for setting hadoop
> properties for things other than the UDF itself -- like distributed cache
> settings. A stand-alone front-end hook for this would be great. Otherwise,
> any hack that works would be acceptable for now.
>
> * The documentation for LoadMetadata could use some information about when each
>   method gets called -- front-end only? Between what other calls?
> * UDFContext's documentation needs help too --
>   ** addJobConf() is public, but not expected to be used by end-users, right?
>   Several public methods here look like they need better documentation, and the
>   class itself could use a javadoc entry with some example uses.
>
> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>
>> Scott,
>>
>> I made an effort to address the documentation in
>> https://issues.apache.org/jira/browse/PIG-1370
>> If you have a chance, take a look and let me know if it deals with
>> the issues you have or if more work is needed.
>>
>> Alan.
>>
>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>
>>> I have been using these documents for a couple of weeks, implementing
>>> various store and load functionality, and they have been very helpful.
>>>
>>> However, there is room for improvement. What is most unclear is
>>> when the API methods get called. Each method should clearly state
>>> in these documents (and the javadoc) when it is called -- front-end
>>> only? back-end only? both? Sometimes this is obvious, other times
>>> it is not.
>>> For example, without looking at the source code it is not possible to
>>> tell or infer whether pushProjection() is called on the front-end or
>>> back-end, or both. It could be implemented by being called on the
>>> front-end, expecting the loader implementation to persist necessary state
>>> to UDFContext for the back-end, or be called only on the back-end,
>>> or both. One has to look at the PigStorage source to see that it
>>> persists the pushProjection information into UDFContext, so it is
>>> _probably_ only called on the front-end.
>>>
>>> There are also a few types, returned by or passed to these interfaces,
>>> that are completely undocumented. I had to look at the
>>> source code to figure out what ResourceStatistics does and how
>>> ResourceSchema should be used. RequiredField, RequiredFieldList,
>>> and RequiredFieldResponse are all poorly documented aspects of a
>>> public interface.
>>>
>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>
>>>> To add to this, there is also a how-to document on how to go about
>>>> writing load/store functions from scratch in Pig 0.7 at
>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>
>>>> Pradeep
>>>>
>>>> -----Original Message-----
>>>> From: Alan Gates [mailto:[email protected]]
>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>> To: [email protected]
>>>> Cc: Eli Collins
>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>
>>>> At the Bay Area HUG on Wednesday someone (Eli, I think, though I might
>>>> be remembering incorrectly) asked if there was a migration guide for
>>>> moving Pig load and store functions from 0.6 to 0.7. I said there was,
>>>> but I couldn't remember if it had been posted yet or not. In fact, it
>>>> had already been posted to
>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide.
>>>> Also, you can find the list of all incompatible changes for 0.7 at
>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges.
>>>> Sorry, I should have included those links in my original slides.
>>>>
>>>> Alan.
>>>
>>
>
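(A side note on the pushProjection()/UDFContext behavior Scott describes above: the pattern he infers from the PigStorage source -- pushProjection() runs on the front-end and stashes its result in UDFContext so that the back-end copy of the loader can read it back in prepareToRead() -- looks roughly like the sketch below. This is a condensed, hypothetical loader, not the PigStorage source, and it assumes the Pig 0.7 LoadFunc, LoadPushDown, UDFContext, and ObjectSerializer APIs.)

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.LoadPushDown;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.UDFContext;

public class ProjectionAwareLoader extends LoadFunc implements LoadPushDown {

    private static final String PROJECTION_KEY = "required.columns";
    private boolean[] requiredColumns; // recovered on the back-end

    @Override
    public List<OperatorSet> getFeatures() {
        return Arrays.asList(OperatorSet.PROJECTION);
    }

    // Front-end: remember which columns were pushed down by stashing them in
    // UDFContext (keyed only by loader class here; a real loader would use a
    // per-instance key so that multiple LOADs in one script don't collide).
    @Override
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList)
            throws FrontendException {
        if (requiredFieldList == null || requiredFieldList.getFields() == null) {
            return new RequiredFieldResponse(false);
        }
        int width = 0;
        for (RequiredField rf : requiredFieldList.getFields()) {
            width = Math.max(width, rf.getIndex() + 1);
        }
        boolean[] columns = new boolean[width];
        for (RequiredField rf : requiredFieldList.getFields()) {
            if (rf.getIndex() >= 0) {
                columns[rf.getIndex()] = true;
            }
        }
        Properties p = UDFContext.getUDFContext().getUDFProperties(getClass());
        try {
            p.setProperty(PROJECTION_KEY, ObjectSerializer.serialize(columns));
        } catch (IOException e) {
            throw new FrontendException("Could not serialize pushed projection");
        }
        return new RequiredFieldResponse(true);
    }

    // Back-end: read back what pushProjection() stored on the front-end.
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        Properties p = UDFContext.getUDFContext().getUDFProperties(getClass());
        String serialized = p.getProperty(PROJECTION_KEY);
        if (serialized != null) {
            requiredColumns = (boolean[]) ObjectSerializer.deserialize(serialized);
        }
        // ... keep a handle on 'reader' and honor 'requiredColumns' in getNext() ...
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public Tuple getNext() throws IOException {
        return null; // actual record reading elided for brevity
    }
}

(The front-end/back-end split shown here is exactly the part Scott points out is undocumented; it only becomes apparent by reading the PigStorage source.)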
