On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

> Scott,
>       You can set hadoop properties at the time of running your pig script 
> with -D option. So
> pig -Dhadoop.property.name=something myscript essentially sets the property 
> in the job configuration.
> 

So no programmatic configuration of Hadoop properties is allowed (where it's 
easier to control), but it is allowed at the script level?  I guess I can do 
that, but it complicates things.
It is also a very coarse mechanism for my case.  My script has 600 lines of Pig 
and ~45 M/R jobs, and only three of those jobs need the distributed cache, not 
all 45.

> Speaking specifically of the distributed cache feature, you can just set 
> the file name in the LoadFunc constructor and then load the data into 
> memory in the getNext() method, if it is not already loaded.
> 

That is what the original idea was.
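For future readers of this thread: stripped of the Pig-specific types, the load-once pattern Ankur describes looks roughly like the sketch below.  The class and method names are illustrative only (not from any Pig API), and the fake in-memory lookup stands in for reading the cached file.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the load-once pattern: the lookup data is loaded lazily the
// first time a record is processed, not in the constructor, so the cost
// is paid only in tasks that actually run the loader.
public class LazyLookupLoader {
    private final String cacheFileName;   // symlink name passed to the constructor
    private List<String> lookup;          // null until first use

    public LazyLookupLoader(String cacheFileName) {
        this.cacheFileName = cacheFileName;
    }

    // In a real LoadFunc this null check would sit at the top of getNext().
    public String getNext(String rawRecord) {
        if (lookup == null) {
            loadData();
        }
        return lookup.contains(rawRecord) ? rawRecord : null;
    }

    // Stand-in for reading the distributed-cache file from the task's
    // working directory; here we just fake two entries.
    private void loadData() {
        lookup = new ArrayList<>();
        lookup.add("a");
        lookup.add("b");
    }
}
```

The point of the pattern is that the constructor stays cheap (it only records the file name), so instantiating the loader on the front-end does no I/O.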

> Here is the pig command to set up the distributed cache:
> 
>     pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
>         -Dmapred.create.symlink=yes \
>         script.pig
> 
> The name after the '#' needs to be passed to the UDF constructor so that 
> the file is available in the mapper/reducer's working directory on the 
> compute node.

If that property is set, then the constructor only needs the file name (the 
symlink), right?  Right now I'm trying to set those properties through the 
DistributedCache static interface, which means I need access to the full path.

> 
> Implement something like a loadData() method that loads the data only 
> once, and invoke it from the getNext() method.  The script will even work 
> in local mode if the file distributed via the distributed cache resides in 
> the CWD from which the script is invoked.
> 

I'm loading the data in prepareToRead(), which seems like the most appropriate 
place.  Do you see any problem with that?
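The file-reading half of this is independent of Pig: with mapred.create.symlink=yes, the cached file shows up in the task's working directory under the name after the '#', so loading it is plain local I/O.  A rough sketch (the class name is mine, not a Pig API):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Reading a distributed-cache file by its symlink name. The relative
// path resolves against the task's working directory on the compute
// node, or against the CWD the script was run from in local mode.
public class CacheFileReader {
    public static List<String> readCacheFile(String symlinkName) {
        try {
            return Files.readAllLines(Paths.get(symlinkName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A LoadFunc's prepareToRead() (or the lazy path in getNext()) could call something like this with the symlink name that was handed to the constructor.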

> Hope that's helpful.

I think the command-line property hack is insufficient.  I am left with a 
choice between having a couple of jobs read the file from HDFS directly in 
their mappers, or having all 45 jobs set up the distributed cache 
unnecessarily.  Job setup time is already a quarter of my total processing 
time.
Is there a feature request for Load/Store access to Hadoop job configuration 
properties?

Ideally, this would be a method on LoadFunc that passes a modifiable 
Configuration object in on the front-end, or a callback through which a user 
can optionally provide a Configuration object containing just the properties 
to alter, which Pig would then apply to the real configuration before 
finalizing it.
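To make the proposal concrete, here is a hypothetical sketch of that hook.  None of these names exist in Pig today, and the Conf class is a toy stand-in for Hadoop's Configuration; this only illustrates the callback shape I have in mind.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical front-end hook -- none of these names are real Pig APIs.
// A loader gets one chance, on the front-end, to drop a few properties
// into the job configuration before Pig finalizes it.
public class FrontEndHookSketch {

    // Toy stand-in for Hadoop's Configuration.
    static class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    // The optional callback a LoadFunc could implement.
    interface FrontEndConfigurable {
        void setJobProperties(Conf conf);
    }

    // How Pig's front end might invoke it while building the job conf:
    // only loaders that implement the interface pay any cost.
    static Conf buildJobConf(FrontEndConfigurable loader) {
        Conf conf = new Conf();
        loader.setJobProperties(conf);   // loader adds only what it needs
        return conf;
    }
}
```

With something like this, only the three jobs whose loaders actually need the distributed cache would set mapred.cache.files, instead of all 45.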

Thanks for the info Ankur,

-Scott

> 
> -...@nkur
> 
> On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
> 
> So, here are some things I'm struggling with now:
> 
> In a LoadFunc, I want to load something into the DistributedCache.  The 
> path is passed into the LoadFunc constructor as an argument.
> The documentation for getSchema() and all the other metadata methods states 
> that you can't modify the job or its configuration passed in.  I've 
> verified that changes made to the Configuration there are ignored.
> 
> It appears that I could set these properties in setLocation(), but that is 
> called frequently on the back-end too, and the documentation does not state 
> whether setLocation() is called on the front-end at all.  Based on my 
> experimental results, it doesn't seem to be.
> Is there no way to modify Hadoop properties on the front-end in order to 
> use Hadoop features?  UDFContext seems useless for setting Hadoop 
> properties for anything other than the UDF itself -- like distributed 
> cache settings.  A stand-alone front-end hook for this would be great.  
> Otherwise, any hack that works would be acceptable for now.
> 
> 
> * The documentation for LoadMetadata could use some information about when 
> each method gets called -- front-end only?  Between what other calls?
> * UDFContext's documentation needs help too:
> ** addJobConf() is public, but presumably not intended for end users, right?  
> Several public methods here look like they need better documentation, and the 
> class itself could use a javadoc entry with some example uses.
> 
> 
> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
> 
>> Scott,
>> 
>> I made an effort to address the documentation in 
>> https://issues.apache.org/jira/browse/PIG-1370
>> If you have a chance, take a look and let me know if it deals with
>> the issues you have or if more work is needed.
>> 
>> Alan.
>> 
>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>> 
>>> I have been using these documents for a couple weeks, implementing
>>> various store and load functionality, and they have been very helpful.
>>> 
>>> However, there is room for improvement.  What is most unclear is
>>> when the API methods get called.  Each method should clearly state
>>> in these documents (and the javadoc) when it is called -- front-end
>>> only? back-end only?  both?  Sometimes this is obvious, other times
>>> it is not.
>>> For example, without looking at the source code it's not possible to
>>> tell or infer whether pushProjection() is called on the front-end, the
>>> back-end, or both.  It could be implemented by being called on the
>>> front-end, expecting the loader implementation to persist the necessary
>>> state to UDFContext for the back-end; or be called only on the back-end;
>>> or both.  One has to look at the PigStorage source to see that it
>>> persists the pushProjection information into UDFContext, so it's
>>> _probably_ only called on the front-end.
>>> 
>>> There are also a few types that these interfaces return or are
>>> provided that are completely undocumented.  I had to look at the
>>> source code to figure out what ResourceStatistics does, and how
>>> ResourceSchema should be used.  RequiredField, RequiredFieldList,
>>> and RequiredFieldResponse are all poorly documented aspects of a
>>> public interface.
>>> 
>>> 
>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>> 
>>>> To add to this, there is also a how-to document on how to go about
>>>> writing load/store functions from scratch in Pig 0.7 at
>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>> 
>>>> Pradeep
>>>> 
>>>> -----Original Message-----
>>>> From: Alan Gates [mailto:[email protected]]
>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>> To: [email protected]
>>>> Cc: Eli Collins
>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>> 
>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>>> be remembering incorrectly) asked if there was a migration guide for
>>>> moving Pig load and store functions from 0.6 to 0.7.  I said there
>>>> was
>>>> but I couldn't remember if it had been posted yet or not.  In fact it
>>>> had already been posted to
>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>> .  Also, you can find the list of all incompatible changes for 0.7 at
>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>> .  Sorry, I should have included those links in my original slides.
>>>> 
>>>> Alan.
>>> 
>> 
> 
> 
