Scott,
You can set Hadoop properties at the time you run your Pig script with
the -D option. So

  pig -Dhadoop.property.name=something myscript

essentially sets the property in the job configuration.
Speaking specifically of the distributed cache feature, you can just
capture the filename in the LoadFunc constructor and then load the data
into memory in the getNext() method if it is not already loaded.
Here is the pig command to set up the distributed cache:

  pig \
    -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
    -Dmapred.create.symlink=yes \
    script.pig

The name after the '#' (file-name above) needs to be passed to the UDF
constructor so that the file is available in the mapper/reducer's working
directory on the compute node.
Implement something like a loadData() method that loads the data only once
and invoke it from the getNext() method. The script will even work in local
mode if the file distributed via the distributed cache resides in the CWD
from which the script is invoked.
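Here is a rough sketch of that pattern against the Pig 0.7 LoadFunc API;
MyLoader, the tab-separated lookup format, and the loadData() details are
just illustrative placeholders, not code from any shipped loader:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.pig.LoadFunc;
  import org.apache.pig.data.Tuple;

  // Sketch only: the remaining LoadFunc methods (setLocation(),
  // getInputFormat(), prepareToRead()) are left to the real loader.
  public abstract class MyLoader extends LoadFunc {

      // Symlink name from the '#' fragment of mapred.cache.files,
      // passed from the script: ... USING MyLoader('file-name');
      private final String cacheFileName;
      private Map<String, String> lookup;   // null until loadData() runs

      public MyLoader(String cacheFileName) {
          this.cacheFileName = cacheFileName;
      }

      // Loads the side file exactly once.  The symlink appears in the
      // task's working directory on the compute node, so a relative path
      // works on the back-end; in local mode the same code works when the
      // file sits in the CWD the script was launched from.
      private void loadData() throws IOException {
          if (lookup != null) {
              return;
          }
          lookup = new HashMap<String, String>();
          BufferedReader in = new BufferedReader(new FileReader(cacheFileName));
          try {
              String line;
              while ((line = in.readLine()) != null) {
                  String[] parts = line.split("\t", 2);
                  lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
              }
          } finally {
              in.close();
          }
      }

      @Override
      public Tuple getNext() throws IOException {
          loadData();   // no-op after the first call
          // ... read the next record, consult 'lookup' as needed,
          // build and return the output tuple ...
          return null;  // placeholder
      }
  }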
Hope that's helpful.
-...@nkur
On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
So, here are some things I'm struggling with now:
In a LoadFunc, I want to load something into the DistributedCache; the path
is passed into the LoadFunc constructor as an argument.
Documentation on getSchema() and all the other metadata methods states that
you can't modify the job or the configuration passed in. I've verified that
changes to the Configuration are ignored if set there.
It appears that I could set these properties in setLocation(), but that is
called a lot on the back-end too, and the documentation does not state
whether setLocation() is called at all on the front-end. Based on my
experimental results, it doesn't seem to be.
Is there no way to modify Hadoop properties on the front-end to utilize
Hadoop features? UDFContext seems completely useless for setting Hadoop
properties for anything other than the UDF itself -- like distributed cache
settings. A stand-alone front-end hook for this would be great. Otherwise,
any hack that works would be acceptable for now.
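(For reference, the UDF-scoped round trip that UDFContext does support looks
roughly like the sketch below -- MyLoader and the property key are made-up
placeholders -- but it only carries properties for the UDF itself, not
arbitrary job configuration:

  import java.util.Properties;

  import org.apache.pig.impl.util.UDFContext;

  // Front-end (e.g. in the LoadFunc constructor or getSchema()):
  Properties props = UDFContext.getInstance()
                               .getUDFProperties(MyLoader.class);
  props.setProperty("my.loader.cache.file", cacheFileName);

  // Back-end (e.g. in prepareToRead() or getNext()):
  String name = UDFContext.getInstance()
                          .getUDFProperties(MyLoader.class)
                          .getProperty("my.loader.cache.file");
)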
* The documentation for LoadMetadata could use some information about when
each method gets called -- front-end only? Between which other calls?
* UDFContext's documentation needs help too --
** addJobConf() is public, but not expected to be used by end-users, right?
Several public methods here look like they need better documentation, and the
class itself could use a javadoc entry with some example uses.
On May 24, 2010, at 11:06 AM, Alan Gates wrote:
> Scott,
>
> I made an effort to address the documentation in
> https://issues.apache.org/jira/browse/PIG-1370
> If you have a chance, take a look and let me know if it deals with
> the issues you have or if more work is needed.
>
> Alan.
>
> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>
>> I have been using these documents for a couple of weeks, implementing
>> various store and load functionality, and they have been very helpful.
>>
>> However, there is room for improvement. What is most unclear is
>> when the API methods get called. Each method should clearly state
>> in these documents (and the javadoc) when it is called -- front-end
>> only? back-end only? both? Sometimes this is obvious, other times
>> it is not.
>> For example, without looking at the source code it's not possible to
>> tell or infer whether pushProjection() is called on the front-end, the
>> back-end, or both. It could be called on the front-end, with the loader
>> implementation expected to persist the necessary state to UDFContext for
>> the back-end, or it could be called only on the back-end, or both. One
>> has to look at the PigStorage source to see that it persists the
>> pushProjection information into UDFContext, so it's _probably_ only
>> called on the front-end.
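>> (A sketch of that pattern, pieced together from reading the source
>> rather than copied from it -- the "required.columns" key and the
>> index-only handling are just illustrative:
>>
>>   import java.util.Properties;
>>
>>   import org.apache.pig.LoadPushDown.RequiredField;
>>   import org.apache.pig.LoadPushDown.RequiredFieldList;
>>   import org.apache.pig.LoadPushDown.RequiredFieldResponse;
>>   import org.apache.pig.impl.util.UDFContext;
>>
>>   // Front-end: remember which columns were requested so the back-end
>>   // instance of the loader can find them again.
>>   public RequiredFieldResponse pushProjection(RequiredFieldList fields) {
>>       StringBuilder cols = new StringBuilder();
>>       for (RequiredField f : fields.getFields()) {
>>           if (cols.length() > 0) {
>>               cols.append(',');
>>           }
>>           cols.append(f.getIndex());
>>       }
>>       Properties props = UDFContext.getInstance()
>>                                    .getUDFProperties(getClass());
>>       props.setProperty("required.columns", cols.toString());
>>       return new RequiredFieldResponse(true);
>>   }
>>
>> On the back-end the loader would then read "required.columns" back out
>> of the same UDFContext properties, e.g. in prepareToRead().)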
>>
>> There are also a few types, returned by or passed to these interfaces,
>> that are completely undocumented. I had to look at the
>> source code to figure out what ResourceStatistics does, and how
>> ResourceSchema should be used. RequiredField, RequiredFieldList,
>> and RequiredFieldResponse are all poorly documented aspects of a
>> public interface.
>>
>>
>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>
>>> To add to this, there is also a how-to document on how to go about
>>> writing load/store functions from scratch in Pig 0.7 at
>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>
>>> Pradeep
>>>
>>> -----Original Message-----
>>> From: Alan Gates [mailto:[email protected]]
>>> Sent: Friday, May 21, 2010 11:33 AM
>>> To: [email protected]
>>> Cc: Eli Collins
>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>
>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>> be remembering incorrectly) asked if there was a migration guide for
>>> moving Pig load and store functions from 0.6 to 0.7. I said there was,
>>> but I couldn't remember if it had been posted yet or not. In fact it
>>> had already been posted to
>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide. Also, you can find
>>> the list of all incompatible changes for 0.7 at
>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges. Sorry, I should
>>> have included those links in my original slides.
>>>
>>> Alan.
>>
>