[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

Alan Gates (JIRA) Tue, 21 Sep 2010 11:32:58 -0700

    [ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913154#action_12913154 ]


Alan Gates commented on PIG-1337:
---------------------------------

The problem with allowing load and store functions access to the config file is 
that the config file they see is not the config file that goes to Hadoop.  This 
is not all Pig's fault (see comments above on this).  The other problem is that 
multiple instances of the same load and store function may be operating in a 
given script, so there are namespace issues to resolve.
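
The namespace issue can be sketched as follows. This is a minimal, self-contained illustration and not Pig's actual UDFContext code: the class ScopedUdfProperties, the stand-in loader class, the "signature" strings, and the "cache.files" key are all hypothetical names chosen for the example. The idea is that properties are keyed by UDF class plus a per-instance signature, so two instances of the same loader in one script do not overwrite each other's settings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical per-UDF property store; a stand-in, not Pig's real API.
public class ScopedUdfProperties {

    // Hypothetical loader type, used only to have a class to key on.
    static class TableLoaderStandIn { }

    private final Map<String, Properties> byInstance = new HashMap<>();

    // Key on class name plus a per-instance signature so that two
    // instances of the same load function get separate namespaces.
    public Properties forInstance(Class<?> udfClass, String signature) {
        String key = udfClass.getName() + "#" + signature;
        return byInstance.computeIfAbsent(key, k -> new Properties());
    }

    public static void main(String[] args) {
        ScopedUdfProperties ctx = new ScopedUdfProperties();

        // Two uses of the same loader class in one script, each with
        // its own (assumed) signature.
        ctx.forInstance(TableLoaderStandIn.class, "load-A")
           .setProperty("cache.files", "/data/a.idx");
        ctx.forInstance(TableLoaderStandIn.class, "load-B")
           .setProperty("cache.files", "/data/b.idx");

        // Each instance reads back only its own value.
        System.out.println(ctx.forInstance(TableLoaderStandIn.class, "load-A")
                              .getProperty("cache.files"));
        System.out.println(ctx.forInstance(TableLoaderStandIn.class, "load-B")
                              .getProperty("cache.files"));
    }
}
```

Without the signature in the key, the second setProperty call would silently replace the first, which is the collision described above.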

The proposal for Hadoop 0.22 is that, rather than providing access to the config 
file at all, Hadoop will serialize objects such as InputFormat and OutputFormat 
and pass those to the backend.  It will make sense for Pig to follow suit and 
serialize all UDFs on the front end.  This will remove the need for the 
UDFContext black magic that we do at the moment and should allow all UDFs to 
easily transfer information from front end to backend.
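
As a rough sketch of what serializing a UDF on the front end could look like, the example below round-trips a loader object through plain Java serialization. It assumes nothing about how Hadoop 0.22 or Pig would actually implement this; SerializedLoaderSketch, CacheAwareLoader, and its fields are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedLoaderSketch {

    // Hypothetical load function whose front-end state travels with it.
    static class CacheAwareLoader implements Serializable {
        private static final long serialVersionUID = 1L;
        final String location;
        String cacheFiles; // set on the front end, read on the backend

        CacheAwareLoader(String location) {
            this.location = location;
        }
    }

    // Front end: turn the configured object into bytes to ship with the job.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        }
        return buf.toByteArray();
    }

    // Backend: rebuild the object, including whatever the front end set.
    static Object fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        CacheAwareLoader frontEnd = new CacheAwareLoader("/tables/t1");
        frontEnd.cacheFiles = "/tables/t1/.index";

        CacheAwareLoader backEnd =
            (CacheAwareLoader) fromBytes(toBytes(frontEnd));
        System.out.println(backEnd.location + " " + backEnd.cacheFiles);
    }
}
```

Because each instance carries its own fields, state set on the front end reaches the backend without a shared config file, and there is no shared namespace to collide in.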

So, hopefully this can get resolved when Pig migrates to Hadoop 0.22, whenever 
that is.

> Need a way to pass distributed cache configuration information to hadoop 
> backend in Pig's LoadFunc
> --------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1337
>                 URL: https://issues.apache.org/jira/browse/PIG-1337
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Chao Wang
>
> The Zebra storage layer needs to use distributed cache to reduce name node 
> load during job runs.
> To do this, Zebra needs to set up distributed cache related configuration 
> information in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object 
> here is not the one that is being serialized to map/reduce backend. As such, 
> the distributed cache is not set up properly.
> To work around this problem, we need Pig's LoadFunc to provide a way to set 
> up distributed cache information in a conf object that is the one actually 
> used by the map/reduce backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
