[jira] [Updated] (PIG-4697) Pig needs to serialize only part of the udfcontext for each vertex

Rohini Palaniswamy (JIRA) Wed, 14 Oct 2015 23:43:18 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rohini Palaniswamy updated PIG-4697:
------------------------------------
    Attachment: PIG-4697-1.patch

Changes done:
   - Only serialize LoadFuncs in MRInput payload
   - Only serialize StoreFuncs in MROutput payload
   - Serialize all LoadFunc, StoreFunc, EvalFunc in Processor payload
   - Serialize only EvalFunc in edge payload for Combiner
   - If we cannot match a UDFContextKey to the plan of any of the vertices, 
we add it everywhere.
   - Keep a local copy of the serialized PigContext, udf.import.list and Tez 
plan instead of serializing them again and again.
   - Make a copy of payLoadConf earlier for the MRInput/MROutput payload to 
avoid copying over a lot of unnecessary entries.
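
The selection rule in the list above can be sketched roughly as follows. This 
is not the code in PIG-4697-1.patch: the class, enum and method names are 
hypothetical, and it assumes each UDFContext entry has already been classified 
as belonging to a LoadFunc, StoreFunc or EvalFunc (or to nothing, when the key 
matched no vertex plan). It also ignores which vertex's plan an entry belongs 
to and filters by function kind only.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: not the code in PIG-4697-1.patch. All names are hypothetical. */
public class UdfContextFilterSketch {

    /** Kind of function that owns a UDFContext entry. */
    enum UdfKind { LOAD_FUNC, STORE_FUNC, EVAL_FUNC }

    /** Payload types from the change list, each with the kinds it keeps. */
    enum Payload {
        MR_INPUT(EnumSet.of(UdfKind.LOAD_FUNC)),
        MR_OUTPUT(EnumSet.of(UdfKind.STORE_FUNC)),
        PROCESSOR(EnumSet.allOf(UdfKind.class)),
        COMBINER_EDGE(EnumSet.of(UdfKind.EVAL_FUNC));

        final Set<UdfKind> kept;

        Payload(Set<UdfKind> kept) {
            this.kept = kept;
        }
    }

    /** A context entry; kind == null means the key matched no vertex plan. */
    static final class ContextEntry {
        final UdfKind kind;
        final String properties;

        ContextEntry(UdfKind kind, String properties) {
            this.kind = kind;
            this.properties = properties;
        }
    }

    /** Returns only the entries the given payload needs, plus unmatched ones. */
    static Map<String, ContextEntry> filter(Map<String, ContextEntry> all, Payload payload) {
        Map<String, ContextEntry> result = new LinkedHashMap<>();
        for (Map.Entry<String, ContextEntry> e : all.entrySet()) {
            UdfKind kind = e.getValue().kind;
            // Fallback from the change list: unmatched keys go everywhere.
            if (kind == null || payload.kept.contains(kind)) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    /** Small hand-made context used by main. */
    static Map<String, ContextEntry> sample() {
        Map<String, ContextEntry> all = new LinkedHashMap<>();
        all.put("loader", new ContextEntry(UdfKind.LOAD_FUNC, "load settings"));
        all.put("storer", new ContextEntry(UdfKind.STORE_FUNC, "store settings"));
        all.put("eval", new ContextEntry(UdfKind.EVAL_FUNC, "eval settings"));
        all.put("unmatched", new ContextEntry(null, "unknown owner"));
        return all;
    }

    public static void main(String[] args) {
        // Print which keys each payload type would carry.
        for (Payload p : Payload.values()) {
            System.out.println(p + " -> " + filter(sample(), p).keySet());
        }
    }
}
```

In this sketch the unmatched entry is deliberately kept in every payload, 
trading some size for safety when the owner of a key cannot be determined.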


> Pig needs to serialize only part of the udfcontext for each vertex
> ------------------------------------------------------------------
>
>                 Key: PIG-4697
>                 URL: https://issues.apache.org/jira/browse/PIG-4697
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4697-1.patch
>
>
>   What HCatLoader/HCatStorer put in UDFContext is huge, and if there are 
> multiple of them in the Pig script, both the data sent to the Tez AM and the 
> data the Tez AM sends to tasks become huge, causing either RPC limit 
> exceeded errors or OOM issues.  If Pig serializes only the part of the 
> UDFContext that is required for each vertex, it will save a lot.  HCat folks 
> are also looking at cleaning up what goes into the conf (it ends up 
> serializing the whole job conf, not just hive-site.xml) and moving the common 
> part out to be shared by all HCat loaders and storers. 
> Also looking at other options for faster and more compact serialization. 
> Will create separate JIRAs for that. Will use PIG-4653 to clean up all other 
> Pig config other than the UDFContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
