[ https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-4697:
------------------------------------
Attachment: PIG-4697-1.patch
Changes done:
- Serialize only LoadFuncs in the MRInput payload.
- Serialize only StoreFuncs in the MROutput payload.
- Serialize all LoadFuncs, StoreFuncs and EvalFuncs in the Processor payload.
- Serialize only EvalFuncs in the edge payload for the Combiner.
- If a UDFContextKey cannot be matched to the plan of any of the vertices,
add it everywhere.
- Keep a local copy of the serialized PigContext, udf.import.list and Tez plan
instead of serializing them again and again.
- Make the copy of payLoadConf earlier for the MRInput/MROutput payloads, so
that a lot of unnecessary data is not copied over.
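
The selection rules in the list above can be illustrated with a small
standalone sketch. This is not the code from PIG-4697-1.patch and does not use
Pig's UDFContext API: the class UdfContextFilter, the enums FuncKind and
Payload, and the plain string maps are all hypothetical stand-ins. It also
simplifies by classifying each key by UDF kind only, whereas the patch matches
keys against the plan of each vertex.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of per-payload UDFContext filtering; not Pig code.
final class UdfContextFilter {
    // Kind of UDF that owns a context entry (illustrative only).
    enum FuncKind { LOAD, STORE, EVAL }

    // Payload targets named after the items in the change list.
    enum Payload { MR_INPUT, MR_OUTPUT, PROCESSOR, COMBINER_EDGE }

    // UDF kinds each payload carries, following the change list.
    static Set<FuncKind> kindsFor(Payload payload) {
        switch (payload) {
            case MR_INPUT:      return EnumSet.of(FuncKind.LOAD);
            case MR_OUTPUT:     return EnumSet.of(FuncKind.STORE);
            case COMBINER_EDGE: return EnumSet.of(FuncKind.EVAL);
            default:            return EnumSet.allOf(FuncKind.class);
        }
    }

    // Returns the entries to serialize for one payload. A key missing from
    // keyKinds could not be matched to any UDF, so it is kept everywhere.
    static Map<String, String> select(Map<String, String> udfContext,
                                      Map<String, FuncKind> keyKinds,
                                      Payload payload) {
        Set<FuncKind> wanted = kindsFor(payload);
        Map<String, String> selected = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : udfContext.entrySet()) {
            FuncKind kind = keyKinds.get(entry.getKey());
            if (kind == null || wanted.contains(kind)) {
                selected.put(entry.getKey(), entry.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("loader", "load-side state");
        ctx.put("storer", "store-side state");
        ctx.put("eval", "eval-side state");
        ctx.put("unknown", "unmatched state");

        Map<String, FuncKind> kinds = new HashMap<>();
        kinds.put("loader", FuncKind.LOAD);
        kinds.put("storer", FuncKind.STORE);
        kinds.put("eval", FuncKind.EVAL);

        // prints [loader, unknown]
        System.out.println(select(ctx, kinds, Payload.MR_INPUT).keySet());
        // prints [storer, unknown]
        System.out.println(select(ctx, kinds, Payload.MR_OUTPUT).keySet());
    }
}
```

In this model an MRInput payload with a single loader no longer carries the
state of every storer in the script, which is the size reduction the change
is after.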
> Pig needs to serialize only part of the udfcontext for each vertex
> ------------------------------------------------------------------
>
> Key: PIG-4697
> URL: https://issues.apache.org/jira/browse/PIG-4697
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4697-1.patch
>
>
> What HCatLoader/HCatStorer put in UDFContext is huge, and if there are
> multiple of them in the Pig script, both the data sent to the Tez AM and
> the data that the Tez AM sends to tasks become huge, causing either RPC
> limit exceeded errors or OOM issues. If Pig serializes only the part of the
> udfcontext that is required for each vertex, it will save a lot. HCat folks
> are also looking at cleaning up what goes into the conf (it ends up
> serializing the whole job conf, not just hive-site.xml) and moving the common
> part out to be shared by all HCat loaders and storers.
> Also looking at other options for faster and more compact serialization. Will
> create separate jiras for that. Will use PIG-4653 to clean up all Pig
> config other than udfcontext.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)