[
https://issues.apache.org/jira/browse/TEZ-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338058#comment-14338058
]
Siddharth Seth commented on TEZ-2144:
-------------------------------------
bq. When you go with the option of writing input splits on HDFS it does not cause memory pressure for AM like the other two options right?
This will not cause memory pressure for the AM. I haven't used this code path in
a while, but as long as an AM Input plugin is not used, the AM will not load
the splits into memory.
There are a couple of things to take into account in this case though.
- MRInput reading splits from an offset in a split file (similar to MR) relies
on a specific file name. This means there will be conflicts on LocalResources
across different vertices which read from HDFS. I have to check, but I believe
this will just cause re-use to not work across vertices.
- If a container does not have the specific file (via re-use), and it's
allocated to a task which requires the file (e.g. an MRInput read deep inside
the DAG), the file will get re-localized, which means we'll try sending the
file over RPC. This obviously breaks: the file was originally too big to send,
and Tez re-localization is not efficient - it is per container instead of per
node. I don't think we have an option to disable re-localization, which may be
required. There could be work-arounds, such as including a fake file in each
vertex.
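The offset-based split file layout mentioned above can be illustrated with a small sketch using plain JDK I/O. The record format here is hypothetical (not the actual MR/Tez split serialization): the writer records the byte offset of each split, and a task seeks directly to its own record instead of loading the whole file.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch of offset-based split-file access (hypothetical record format,
// not the real MR/Tez split serialization).
public class SplitFileSketch {

    // Write each split description sequentially, remembering where each
    // record starts so tasks can later seek straight to their own split.
    public static List<Long> writeSplits(File f, List<String> splits) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            for (String s : splits) {
                offsets.add((long) out.size()); // byte offset before this record
                out.writeUTF(s);                // length-prefixed record
            }
        }
        return offsets;
    }

    // A task reads only its own split by seeking to the recorded offset.
    public static String readSplit(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(offset);
            return raf.readUTF();
        }
    }
}
```

This is why the specific file name matters: the (offset, length) metadata a task carries is only meaningful against the exact file it was generated for.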
On compression:
It doesn't look like we're compressing the payload for Inputs. For the
MRSplitDistributor case it would make sense to include this; not so much for
the AMSplitGenerator, since 'Configuration payloads' are typically compressed,
and that will be the main item for the AMSplitGenerator payload.
This may help send the data over RPC, but the individual splits will still sit
inside the AM heap. Depending on compression etc., this could be fairly large.
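A minimal sketch of what compressing a serialized payload before it crosses the RPC boundary could look like, using java.util.zip (the class and method names here are illustrative, not the actual Tez API):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative payload compression (not the actual Tez API): deflate the
// serialized bytes before sending, inflate them on the receiving side.
public class PayloadCompression {

    public static byte[] compress(byte[] payload) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(payload);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(payload.length);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Note the caveat from the comment above: this only shrinks the bytes on the wire; once decompressed, the splits occupy their full size in the AM heap.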
> Compressing user payload
> ------------------------
>
> Key: TEZ-2144
> URL: https://issues.apache.org/jira/browse/TEZ-2144
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> Pig sets the input split information in user payload and when running against
> a table with 10s of 1000s of partitions, DAG submission fails with
> java.io.IOException: Requested data length 305844060 is longer than maximum
> configured RPC length 67108864
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)