[
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967911#comment-13967911
]
Patrick Wendell edited comment on SPARK-1476 at 4/13/14 7:01 PM:
-----------------------------------------------------------------
[[email protected]] I think the proposed change would benefit from a
design doc to explain exactly the cases we want to fix and what trade-offs we
are willing to make in terms of complexity.
Agreed that there is definitely room for improvement in the out-of-the-box
behavior here.
Right now the limits as I understand them are (a) the shuffle output from one
mapper to one reducer cannot be more than 2GB. (b) partitions of an RDD cannot
exceed 2GB.
I see (a) as the bigger of the two issues. It would be helpful to have specific
examples of workloads where this causes a problem and the associated data
sizes, etc. For instance, say I want to do a 1 Terabyte shuffle. Right now
number of (mappers * reducers) needs to be > ~1000 for this to work (e.g. 100
mappers and 10 reducers) assuming uniform partitioning. That doesn't seem too
crazy of an assumption, but if you have skew this would be a much bigger
problem.
Would it be possible to improve (a) but not (b) with a much simpler design? I'm
not sure (maybe they reduce to the same problem), but it's something a design
doc could help flesh out.
Popping up a bit - I think our goal should be to handle reasonable workloads
and not to be 100% compliant with the semantics of Hadoop MapReduce. After all,
in-memory RDD's are not even a concept in MapReduce. And keep in mind that
MapReduce became so bloated/complex of a project that it is today no longer
possible to make substantial changes to it. That's something we definitely want
to avoid by keeping Spark internals as simple as possible.
was (Author: pwendell):
[[email protected]] I think the proposed change would benefit from a
design doc to explain exactly the cases we want to fix and what trade-offs we
are willing to make in terms of complexity.
Agreed that there is definitely room for improvement in the out-of-the-box
behavior here.
Right now the limits as I understand them are (a) the shuffle output from one
mapper to one reducer cannot be more than 2GB. (b) partitions of an RDD cannot
exceed 2GB.
I see (a) as the bigger of the two issues. It would be helpful to have specific
examples of workloads where this causes a problem and the associated data
sizes, etc. For instance, say I want to do a 1 Terabyte shuffle. Right now
number of (mappers * reducers) needs to be > ~1000 for this to work (e.g. 100
mappers and 10 reducers) assuming uniform partitioning. That doesn't seem too
crazy of an assumption, but if you have skew this would be a much bigger
problem.
Popping up a bit - I think our goal should be to handle reasonable workloads
and not to be 100% compliant with the semantics of Hadoop MapReduce. After all,
in-memory RDD's are not even a concept in MapReduce. And keep in mind that
MapReduce became so bloated/complex of a project that it is today no longer
possible to make substantial changes to it. That's something we definitely want
to avoid by keeping Spark internals as simple as possible.
> 2GB limit in spark for blocks
> -----------------------------
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Environment: all
> Reporter: Mridul Muralidharan
> Priority: Critical
> Fix For: 1.1.0
>
>
> The underlying abstraction for blocks in spark is a ByteBuffer : which limits
> the size of the block to 2GB.
> This has implication not just for managed blocks in use, but also for shuffle
> blocks (memory mapped blocks are limited to 2gig, even though the api allows
> for long), ser-deser via byte array backed outstreams (SPARK-1391), etc.
> This is a severe limitation for use of spark when used on non trivial
> datasets.
--
This message was sent by Atlassian JIRA
(v6.2#6252)