[
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383280#comment-14383280
]
Sandy Ryza commented on SPARK-4550:
-----------------------------------
Java serialization appears to write out the full class name the first time an
object is written and then refer to it by an identifier afterwards:
{code}
scala> val baos = new ByteArrayOutputStream()
scala> val oos = new ObjectOutputStream(baos)
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res8: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: x
scala> baos.toByteArray.length
res9: Int = 46
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res14: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dx
scala> baos.toByteArray.length
res13: Int = 63
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res17: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6:
xsq?~??w????LY6�Dxsq?~??w????LY8?�x
scala> baos.toByteArray.length
res18: Int = 80
{code}
There might be some fancy way to listen for the class name being written out
and relocate that segment to the front of the stream. However, this seems
fairly and involved and bug-prone; my opinion is that isn't not worth it given
that Java ser is already a severely performance-impaired option. Another
option of course would be to write the class name in front of every record, but
this would bloat the serialized representation considerably.
> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
> Key: SPARK-4550
> URL: https://issues.apache.org/jira/browse/SPARK-4550
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Priority: Critical
> Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that
> it ends up storing many more java objects in memory. If Spark could store
> map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are
> independent from each other and occupy contiguous segments of memory. E.g.
> when Kryo reference tracking is left on, objects may contain pointers to
> objects farther back in the stream, which means that the sort can't relocate
> objects without corrupting them.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]