Yep, that makes sense. Thanks for the clarification!

Mingyu
On 3/4/15, 8:05 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

> Yeah, it will result in a second serialized copy of the array (costing
> some memory), but the computational overhead should be very small. The
> absolute worst case here will be a collect() or something similar that
> just bundles up the entire partition.
>
> - Patrick
>
> On Wed, Mar 4, 2015 at 5:47 PM, Mingyu Kim <m...@palantir.com> wrote:
>> The concern is really just the runtime overhead and memory footprint of
>> Java-serializing an already-serialized byte array again. We originally
>> noticed this when we were using RDD.toLocalIterator(), which serializes
>> the entire 64MB partition. We worked around it by kryo-serializing and
>> snappy-compressing the partition on the executor side before returning
>> it to the driver, but that step felt redundant.
>>
>> Your explanation about reporting the time taken makes it clearer why
>> it's designed this way. Since the byte array for the serialized task
>> result shouldn't account for the majority of the memory footprint
>> anyway, I'm okay with leaving it as is, then.
>>
>> Thanks,
>> Mingyu
>>
>> On 3/4/15, 5:07 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>>
>>> Hey Mingyu,
>>>
>>> I think it's broken out separately so we can record the time taken to
>>> serialize the result. Once we've serialized it once, the second
>>> serialization should be very cheap, since it's just wrapping something
>>> that has already been turned into a byte buffer. Do you see a specific
>>> issue with serializing it twice?
>>>
>>> I think you need two steps if you want to record the time taken to
>>> serialize the result, since that needs to be sent back to the driver
>>> when the task completes.
>>>
>>> - Patrick
>>>
>>> On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim <m...@palantir.com> wrote:
>>>> Hi all,
>>>>
>>>> It looks like the result of a task is serialized twice: once by the
>>>> serializer (i.e. Java or Kryo, depending on configuration) and once
>>>> again by the closure serializer (i.e. Java). To link the actual code:
>>>>
>>>> The first one:
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
>>>> The second one:
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>>>>
>>>> This serializes the "value", i.e. the result of the task run, twice,
>>>> which affects things like collect(), takeSample(), and
>>>> toLocalIterator(). Would it make sense to simply serialize the
>>>> DirectTaskResult once using the regular "serializer" (as opposed to
>>>> the closure serializer)? Would that cause problems when the
>>>> Accumulator values are not Kryo-serializable?
>>>>
>>>> Alternatively, if we can assume that Accumulator values are small, we
>>>> can closure-serialize those, put the resulting byte array in the
>>>> DirectTaskResult alongside the raw task result "value", and serialize
>>>> the DirectTaskResult just once.
>>>>
>>>> What do people think?
>>>>
>>>> Thanks,
>>>> Mingyu
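
To make the two passes concrete, here is a minimal, self-contained
sketch of the code path under discussion. This is not the actual
Executor.scala code: SimpleTaskResult, javaSerialize, and
serializeTaskResult are illustrative stand-ins for DirectTaskResult and
Spark's serializer plumbing, and plain Java serialization stands in for
both the configured result serializer and the closure serializer.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Stand-in for DirectTaskResult (the real one carries a ByteBuffer
    // plus accumulator updates and task metrics, and implements
    // Externalizable); Array[Byte] keeps this sketch Java-serializable.
    case class SimpleTaskResult(valueBytes: Array[Byte],
                                accumUpdates: Map[Long, Long])

    def javaSerialize(obj: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close()
      bos.toByteArray
    }

    def serializeTaskResult(value: AnyRef,
                            accums: Map[Long, Long]): Array[Byte] = {
      // Pass 1: serialize the task's value with the result serializer
      // (Java here; Spark picks Java or Kryo via spark.serializer) and
      // time it, so the duration can be reported with the task metrics.
      val start = System.nanoTime()
      val valueBytes = javaSerialize(value)
      val resultSerializationTimeNs = System.nanoTime() - start
      // (unused here; Spark records this in TaskMetrics)

      // Pass 2: wrap the bytes and serialize the wrapper with the
      // closure serializer (always Java). The large value is not
      // re-traversed as an object graph; its bytes are simply copied
      // into the outer stream, the "second serialized copy of the
      // array" from the thread.
      javaSerialize(SimpleTaskResult(valueBytes, accums))
    }

Calling serializeTaskResult(Array.fill(1 << 20)(0.toByte), Map.empty)
walks the value once in pass 1; pass 2 only re-copies the resulting
bytes, which is why the computational overhead is expected to be small
even though memory is paid twice.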
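
The executor-side workaround Mingyu describes (kryo-serialize, then
snappy-compress each partition before it is shipped to the driver) can
be sketched as follows. It assumes the com.esotericsoftware.kryo and
org.xerial.snappy libraries are on the classpath; compressPartition and
decompressPartition are hypothetical helper names, and Array[String] is
just a stand-in element type.

    import java.io.ByteArrayOutputStream
    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.xerial.snappy.Snappy

    // Executor side: turn one partition into a single compact byte array.
    def compressPartition(rows: Array[String]): Array[Byte] = {
      val kryo = new Kryo()
      kryo.setRegistrationRequired(false)
      val bos = new ByteArrayOutputStream()
      val out = new Output(bos)
      kryo.writeObject(out, rows)      // Kryo: compact binary encoding
      out.close()
      Snappy.compress(bos.toByteArray) // Snappy: cheap, fast compression
    }

    // Driver side: reverse the two steps.
    def decompressPartition(bytes: Array[Byte]): Array[String] = {
      val kryo = new Kryo()
      kryo.setRegistrationRequired(false)
      val in = new Input(Snappy.uncompress(bytes))
      kryo.readObject(in, classOf[Array[String]])
    }

A caller would apply this inside something like
rdd.mapPartitions(it => Iterator(compressPartition(it.toArray))) before
collect() or toLocalIterator(), and run decompressPartition on the
driver, so only one compact byte array per partition crosses the wire.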
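
Finally, a sketch of the alternative floated at the end of the thread,
under its stated assumption that accumulator values are small:
closure-serialize (Java) only the accumulator updates, keep the task
value as its already-serialized bytes, and make exactly one outer pass
with the regular serializer. ProposedTaskResult and buildResult are
hypothetical names; the outer serializer is passed in as a plain
function, which also sidesteps the question of accumulators that are
not Kryo-serializable.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Hypothetical single-pass result holder: the big value stays as the
    // bytes produced by the first (and only) value serialization, and
    // the accumulator updates travel as a small Java-serialized blob.
    case class ProposedTaskResult(valueBytes: Array[Byte],
                                  accumBytes: Array[Byte])

    def javaSerialize(obj: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close()
      bos.toByteArray
    }

    // outerSerialize stands in for whatever spark.serializer names
    // (Java or Kryo); accumulators never go through it, so they need
    // not be Kryo-serializable.
    def buildResult(valueBytes: Array[Byte],
                    accums: Map[Long, Long],
                    outerSerialize: AnyRef => Array[Byte]): Array[Byte] = {
      val accumBytes = javaSerialize(accums) // assumed small, so cheap
      outerSerialize(ProposedTaskResult(valueBytes, accumBytes))
    }

For example, buildResult(valueBytes, Map(1L -> 42L), javaSerialize)
exercises the sketch end to end; in Spark the outer function would be
whatever serializer the configuration selects, giving the
single-serialization variant Mingyu proposes.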