[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5523:
---------------------------------
    Description: 
 TaskMetrics and TaskInfo objects have the hostname associated with the task. 
As these are created (directly or through deserialization of RPC messages), 
each of them has a separate String object for the hostname, even though most of 
them hold the same string data. This results in thousands of duplicate String 
objects, increasing the memory requirement of the driver. 
This can be easily deduped when deserializing a TaskMetrics object, or when 
creating a TaskInfo object.

This affects streaming particularly badly due to the high rate of 
job/stage/task generation. 

For a solution, see how this dedup is done for StorageLevel. 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
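
A minimal sketch of the kind of dedup described above, assuming a 
concurrent map used as a cache of canonical instances. HostnameCache, 
canonical and TaskRecord are hypothetical names made up for illustration; 
they are not Spark classes or the actual fix. TaskRecord only stands in 
for any serializable class that carries a hostname.

```scala
import java.io.{IOException, ObjectInputStream}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper (not part of Spark): keeps one canonical String
// instance per distinct hostname value.
object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  /** Returns the canonical instance equal to `host`, registering `host` if it is new. */
  def canonical(host: String): String = {
    val previous = cache.putIfAbsent(host, host)
    if (previous == null) host else previous
  }
}

// Hypothetical stand-in for a class that carries a hostname.
class TaskRecord(var hostname: String) extends Serializable {
  // Java serialization hook: after the default field read, replace the
  // freshly deserialized String with the canonical instance.
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    hostname = HostnameCache.canonical(hostname)
  }
}
```

With this in place, two records deserialized with equal hostnames would 
share a single String instance. For objects created directly rather than 
deserialized, the same canonical call could be applied at construction time. 
A cache like this is never evicted, which is presumably acceptable only 
because the number of distinct hostnames in a cluster is small.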
 

  was:
 TaskMetrics and TaskInfo objects have the hostname associated with the task. 
As these are created (directly or through deserialization of RPC messages), 
each of them has a separate String object for the hostname, even though most of 
them hold the same string data. This results in thousands of duplicate String 
objects, increasing the memory requirement of the driver. 
This can be easily deduped when deserializing a TaskMetrics object, or when 
creating a TaskInfo object (in TaskSchedulerImpl).

This affects streaming particularly badly due to the high rate of 
job/stage/task generation. 

For a solution, see how this dedup is done for StorageLevel. 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
 


> TaskMetrics and TaskInfo have innumerable copies of the hostname string
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5523
>                 URL: https://issues.apache.org/jira/browse/SPARK-5523
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Streaming
>            Reporter: Tathagata Das
>
>  TaskMetrics and TaskInfo objects have the hostname associated with the task. 
> As these are created (directly or through deserialization of RPC messages), 
> each of them has a separate String object for the hostname, even though most 
> of them hold the same string data. This results in thousands of duplicate 
> String objects, increasing the memory requirement of the driver. 
> This can be easily deduped when deserializing a TaskMetrics object, or when 
> creating a TaskInfo object.
> This affects streaming particularly badly due to the high rate of 
> job/stage/task generation. 
> For a solution, see how this dedup is done for StorageLevel. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
