[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7075:
-------------------------------
    Description: 
Based on our observations, the majority of Spark workloads are bottlenecked 
not by I/O or network, but by CPU and memory. This project focuses on three 
areas to improve the memory and CPU efficiency of Spark applications, pushing 
performance closer to the limits of the underlying hardware.

*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (storing them in binary format instead), 
which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means 
we spill less.
- Better memory accounting (size in bytes) rather than relying on heuristics.
- For operators that understand data types (in the case of DataFrames and SQL), 
working directly against the binary format in memory, i.e. with no 
serialization/deserialization.
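
As a rough illustration of the points above, here is a minimal Python sketch. 
It is not Tungsten's actual row format: the two-field layout (int64 id, float64 
score) and the names ROW, pack_rows and sum_scores are assumptions made up for 
this example. It keeps all rows in one contiguous buffer, so the memory used is 
known exactly in bytes, and it aggregates one field by reading it at a fixed 
offset without rebuilding a record object per row.

```python
import struct

# Hypothetical fixed-width row layout: int64 id, float64 score.
# "<" means little-endian with no padding, so each row is exactly 16 bytes.
ROW = struct.Struct("<qd")
SCORE = struct.Struct("<d")
SCORE_OFFSET = 8  # the score field starts right after the 8-byte id


def pack_rows(rows):
    """Store all rows in one contiguous buffer instead of one object per row."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return buf


def sum_scores(buf):
    """Sum the score field by reading it in place at its known offset."""
    total = 0.0
    for row_start in range(0, len(buf), ROW.size):
        total += SCORE.unpack_from(buf, row_start + SCORE_OFFSET)[0]
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
exact_bytes = len(buf)      # exact accounting: 3 rows * 16 bytes = 48
total = sum_scores(buf)     # 4.0
```

The only long-lived allocation is the buffer itself; the values read inside 
sum_scores are transient, which is the property the first bullet is after.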

*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle
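
The issue does not say which techniques are meant here. One common 
cache-friendly approach, sketched below as an assumption, is to sort small 
(key prefix, record index) pairs and to look at the full record only when two 
prefixes tie, so that most comparisons never follow a pointer. The names 
key_prefix and sort_by_prefix are hypothetical, and Python will not show the 
cache effect itself; the sketch only shows the comparison logic.

```python
from functools import cmp_to_key


def key_prefix(key: bytes) -> int:
    """First 8 bytes of the key as an unsigned big-endian int, zero-padded."""
    return int.from_bytes(key[:8].ljust(8, b"\x00"), "big")


def sort_by_prefix(keys):
    """Return the indices of keys in ascending byte order."""
    # Each entry is small and self-contained: (prefix, index of the record).
    entries = [(key_prefix(k), i) for i, k in enumerate(keys)]

    def compare(a, b):
        # Fast path: the prefixes alone decide the order.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Slow path: prefixes tie, so fetch and compare the full keys.
        ka, kb = keys[a[1]], keys[b[1]]
        return (ka > kb) - (ka < kb)

    entries.sort(key=cmp_to_key(compare))
    return [i for _, i in entries]


keys = [b"pear", b"apple", b"applesauce!", b"applesauce", b"fig"]
order = sort_by_prefix(keys)  # [1, 3, 2, 4, 0]
```

Only the two "applesauce" keys share an 8-byte prefix, so they are the only 
pair that needs the full-key comparison.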

*Code Generation*
- Faster expression evaluation and DataFrame/SQL operators
- Faster serializer
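
To show what code generation buys for expression evaluation, here is a small 
Python sketch under stated assumptions: the tuple-based expression tree and the 
names interpret, to_source and compile_expr are invented for this example and 
are not Spark APIs. The interpreter walks the tree for every row, while the 
generated function is built once per expression and then evaluates each row 
with no tree traversal.

```python
import operator

# Hypothetical expression tree: ("col", name), ("lit", value),
# or (op, left, right) with op in OPS.
OPS = {"add": ("+", operator.add), "mul": ("*", operator.mul)}


def interpret(expr, row):
    """Evaluate by walking the tree; this work is repeated for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    _, fn = OPS[kind]
    return fn(interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Emit Python source text for the expression."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    symbol, _ = OPS[kind]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate a function once; calling it involves no tree traversal."""
    source = f"def evaluate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["evaluate"]


expr = ("mul", ("add", ("col", "a"), ("col", "b")), ("lit", 2))
evaluate = compile_expr(expr)
row = {"a": 3, "b": 4}
result = evaluate(row)  # 14, same as interpret(expr, row)
```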


Several parts of Project Tungsten build on the DataFrame model, which gives us 
more semantic information about the application. We will also retrofit the 
improvements onto Spark’s RDD API whenever possible.


  was:
Based on our observations, the majority of Spark workloads are bottlenecked 
not by I/O or network, but by CPU and memory. This project focuses on three 
areas to improve the memory and CPU efficiency of Spark applications, pushing 
performance closer to the limits of the underlying hardware.

*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (storing them in binary format instead), 
which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means 
we spill less.
- Better memory accounting (size in bytes) rather than relying on heuristics.
- For operators that understand data types (in the case of DataFrames and SQL), 
working directly against the binary format in memory, i.e. with no 
serialization/deserialization.

*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle

*Code Generation*
- Faster expression evaluation and SQL operators
- Faster serializer


Several parts of Project Tungsten build on the DataFrame model, which gives us 
more semantic information about the application. We will also retrofit the 
improvements onto Spark’s RDD API whenever possible.



> Project Tungsten: Improving Physical Execution and Memory Management
> --------------------------------------------------------------------
>
>                 Key: SPARK-7075
>                 URL: https://issues.apache.org/jira/browse/SPARK-7075
>             Project: Spark
>          Issue Type: Epic
>          Components: Block Manager, Shuffle, Spark Core, SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin


