Reynold Xin created SPARK-17949:
-----------------------------------
Summary: Introduce a JVM object based aggregate operator
Key: SPARK-17949
URL: https://issues.apache.org/jira/browse/SPARK-17949
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Reynold Xin
The new Tungsten execution engine has very robust memory management and speed
for simple data types. It does, however, suffer from the following:
1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is
fairly expensive to fit into the Tungsten internal format.
2. For aggregate functions that require complex intermediate data structures,
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of
structs.
The idea here is to introduce an JVM object based hash aggregate operator that
can support the aforementioned use cases. This operator, however, should limit
its memory usage to avoid putting too much pressure on GC, e.g. falling back to
sort-based aggregate as soon the number of objects exceed a very low threshold.
Internally at Databricks we prototyped a version of this for a customer POC and
have observed substantial speed-ups over existing Spark.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]