Leonidas Fegaras created MRQL-98:
------------------------------------

             Summary: Improve Data Serialization in Spark Evaluation
                 Key: MRQL-98
                 URL: https://issues.apache.org/jira/browse/MRQL-98
             Project: MRQL
          Issue Type: Improvement
          Components: Run-Time/Spark
    Affects Versions: 0.9.8
            Reporter: Leonidas Fegaras
            Assignee: Leonidas Fegaras
            Priority: Critical

MRQL data (MRData) are serialized as Writable (for Hadoop Map-Reduce), Java Serializable (for Spark), and CopyableValue (for Flink). Until now, the Spark MRQL engine used a wrapper for MRData (called MRContainer) to serialize data using the Writable methods. Some data used in Spark mode, though, were left unwrapped, so Spark fell back to the default Java serialization for them, which was inefficient. With this patch, MRData becomes Serializable with custom serialization methods that are very efficient. In my performance evaluation, the PageRank query over 10 million links, run on a cluster with 16 cores, showed a 38% improvement over the old Spark evaluation.
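
For readers unfamiliar with the technique, here is a minimal sketch of what "Serializable with custom serialization methods" can look like in Java. It is not the code from the patch: LinkRecord and SerializationDemo are hypothetical names made up for illustration, and the two-long layout is an assumption. The idea shown is only that a class can keep its compact Writable-style write/readFields methods and have Java serialization call them through the private writeObject/readObject hooks, so no wrapper object is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a data value; this is NOT MRQL's MRData class. */
final class LinkRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: skipped by the default field-by-field serialization,
    // because the hooks below encode these fields explicitly.
    private transient long source;
    private transient long target;

    LinkRecord(long source, long target) {
        this.source = source;
        this.target = target;
    }

    long source() { return source; }
    long target() { return target; }

    /** Writable-style compact encoding: two raw longs. */
    void write(DataOutput out) throws IOException {
        out.writeLong(source);
        out.writeLong(target);
    }

    /** Writable-style decoding, mirroring write(). */
    void readFields(DataInput in) throws IOException {
        source = in.readLong();
        target = in.readLong();
    }

    // Hook invoked by Java serialization; ObjectOutputStream is a DataOutput,
    // so the compact encoder can be reused directly.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // no non-transient fields, kept for spec compliance
        write(out);
    }

    // Hook invoked by Java deserialization; ObjectInputStream is a DataInput.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        readFields(in);
    }
}

/** Hypothetical helper that round-trips a value through Java serialization. */
final class SerializationDemo {
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    static Object fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LinkRecord copy = (LinkRecord) fromBytes(toBytes(new LinkRecord(1L, 2L)));
        System.out.println(copy.source() + " -> " + copy.target()); // prints "1 -> 2"
    }
}
```

Because the hooks live on the value class itself, any instance that reaches Spark unwrapped still goes through the compact encoding instead of the reflective default.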