[ 
https://issues.apache.org/jira/browse/SPARK-5883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-5883:
------------------------------------
    Description: 
The size of the data shipped between vertex partitions and edge partitions
is one of the major factors limiting performance.
SPARK-3649 reported a ~10% performance gain in Pregel iterations
from using custom serializers for ShuffledRDD.

However, it is difficult to implement efficient serializers for ShuffledRDD
inside GraphX because 1) the way ShuffledRDD uses serializers differs
between SortShuffleManager and HashShuffleManager (see SPARK-3649),
and 2) the type of 'VD' is unknown to GraphX.

Therefore, I think that compressing the shipped data inside GraphX
(before it is passed into ShuffledRDD) is one of the better solutions.
GraphX users register a user-defined serializer for VD, and
GraphX then uses that serializer to compress the data shipped between
vertex partitions and edge partitions.
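
To make the proposal concrete, here is a minimal, self-contained sketch of the
idea. It is not the code in the patch and it does not use any GraphX API. The
names VertexAttrSerializer, CompressedBlock, VertexBlockCodec and
DoubleSerializer are hypothetical, and GZIP is only an assumed stand-in for
whatever compression scheme would actually be chosen.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical hook a GraphX user would register for the vertex attribute type VD.
trait VertexAttrSerializer[VD] {
  def write(out: DataOutputStream, attr: VD): Unit
  def read(in: DataInputStream): VD
}

// Hypothetical compressed form of one block of (vertex id, attribute) pairs.
final case class CompressedBlock(numVertices: Int, bytes: Array[Byte])

object VertexBlockCodec {
  // Serialize every (id, attribute) pair with the user's serializer, then gzip the result.
  def compress[VD](vids: Array[Long], attrs: Array[VD],
                   ser: VertexAttrSerializer[VD]): CompressedBlock = {
    require(vids.length == attrs.length, "one attribute per vertex id")
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(new GZIPOutputStream(buffer))
    var i = 0
    while (i < vids.length) {
      out.writeLong(vids(i))
      ser.write(out, attrs(i))
      i += 1
    }
    out.close() // finishes the gzip stream so the buffer holds a complete payload
    CompressedBlock(vids.length, buffer.toByteArray)
  }

  // Reverse of compress: gunzip and read back the pairs in their original order.
  def decompress[VD](block: CompressedBlock,
                     ser: VertexAttrSerializer[VD]): Vector[(Long, VD)] = {
    val in = new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(block.bytes)))
    try Vector.fill(block.numVertices)((in.readLong(), ser.read(in)))
    finally in.close()
  }
}

// Example user-defined serializer for VD = Double (e.g. PageRank-style values).
object DoubleSerializer extends VertexAttrSerializer[Double] {
  def write(out: DataOutputStream, attr: Double): Unit = out.writeDouble(attr)
  def read(in: DataInputStream): Double = in.readDouble()
}

// Round trip on illustrative data: 1000 vertices that all carry the same value.
object CompressionSketch {
  val vids: Array[Long] = Array.tabulate(1000)(_.toLong)
  val ranks: Array[Double] = Array.fill(1000)(0.15)
  val block: CompressedBlock = VertexBlockCodec.compress(vids, ranks, DoubleSerializer)
  val restored: Vector[(Long, Double)] = VertexBlockCodec.decompress(block, DoubleSerializer)
}
```

Because the compression happens before the block reaches the shuffle layer,
this approach would not depend on which shuffle manager is in use, which is
the point of doing it inside GraphX.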

My current patch applies this idea in ReplicatedVertexView#upgrade
and ReplicatedVertexView#updateVertices.
https://github.com/maropu/spark/commit/665b6c4a273b90e7c6e1545f982c7576a0e5ceb2

The same idea can also be applied to ReplicatedVertexView#withActiveSet
and VertexRDDImpl#aggregateUsingIndex.

I'm not sure whether this design is acceptable, so any advice is welcome.

> Add compression scheme in VertexAttributeBlock for shipping vertices to edge 
> partitions
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5883
>                 URL: https://issues.apache.org/jira/browse/SPARK-5883
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>            Reporter: Takeshi Yamamuro


