Marko A. Rodriguez created TINKERPOP-1108:
------------------------------------------
             Summary: Produce two RDDs from executeVertexProgram in SparkGraphComputer
                 Key: TINKERPOP-1108
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
             Project: TinkerPop
          Issue Type: Improvement
          Components: hadoop
    Affects Versions: 3.1.1-incubating
            Reporter: Marko A. Rodriguez


I have done a lot to optimize our implementation of {{SparkGraphComputer}}. I now know the reason for every shuffle, input, and spill that happens during a job. There is one more optimization that MAY or MAY NOT work, but it is worth trying: if it does what I think it will, we may get a (perhaps) 2x improvement.

We currently do:

{code}
graphRDD -> viewOutgoingMessagesRDD
{code}

We should do:

{code}
graphRDD --> viewRDD
         --> outgoingMessageRDD
{code}

The {{viewRDD}} will have the same partitioner as the {{graphRDD}}, so a local join is all that is required. The {{outgoingMessageRDD}} will not be partitioned, so its join will cause a shuffle. Thus, after this block, we do:

{code}
graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
{code}
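For illustration, here is a minimal, self-contained sketch of the intended join pattern. It is NOT the actual {{SparkGraphComputer}} code: plain {{String}} payloads, toy data, and a {{HashPartitioner}} stand in for the real vertex/view/message writables. The point it demonstrates is that {{mapValues}} preserves the partitioner, so the {{graphRDD.join(viewRDD)}} is a narrow (local) dependency and only the final join with the un-partitioned message RDD produces a shuffle.

{code}
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public final class TwoRddSketch {

    public static void main(final String[] args) {
        final JavaSparkContext sc = new JavaSparkContext("local[*]", "two-rdd-sketch");
        final HashPartitioner partitioner = new HashPartitioner(4);

        // Stand-in for the graphRDD: vertexId -> vertex payload (plain Strings instead of the real writables).
        final JavaPairRDD<Long, String> graphRDD = sc
                .parallelizePairs(Arrays.asList(new Tuple2<>(1L, "v1"), new Tuple2<>(2L, "v2")))
                .partitionBy(partitioner)
                .cache();

        // The viewRDD is derived from the graphRDD; mapValues preserves the partitioner,
        // so both RDDs are co-partitioned.
        final JavaPairRDD<Long, String> viewRDD = graphRDD
                .mapValues(v -> v + "-view");

        // The outgoingMessageRDD is NOT co-partitioned with the graphRDD.
        final JavaPairRDD<Long, String> outgoingMessageRDD = sc
                .parallelizePairs(Arrays.asList(new Tuple2<>(1L, "msg-for-1"), new Tuple2<>(2L, "msg-for-2")));

        final JavaPairRDD<Long, Tuple2<String, String>> nextGraphRDD = graphRDD
                .join(viewRDD)                                  // co-partitioned: narrow dependency, no shuffle
                .mapValues(pair -> pair._1() + "+" + pair._2()) // "...attach the view..."
                .join(outgoingMessageRDD);                      // the single shuffle of the round

        nextGraphRDD.collect().forEach(System.out::println);
        sc.stop();
    }
}
{code}

Running this locally and checking the Spark UI should show only one shuffle dependency (for the message join), which is the behavior the proposed split is meant to achieve.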