DOYUNG YOON created S2GRAPH-252:

             Summary: Improve performance of S2GraphSource 
                 Key: S2GRAPH-252
             Project: S2Graph
          Issue Type: Improvement
          Components: s2jobs
            Reporter: DOYUNG YOON

S2GraphSource is responsible for translating an HBase 
snapshot (`TableSnapshotInputFormat`) into graph elements such as edges/vertices.

The code below creates an `RDD[(ImmutableBytesWritable, Result)]` from the snapshot:

val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
  classOf[TableSnapshotInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

The problem comes after obtaining the RDD.

The current implementation uses `RDD.mapPartitions` because the S2Graph class is 
not serializable, mostly because it holds an Asynchbase client.

The problematic parts are the following two excerpts.

val elements = input.mapPartitions { iter =>
  val s2 = S2GraphHelper.getS2Graph(config)
  iter.flatMap { line =>

val kvs = elements.mapPartitions { iter =>
  val s2 = S2GraphHelper.getS2Graph(config)
On each RDD partition, the S2Graph instance connects to the meta storage (such as 
MySQL) and uses a local cache to avoid heavy reads from the meta storage.
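As a rough pure-Scala sketch (all names here are hypothetical stand-ins, not the real S2Graph API), the per-partition pattern above means the number of meta-storage connections grows linearly with the partition count, and each partition warms its own cache from scratch:

```scala
import scala.collection.mutable

// Hypothetical stand-in for an S2Graph instance: constructing it simulates
// opening one meta-storage connection, and it keeps a partition-local cache.
class PartitionLocalGraph {
  PartitionLocalGraph.connections += 1 // "connect" to meta storage
  private val cache = mutable.Map.empty[Int, String]
  // A cache miss here would be a read against the meta storage.
  def lookupLabel(id: Int): String = cache.getOrElseUpdate(id, s"label-$id")
}
object PartitionLocalGraph { var connections = 0 }

// Simulate mapPartitions over N partitions: N connections are opened,
// no matter how small each partition's data is.
def run(numPartitions: Int): Int = {
  (1 to numPartitions).foreach { _ =>
    val s2 = new PartitionLocalGraph // one connection per partition
    s2.lookupLabel(1)
    s2.lookupLabel(1) // second lookup hits only this partition's cache
  }
  PartitionLocalGraph.connections
}
```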

Even though this works for a dataset with a small number of partitions, the 
scalability of S2GraphSource is limited by the number of partitions, which needs 
to grow when dealing with large data, and each partition opens its own 
meta-storage connection.

A possible improvement is to avoid depending on the meta storage when 
deserializing HBase's Result class into Edge/Vertex.

We can achieve this simply by loading all necessary schemas from the meta 
storage on the Spark driver, then broadcasting these schemas and using them for 
deserialization instead of connecting to the meta storage on each partition.
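A minimal sketch of that approach, assuming a hypothetical serializable `SchemaSnapshot` type (the real job would load the actual label/service/column schemas): load the schemas once on the driver, broadcast the snapshot, and deserialize against the broadcast value inside each partition. The Spark wiring is shown as comments so the sketch stays self-contained:

```scala
// Hypothetical serializable snapshot of the schemas needed for deserialization.
case class LabelSchema(id: Int, name: String)
case class SchemaSnapshot(labels: Map[Int, LabelSchema]) extends Serializable

object SchemaSnapshot {
  // In the real job this would query the meta storage (e.g. MySQL) exactly
  // once, on the Spark driver.
  def load(): SchemaSnapshot =
    SchemaSnapshot(Map(1 -> LabelSchema(1, "friend")))
}

// Driver-side wiring (Spark calls as comments to keep this sketch runnable):
//   val bcSchema = ss.sparkContext.broadcast(SchemaSnapshot.load())
//   val elements = input.mapPartitions { iter =>
//     val schema = bcSchema.value // no per-partition meta-storage connection
//     iter.flatMap(result => deserialize(schema, result))
//   }

// Hypothetical deserialization helper that depends only on the broadcast
// snapshot, never on a live meta-storage connection.
def labelNameOf(schema: SchemaSnapshot, labelId: Int): Option[String] =
  schema.labels.get(labelId).map(_.name)
```

Because the snapshot is an immutable, serializable value, Spark ships it to each executor once, and deserialization inside `mapPartitions` becomes a pure lookup.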

This message was sent by Atlassian JIRA
