[
https://issues.apache.org/jira/browse/S2GRAPH-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
DOYUNG YOON reassigned S2GRAPH-252:
-----------------------------------
Assignee: DOYUNG YOON
> Improve performance of S2GraphSource
> -------------------------------------
>
> Key: S2GRAPH-252
> URL: https://issues.apache.org/jira/browse/S2GRAPH-252
> Project: S2Graph
> Issue Type: Improvement
> Components: s2jobs
> Reporter: DOYUNG YOON
> Assignee: DOYUNG YOON
> Priority: Major
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> S2GraphSource is responsible to translate HBASE
> snapshot(`TableSnapshotInputFormat`) to graph element such as edge/vertex.
> below code create `RDD[(ImmutableBytesWritable, Result)]` from
> TableSnapshotInputFormat
> {noformat}
> val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
> classOf[TableSnapshotInputFormat],
> classOf[ImmutableBytesWritable],
> classOf[Result])
> {noformat}
> The problem comes after obtaining RDD.
> Current implementation use `RDD.mapPartitions` because S2Graph class is not
> serializable, mostly because it has Asynchbase client in it.
> The problematic part is the following.
> {noformat}
> val elements = input.mapPartitions { iter =>
> val s2 = S2GraphHelper.getS2Graph(config)
> iter.flatMap { line =>
> reader.read(s2)(line)
> }
> }
> val kvs = elements.mapPartitions { iter =>
> val s2 = S2GraphHelper.getS2Graph(config)
> iter.map(writer.write(s2)(_))
> }
> {noformat}
> On each RDD partition, S2Graph instance connect meta storage, such as mysql,
> and use the local cache to avoid heavy read from meta storage.
> Even though it works with a dataset with the small partition, the scalability
> of S2GraphSource limited by the number of partitions, which need to be
> increased when dealing with large data.
> Possible improvement can be achieved by not depending on meta storage when it
> deserializes HBase's Result class into Edge/Vertex.
> We can simply achieve this by loading all necessary schemas from meta storage
> on spark driver, then broadcast these schemas and use them to deserialize
> instead of connecting meta storage on each partition.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)