ASF GitHub Bot commented on S2GRAPH-252:

GitHub user SteamShon opened a pull request:


    [S2GRAPH-252]: Improve performance of S2GraphSource

    - add SchemaManager.
    - add SerializeUtil/DeserializeUtil.
    - refactor S2GraphSink/S2GraphSource to use SerializeUtil/DeserializeUtil.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SteamShon/incubator-s2graph S2GRAPH-252

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #195

commit 7b2fd3576a88c0ee5a1c83a39fe451960bebcab9
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-25T11:39:55Z

    add HFileParserUDF.

commit 23793d47f1102fc23f7a07f4b2ac53d45e45e0ef
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T02:00:30Z

    add LabelSchema.

commit b7e58f6dcee79c634126f7f9cf60caf42719832c
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T02:29:50Z

    change S2GraphSource to use DeserializeUtil directly on Result.

commit dfa76a9d0d0d5d932c7a2a9bcfa43c77485adaae
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T03:56:47Z

    add error handling.

commit 33388de1732beaea8375dee8db74a2c4f619603e
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-27T04:55:36Z

    directly deserialize cell.

commit 2942e42eb00e5f9fa19ecf23319d615df3a2e87a
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-29T00:47:41Z


commit 952cdf68480eb2e1c1a6292102555ea5bdee7d46
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-29T11:31:07Z

    add DeserializeSchema/SerializeSchema.

commit c750774202390f00a58aa279b7cfe2245248699f
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-30T05:59:47Z

    merge DeserializeSchema and SerializeSchema to SchemaManager.

commit 420c0bce3011fc89d4ad08a58166470681d308d7
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:09:23Z

    bug fix on wide/tall schema on Vertex/SerializeUtil.

commit 48b1594417f476256383811405bae8728f3e1780
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:23:52Z

    Need to pass right spark sql schema on createDataFrame.

commit bf5cfe61ba714cf5266183d48074a3a954c76536
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:35:41Z

    support vertex in S2GraphSource.

commit b8ecf23908ea10b264f76ebb257a5efe084a93d5
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T05:13:35Z

    refactor S2GraphSink bulkload to use SchemaManager to build RDD[KeyValue].


> Improve performance of S2GraphSource 
> -------------------------------------
>                 Key: S2GRAPH-252
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-252
>             Project: S2Graph
>          Issue Type: Improvement
>          Components: s2jobs
>            Reporter: DOYUNG YOON
>            Assignee: DOYUNG YOON
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> S2GraphSource is responsible for translating an HBase 
> snapshot (*TableSnapshotInputFormat*) into graph elements such as edges and vertices.
> The code below creates an *RDD[(ImmutableBytesWritable, Result)]* from 
> *TableSnapshotInputFormat*:
> {noformat}
> val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
>         classOf[TableSnapshotInputFormat],
>         classOf[ImmutableBytesWritable],
>         classOf[Result])
> {noformat}
> The problem comes after obtaining the RDD. 
> The current implementation uses *RDD.mapPartitions* because the S2Graph class is not 
> serializable, mostly because it holds an AsyncHBase client.
> The problematic part is the following:
> {noformat}
> val elements = input.mapPartitions { iter =>
>   val s2 = S2GraphHelper.getS2Graph(config)
>   iter.flatMap { line =>
>     reader.read(s2)(line)
>   }
> }
> val kvs = elements.mapPartitions { iter =>
>   val s2 = S2GraphHelper.getS2Graph(config)
>   iter.map(writer.write(s2)(_))
> }
> {noformat}
> On each RDD partition, an S2Graph instance connects to the meta storage, such as MySQL, 
> and uses a local cache to avoid heavy reads from the meta storage.
> Although this works for a dataset with a small number of partitions, the scalability 
> of S2GraphSource is limited by the number of partitions, which needs to be 
> increased when dealing with large data.
> A possible improvement is to not depend on the meta storage when 
> deserializing HBase's Result class into Edge/Vertex. 
> We can achieve this by loading all necessary schemas from the meta storage 
> on the Spark driver, then broadcasting these schemas and using them to deserialize 
> instead of connecting to the meta storage on each partition.
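
The approach described in the issue (load the schemas once on the driver, broadcast them, and deserialize without touching meta storage) could look roughly like the sketch below. It is not the code from this pull request: `LabelSchemaSnapshot`, `decode`, `run`, and the simplified row type are hypothetical names chosen for illustration. Only `SparkContext.broadcast` and `Broadcast.value` are actual Spark APIs.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical, serializable snapshot of one label's schema.
// In practice this would be read from meta storage once, on the driver.
final case class LabelSchemaSnapshot(
    labelId: Int,
    labelName: String,
    columnNames: Map[Int, String] // property sequence -> property name
)

object BroadcastSchemaSketch {

  // Pure function: needs no S2Graph instance and no meta-storage connection.
  // Returns None when the label is unknown; unknown property ids are dropped.
  def decode(
      schemas: Map[Int, LabelSchemaSnapshot],
      labelId: Int,
      values: Map[Int, String]
  ): Option[Map[String, String]] =
    schemas.get(labelId).map { schema =>
      values.collect {
        case (seq, v) if schema.columnNames.contains(seq) =>
          schema.columnNames(seq) -> v
      }
    }

  // Assumed, simplified row type: (labelId, raw property values).
  def run(
      spark: SparkSession,
      schemas: Map[Int, LabelSchemaSnapshot],
      rows: RDD[(Int, Map[Int, String])]
  ): RDD[Map[String, String]] = {
    // Ship the schemas to the executors once instead of connecting per partition.
    val bc: Broadcast[Map[Int, LabelSchemaSnapshot]] =
      spark.sparkContext.broadcast(schemas)

    rows.flatMap { case (labelId, values) =>
      decode(bc.value, labelId, values).toList
    }
  }
}
```

Because the broadcast value is plain serializable data, the closure no longer needs the non-serializable S2Graph instance, so the number of meta-storage connections does not grow with the number of partitions.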

This message was sent by Atlassian JIRA
