[ https://issues.apache.org/jira/browse/S2GRAPH-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731175#comment-16731175 ]
ASF GitHub Bot commented on S2GRAPH-252:
----------------------------------------

GitHub user SteamShon opened a pull request:

    https://github.com/apache/incubator-s2graph/pull/195

    [S2GRAPH-252]: Improve performance of S2GraphSource

    - add SchemaManager.
    - add SerializeUtil/DeserializeUtil.
    - refactor S2GraphSink/S2GraphSource to use SerializeUtil/DeserializeUtil.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SteamShon/incubator-s2graph S2GRAPH-252

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-s2graph/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #195

----
commit 7b2fd3576a88c0ee5a1c83a39fe451960bebcab9
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-25T11:39:55Z

    add HFileParserUDF.

commit 23793d47f1102fc23f7a07f4b2ac53d45e45e0ef
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T02:00:30Z

    add LabelSchema.

commit b7e58f6dcee79c634126f7f9cf60caf42719832c
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T02:29:50Z

    change S2GraphSource to use DeserializeUtil directly on Result.

commit dfa76a9d0d0d5d932c7a2a9bcfa43c77485adaae
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-26T03:56:47Z

    add error handling.

commit 33388de1732beaea8375dee8db74a2c4f619603e
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-27T04:55:36Z

    directly deserialize cell.

commit 2942e42eb00e5f9fa19ecf23319d615df3a2e87a
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-29T00:47:41Z

    tmp.

commit 952cdf68480eb2e1c1a6292102555ea5bdee7d46
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-29T11:31:07Z

    add DeserializeSchema/SerializeSchema.

commit c750774202390f00a58aa279b7cfe2245248699f
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-30T05:59:47Z

    merge DeserializeSchema and SerializeSchema to SchemaManager.
commit 420c0bce3011fc89d4ad08a58166470681d308d7
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:09:23Z

    bug fix on wide/tall schema on Vertex/SerializeUtil.

commit 48b1594417f476256383811405bae8728f3e1780
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:23:52Z

    Need to pass right spark sql schema on createDataFrame.

commit bf5cfe61ba714cf5266183d48074a3a954c76536
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T03:35:41Z

    support vertex in S2GraphSource.

commit b8ecf23908ea10b264f76ebb257a5efe084a93d5
Author: DO YUNG YOON <steamshon@...>
Date:   2018-12-31T05:13:35Z

    refactor S2GraphSink bulkload to use SchemaManager to build RDD[KeyValue].

----

> Improve performance of S2GraphSource
> -------------------------------------
>
>                 Key: S2GRAPH-252
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-252
>             Project: S2Graph
>          Issue Type: Improvement
>          Components: s2jobs
>            Reporter: DOYUNG YOON
>            Assignee: DOYUNG YOON
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> S2GraphSource is responsible for translating an HBase
> snapshot (*TableSnapshotInputFormat*) into graph elements such as edges and
> vertices.
> The code below creates an *RDD[(ImmutableBytesWritable, Result)]* from
> *TableSnapshotInputFormat*:
> {noformat}
> val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
>         classOf[TableSnapshotInputFormat],
>         classOf[ImmutableBytesWritable],
>         classOf[Result])
> {noformat}
> The problem comes after obtaining the RDD.
> The current implementation uses *RDD.mapPartitions* because the S2Graph class
> is not serializable, mostly because it holds an Asynchbase client.
> The problematic part is the following.
> {noformat}
>     val elements = input.mapPartitions { iter =>
>       val s2 = S2GraphHelper.getS2Graph(config)
>       iter.flatMap { line =>
>         reader.read(s2)(line)
>       }
>     }
>     val kvs = elements.mapPartitions { iter =>
>       val s2 = S2GraphHelper.getS2Graph(config)
>       iter.map(writer.write(s2)(_))
>     }
> {noformat}
> On each RDD partition, an S2Graph instance connects to meta storage, such as
> MySQL, and uses a local cache to avoid heavy reads from meta storage.
> Although this works for datasets with few partitions, the scalability of
> S2GraphSource is limited by the number of partitions, which needs to be
> increased when dealing with large data.
> A possible improvement is to stop depending on meta storage when
> deserializing HBase's Result class into Edge/Vertex.
> We can achieve this by loading all necessary schemas from meta storage once
> on the Spark driver, then broadcasting these schemas and using them to
> deserialize, instead of connecting to meta storage on each partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
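
The broadcast approach proposed in the issue description could look roughly
like the sketch below. It is a minimal illustration under stated assumptions,
not the code from the pull request: LabelSchema here is a simplified stand-in
that only borrows the name, and RawCell, DecodedEdge, loadSchemasOnDriver and
decode are hypothetical names introduced for this example. Only
SparkContext.broadcast and RDD.mapPartitions are real Spark calls.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical, simplified stand-ins; NOT the S2Graph classes from the PR.
final case class LabelSchema(id: Int, name: String, columns: Seq[String])
final case class RawCell(labelId: Int, values: Seq[String])
final case class DecodedEdge(label: String, props: Map[String, String])

object BroadcastSchemaSketch {
  // Assumption: stands in for the single read of meta storage on the driver.
  def loadSchemasOnDriver(): Map[Int, LabelSchema] =
    Seq(LabelSchema(1, "friends", Seq("since", "score")))
      .map(s => s.id -> s)
      .toMap

  // Pure function: needs only the schema snapshot, no graph client and
  // no connection to meta storage. Unknown label ids yield None.
  def decode(schemas: Map[Int, LabelSchema], cell: RawCell): Option[DecodedEdge] =
    schemas.get(cell.labelId).map { s =>
      DecodedEdge(s.name, s.columns.zip(cell.values).toMap)
    }

  def decodeAll(spark: SparkSession, cells: RDD[RawCell]): RDD[DecodedEdge] = {
    // Load once on the driver, then ship the immutable snapshot to executors.
    val bc: Broadcast[Map[Int, LabelSchema]] =
      spark.sparkContext.broadcast(loadSchemasOnDriver())

    cells.mapPartitions { iter =>
      // Each partition reads the broadcast value instead of opening a
      // connection to meta storage.
      val schemas = bc.value
      iter.flatMap(cell => decode(schemas, cell).iterator)
    }
  }
}
```

Because the closure passed to mapPartitions captures only the broadcast
handle, nothing non-serializable has to be created per partition, and the
number of meta storage reads no longer grows with the partition count.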