GitHub user yhuai opened a pull request:
https://github.com/apache/spark/pull/1504
[SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java
collections to Scala collections JsonRDD.scala
In JsonRDD.scalafy, we use toMap/toList to convert a Java Map/List to a
Scala one. Both operations are expensive because they read every element
from the Java Map/List and copy it into a new Scala Map/List. We can
instead use Scala wrappers around those Java collections and avoid the copy.
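The difference between copying and wrapping can be sketched as below. This
is only an illustration, not the code in this patch: the object and method
names (`WrapVsCopy`, `copied`, `wrapped`) are made up for the example, and it
assumes the standard `scala.collection.JavaConverters` decorators (on Scala
2.13+ the equivalent import is `scala.jdk.CollectionConverters._`).

```scala
import java.util.{HashMap => JHashMap}
import scala.collection.JavaConverters._

// Hypothetical helper object, for illustration only.
object WrapVsCopy {
  // Copying: asScala wraps the Java map, then toMap reads every entry
  // and builds a new immutable Scala Map.
  def copied(m: java.util.Map[String, Any]): Map[String, Any] =
    m.asScala.toMap

  // Wrapping: asScala alone returns a thin view backed by the same Java
  // map; no elements are copied.
  def wrapped(m: java.util.Map[String, Any]): scala.collection.Map[String, Any] =
    m.asScala

  def main(args: Array[String]): Unit = {
    val jmap = new JHashMap[String, Any]()
    jmap.put("a", 1)

    val w = wrapped(jmap)
    val c = copied(jmap)

    // A later insert into the Java map is visible through the wrapper,
    // which shows it shares storage, but not through the copy.
    jmap.put("b", 2)
    println(s"wrapper size = ${w.size}, copy size = ${c.size}")
  }
}
```

The same applies to `java.util.List`: `asScala` gives a wrapper, while
`asScala.toList` copies each element into a new Scala List.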
I ran a quick performance test. I had a 2.9GB cached RDD[String] storing
one JSON object per record and called `sqlContext.jsonRDD(jsonData)` on it.
The schema inference job had one stage with 48 tasks, and these tasks were
executed sequentially.
```
Original:
Run 1: 1.5 min
Run 2: 1.4 min
Run 3: 1.5 min
With this change:
Run 1: 1.2 min
Run 2: 1.2 min
Run 3: 1.2 min
```
JIRA: https://issues.apache.org/jira/browse/SPARK-2603
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yhuai/spark removeToMapToList
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1504.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1504
----
commit d1abdb8dc8dc2c8d9abedd5e2f8eff5f1f754e2b
Author: Yin Huai <[email protected]>
Date: 2014-07-21T00:29:32Z
Remove unnecessary toMap and toList.
commit 09b9bca2773c259a8d660a302252b994f1bf821e
Author: Yin Huai <[email protected]>
Date: 2014-07-21T01:01:54Z
Merge remote-tracking branch 'upstream/master' into removeToMapToList
----