GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1504

    [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java 
collections to Scala collections JsonRDD.scala

    In JsonRDD.scalafy, we use toMap/toList to convert a Java Map/List to a 
Scala one. Both operations are expensive because they copy every element of 
the Java Map/List into a new Scala Map/List. We can instead wrap the Java 
collections in Scala wrappers, which avoids the copy.
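    
    The difference between copying and wrapping can be sketched as follows. 
This is a minimal illustration, not the code in this patch: the object and 
method names are hypothetical, and it assumes the standard library's 
`scala.collection.JavaConverters` (`.asScala`) as the wrapping mechanism.
    
    ```scala
    import java.util.{ArrayList => JArrayList}
    import scala.collection.JavaConverters._

    // Hypothetical names, for illustration only.
    object WrapVsCopy {
      // Copies every element into a new immutable Scala List.
      def copyList[A](jList: java.util.List[A]): List[A] =
        jList.asScala.toList

      // Wraps the Java list in a Scala view; no elements are copied.
      def wrapList[A](jList: java.util.List[A]): scala.collection.Seq[A] =
        jList.asScala

      // Wraps the Java map in a Scala view; no entries are copied.
      def wrapMap[K, V](jMap: java.util.Map[K, V]): scala.collection.Map[K, V] =
        jMap.asScala

      def main(args: Array[String]): Unit = {
        val jList = new JArrayList[String]()
        jList.add("a")
        jList.add("b")

        val copied = copyList(jList)
        val wrapped = wrapList(jList)

        // A later change to the Java list shows which one is a view.
        jList.add("c")
        println(copied.size)  // 2: the copy is a separate collection
        println(wrapped.size) // 3: the wrapper reads the underlying list
      }
    }
    ```
    
    Because a wrapper delegates to the underlying Java collection, it is only 
safe when that collection is not modified afterwards.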
    
    I ran a quick test to measure the effect. I had a 2.9GB cached RDD[String] 
storing one JSON object per record and called `sqlContext.jsonRDD(jsonData)` 
on it. The schema inference job had one stage with 48 tasks, which were 
executed sequentially.
    
    ```
    Original:
    Run 1: 1.5 min
    Run 2: 1.4 min
    Run 3: 1.5 min
    
    With this change:
    Run 1: 1.2 min
    Run 2: 1.2 min
    Run 3: 1.2 min
    ```
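    
    A measurement like the one above could be taken with a small timing 
helper. This is only a sketch: `Timing.timed` is a hypothetical helper, not 
part of Spark or of this patch, and it measures wall-clock time for a single 
run.
    
    ```scala
    // Hypothetical helper, for illustration only.
    object Timing {
      // Runs `body` once and returns its result with the elapsed seconds.
      def timed[A](body: => A): (A, Double) = {
        val start = System.nanoTime()
        val result = body
        val seconds = (System.nanoTime() - start) / 1e9
        (result, seconds)
      }
    }
    ```
    
    For example, `val (schemaRDD, seconds) = 
Timing.timed(sqlContext.jsonRDD(jsonData))`, assuming `sqlContext` and 
`jsonData` are defined as in the test described above.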
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2603

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark removeToMapToList

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1504.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1504
    
----
commit d1abdb8dc8dc2c8d9abedd5e2f8eff5f1f754e2b
Author: Yin Huai <[email protected]>
Date:   2014-07-21T00:29:32Z

    Remove unnecessary toMap and toList.

commit 09b9bca2773c259a8d660a302252b994f1bf821e
Author: Yin Huai <[email protected]>
Date:   2014-07-21T01:01:54Z

    Merge remote-tracking branch 'upstream/master' into removeToMapToList

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
