GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/14028
[SPARK-16351][SQL] Avoid record-per type dispatch in JSON when writing
## What changes were proposed in this pull request?
Currently, `JacksonGenerator.apply` performs type-based dispatch for each row to write the appropriate values. This should not be necessary because the schema is already known: the appropriate writers can be created once from the schema and then applied to each row. This approach is similar to the one in `CatalystWriteSupport`.
This PR changes `JacksonGenerator` so that it creates the writers for the schema up front and then applies them to each row, rather than dispatching on types for every record. A minimal sketch of the pattern is shown below.
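The following is a minimal, self-contained sketch of the technique, not the actual `JacksonGenerator` code; the `Field`, `Row`, and `ValueWriter` names are illustrative only. It contrasts per-row type dispatch with writers compiled once from the schema (JSON syntax details such as commas and field names are omitted for brevity):
```scala
object WriterDispatchSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  case object BooleanType extends DataType

  final case class Field(name: String, dataType: DataType)
  type Row = Seq[Any]
  type ValueWriter = (StringBuilder, Any) => Unit

  // Before: match on the data type for every value of every row.
  def writeRowPerValueDispatch(schema: Seq[Field], row: Row, sb: StringBuilder): Unit =
    schema.zip(row).foreach { case (field, value) =>
      field.dataType match { // repeated for each record
        case IntType     => sb.append(value.asInstanceOf[Int])
        case StringType  => sb.append('"').append(value).append('"')
        case BooleanType => sb.append(value.asInstanceOf[Boolean])
      }
    }

  // After: compile each field's writer once from the schema...
  def makeWriter(dt: DataType): ValueWriter = dt match {
    case IntType     => (sb, v) => sb.append(v.asInstanceOf[Int])
    case StringType  => (sb, v) => sb.append('"').append(v).append('"')
    case BooleanType => (sb, v) => sb.append(v.asInstanceOf[Boolean])
  }

  // ...then apply the precomputed writers to each row with no
  // per-value type matching.
  def writeRows(schema: Seq[Field], rows: Iterator[Row], sb: StringBuilder): Unit = {
    val writers: Array[ValueWriter] = schema.map(f => makeWriter(f.dataType)).toArray
    rows.foreach { row =>
      var i = 0
      while (i < writers.length) {
        writers(i)(sb, row(i))
        i += 1
      }
    }
  }
}
```
The writer array is built once per write, so the cost of the type match is amortized across all records instead of being paid per value.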
The benchmark was run with the code below:
```scala
test("Benchmark for JSON writer") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
    }"""
  val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
  (0 to 10).foreach { _ =>
    val benchmark = new Benchmark("JSON writer", N)
    benchmark.addCase("writing JSON file", 10) { iter =>
      withTempPath { path =>
        df.write.format("json").save(path.getCanonicalPath)
      }
    }
    benchmark.run()
  }
}
```
This produced the following results:
- **Before**
```
JSON writer:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
writing JSON file                        1675 / 1767          0.1      13087.5       1.0X
```
- **After**
```
JSON writer:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
writing JSON file                        1597 / 1686          0.1      12477.1       1.0X
```
In addition, I ran this benchmark 10 times for each version and calculated the average times:

| **Before** | **After** |
|------------|-----------|
| 17478ms    | 16669ms   |

This is roughly a 5% improvement ((17478 - 16669) / 17478 ≈ 4.6%).
## How was this patch tested?
Existing tests should cover this.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-16351
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14028.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14028
----
commit 76ca05a383b1b3f6deea266276af872ab3a18f36
Author: hyukjinkwon <[email protected]>
Date: 2016-07-02T12:16:25Z
Avoid record-per type dispatch in JSON when writing
commit 34ec476b5afad926db89a03ff64e6cda8263ee86
Author: hyukjinkwon <[email protected]>
Date: 2016-07-02T12:18:03Z
Keep the comment
----