[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

JoshRosen Wed, 27 Apr 2016 20:49:03 -0700

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/12750


    [SPARK-14972] Improve performance of JSON schema inference's compatibleType 
method

    This patch improves the performance of `InferSchema.compatibleType` and 
`inferField`. The net result of this patch is a 2x-4x speedup in local 
benchmarks running against cached data with a massive nested schema.
    
    The key idea is to remove unnecessary sorting in `compatibleType`'s 
`StructType` merging code. This code takes two structs, merges the fields with 
matching names, and copies over the unique fields, producing a new schema which 
is the union of the two structs' schemas. Previously, this code performed a 
very inefficient `groupBy()` to match up fields with the same name, but this is 
unnecessary because `inferField` already sorts structs' fields by name: since 
both lists of fields are sorted, we can simply merge them in a single pass.
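    
    A minimal sketch of the single-pass merge described above, not the code from this patch: `Field`, `mergeTypes`, and `mergeSorted` are hypothetical stand-ins for `StructField` and the real type-widening logic, and the "fall back to string" conflict rule is an assumption made only to keep the example small.

```scala
object SortedMergeSketch {
  // Hypothetical stand-in for Spark's StructField; types are plain strings here.
  final case class Field(name: String, dataType: String)

  // Assumed conflict rule for illustration: equal types are kept,
  // anything else falls back to "string".
  def mergeTypes(a: String, b: String): String = if (a == b) a else "string"

  // Precondition: both arrays are sorted by name and contain no duplicate names.
  def mergeSorted(left: Array[Field], right: Array[Field]): Array[Field] = {
    val out = new scala.collection.mutable.ArrayBuffer[Field](left.length + right.length)
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val cmp = left(i).name.compareTo(right(j).name)
      if (cmp == 0) {
        // Same field name on both sides: merge the two types.
        out += Field(left(i).name, mergeTypes(left(i).dataType, right(j).dataType))
        i += 1
        j += 1
      } else if (cmp < 0) {
        // Field only present (so far) on the left side.
        out += left(i)
        i += 1
      } else {
        // Field only present (so far) on the right side.
        out += right(j)
        j += 1
      }
    }
    // Copy whatever remains on either side; at most one loop runs.
    while (i < left.length) { out += left(i); i += 1 }
    while (j < right.length) { out += right(j); j += 1 }
    out.toArray
  }
}
```

    Because each index only moves forward, the merge visits every field once and the output is itself sorted by name, so it can be fed straight into the next merge.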
    
    This patch also speeds up the existing field sorting in `inferField`: the 
old sorting code allocated unnecessary intermediate collections, while the new 
code uses mutable collections and performs in-place sorting.
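    
    One way to get a name-sorted field array without intermediate sorted copies is sketched below. This is an illustration under assumptions, not the patch itself: `NamedField` and `sortedFields` are hypothetical names, and the input is modelled as an iterator of (name, type) pairs rather than a JSON parser.

```scala
import java.util.{Arrays, Comparator}
import scala.collection.mutable.ArrayBuffer

object InPlaceSortSketch {
  // Hypothetical stand-in for Spark's StructField.
  final case class NamedField(name: String, dataType: String)

  // One shared comparator instance, so no ordering object is built per call.
  private val byName: Comparator[NamedField] = new Comparator[NamedField] {
    override def compare(a: NamedField, b: NamedField): Int = a.name.compareTo(b.name)
  }

  def sortedFields(parsed: Iterator[(String, String)]): Array[NamedField] = {
    // Accumulate into a mutable buffer while reading the fields.
    val buf = new ArrayBuffer[NamedField]()
    parsed.foreach { case (name, tpe) => buf += NamedField(name, tpe) }
    val fields = buf.toArray
    // java.util.Arrays.sort reorders the array itself instead of returning a copy.
    Arrays.sort(fields, byName)
    fields
  }
}
```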
    
    Finally, I replaced a `treeAggregate` call with `fold`: I doubt that 
`treeAggregate` will benefit us very much because the schemas would have to be 
enormous to realize large savings in network traffic. Since most schemas are 
probably fairly small in serialized form, they should typically fit within a 
direct task result and therefore can be incrementally merged at the driver as 
individual tasks finish. This change eliminates an entire (short) scheduler 
stage. 
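
    The shape of a `fold`-style schema merge can be simulated locally without Spark, as sketched below. Everything here is an assumption for illustration: `Schema` is a plain map from field name to type name, `mergeSchemas` is a hypothetical merge function, and nested sequences stand in for RDD partitions. The sketch mirrors how `RDD.fold` applies the zero value once per partition and again when combining the per-partition results on the driver, which is why the merge function must be associative and the zero value neutral.

```scala
object FoldSketch {
  // Hypothetical simplified schema: field name -> type name.
  type Schema = Map[String, String]

  // Assumed conflict rule for illustration only.
  def mergeTypes(a: String, b: String): String = if (a == b) a else "string"

  // Union of both schemas; fields present in both have their types merged.
  def mergeSchemas(a: Schema, b: Schema): Schema =
    b.foldLeft(a) { case (acc, (name, tpe)) =>
      acc.updated(name, acc.get(name).map(mergeTypes(_, tpe)).getOrElse(tpe))
    }

  // Local simulation: each inner Seq plays the role of one partition.
  def foldLikeRdd(partitions: Seq[Seq[Schema]]): Schema = {
    val zero: Schema = Map.empty
    // Step 1: each "task" folds its own partition into one partial schema.
    val perPartition = partitions.map(_.foldLeft(zero)(mergeSchemas))
    // Step 2: the "driver" merges the partial schemas as they arrive.
    perPartition.foldLeft(zero)(mergeSchemas)
  }
}
```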

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark schema-inference-speedups

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12750
    
----
commit 5406092f5c16189f1b6c46c1ed324f13dadd57b1
Author: Josh Rosen <[email protected]>
Date:   2016-04-28T02:12:29Z

    Stop using groupByKey to merge struct schemas.

commit 5319a6abfb37cb43c500385b21e9d905e831d24d
Author: Josh Rosen <[email protected]>
Date:   2016-04-28T02:38:21Z

    Take advantage of fact that structs' fields are already sorted by name.

commit d9793f71cf5d9e2ff5bf7d62b561e9d6e377d1a1
Author: Josh Rosen <[email protected]>
Date:   2016-04-28T03:01:47Z

    Perform in-place sort without additional allocation.

commit 4bbf4292802e475d84ec55994a4ebae3ddc2f4da
Author: Josh Rosen <[email protected]>
Date:   2016-04-28T03:04:27Z

    Replace treeAggregate with fold.

----

