GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/12750
[SPARK-14972] Improve performance of JSON schema inference's compatibleType
method
This patch improves the performance of `InferSchema.compatibleType` and
`inferField`. The net result of this patch is a 2x-4x speedup in local
benchmarks running against cached data with a massive nested schema.
The key idea is to remove unnecessary sorting in `compatibleType`'s
`StructType` merging code. This code takes two structs, merges the fields with
matching names, and copies over the unique fields, producing a new schema which
is the union of the two structs' schemas. Previously, this code performed a
very inefficient `groupBy()` to match up fields with the same name, but this is
unnecessary because `inferField` already sorts structs' fields by name: since
both lists of fields are sorted, we can simply merge them in a single pass.
This patch also speeds up the existing field sorting in `inferField`: the
old sorting code allocated unnecessary intermediate collections, while the new
code uses mutable collects and performs in-place sorting.
Finally, I replaced a `treeAggregate` call with `fold`: I doubt that
`treeAggregate` will benefit us very much because the schemas would have to be
enormous to realize large savings in network traffic. Since most schemas are
probably fairly small in serialized form, they should typically fit within a
direct task result and therefore can be incrementally merged at the driver as
individual tasks finish. This change eliminates an entire (short) scheduler
stage.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark schema-inference-speedups
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12750.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12750
----
commit 5406092f5c16189f1b6c46c1ed324f13dadd57b1
Author: Josh Rosen <[email protected]>
Date: 2016-04-28T02:12:29Z
Stop using groupByKey to merge struct schemas.
commit 5319a6abfb37cb43c500385b21e9d905e831d24d
Author: Josh Rosen <[email protected]>
Date: 2016-04-28T02:38:21Z
Take advantage of fact that structs' fields are already sorted by name.
commit d9793f71cf5d9e2ff5bf7d62b561e9d6e377d1a1
Author: Josh Rosen <[email protected]>
Date: 2016-04-28T03:01:47Z
Perform in-place sort without additional allocation.
commit 4bbf4292802e475d84ec55994a4ebae3ddc2f4da
Author: Josh Rosen <[email protected]>
Date: 2016-04-28T03:04:27Z
Replace treeAggregate with fold.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]