GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/5801
[SPARK-5938][SQL] Improve JsonRDD performance
This patch comprises a few related pieces of work:
* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populates Spark SQL structures without
intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
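
To illustrate the three points above, here is a minimal sketch in Python rather than the patch's Scala. It is not the code in this pull request. It assumes flat JSON objects, one per line, and every name in it (`tokens`, `infer_record`, `merge_types`, `infer_schema`, `parse_row`) is hypothetical. The widening rules in `merge_types` (long and double widen to double, other conflicts fall back to string) are an assumption of the sketch as well.

```python
import json
import re
from functools import reduce

# Hypothetical tokenizer for flat JSON objects; a stand-in for a real
# streaming parser, not Spark's implementation.
TOKEN = re.compile(r'''
    (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<literal>true|false|null)
  | (?P<punct>[{}:,])
''', re.VERBOSE)


def tokens(line):
    """Yield (kind, text) pairs without building a dict for the record."""
    for match in TOKEN.finditer(line):
        yield match.lastgroup, match.group()


def token_type(kind, text):
    """Map a single value token to a type name."""
    if kind == "string":
        return "string"
    if kind == "number":
        return "long" if re.fullmatch(r"-?\d+", text) else "double"
    return "null" if text == "null" else "boolean"


def infer_record(line):
    """Infer {field: type} for one record straight from its tokens."""
    schema, key = {}, None
    for kind, text in tokens(line):
        if kind == "punct":
            continue
        if key is None:
            key = text[1:-1]  # field name, escape sequences left as-is
        else:
            schema[key] = token_type(kind, text)
            key = None
    return schema


def merge_types(a, b):
    """Assumed widening rules: null yields, long+double -> double, else string."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"


def merge_schemas(left, right):
    merged = dict(left)
    for name, typ in right.items():
        merged[name] = merge_types(merged[name], typ) if name in merged else typ
    return merged


def infer_schema(lines):
    """Fold the per-record schemas into one schema for the dataset."""
    return reduce(merge_schemas, map(infer_record, lines), {})


def parse_row(line, columns):
    """Build a row for the requested columns only (projection); values of
    other fields are tokenized but never converted."""
    wanted = {name: i for i, name in enumerate(columns)}
    row, key = [None] * len(columns), None
    for kind, text in tokens(line):
        if kind == "punct":
            continue
        if key is None:
            key = text[1:-1]
        else:
            if key in wanted:
                row[wanted[key]] = json.loads(text)  # scalar token -> value
            key = None
    return tuple(row)
```

For example, `parse_row('{"id": 1, "name": "a", "score": 2}', ["score", "id"])` fills only the two requested slots and never converts the value of `name`.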
I've run some basic queries on a 300MB/100k row dataset with a flat schema
and the results are promising (about 25% less time for a `count`):
* Before: ```INFO DAGScheduler: Job 8 finished: count at <console>:20, took
2.916653 s```
* After: ```INFO DAGScheduler: Job 8 finished: count at <console>:20, took
2.184896 s```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NathanHowell/spark json-performance
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5801.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5801
----
commit 1e441e23a2cfd8712720a728056e363e41538d1f
Author: Nathan Howell <[email protected]>
Date: 2015-04-29T05:44:19Z
Eliminate arrow pattern, replace with pattern matches
commit 73a56927d09c670eb62317f611c47a90096fe693
Author: Nathan Howell <[email protected]>
Date: 2015-04-27T22:38:28Z
Improve JSON parsing and type inference performance
commit 1abf1d6010c71cd1cffa97d7564f8fb71eb19f10
Author: Nathan Howell <[email protected]>
Date: 2015-04-30T02:16:33Z
Add projection pushdown support to JsonRDD
----