GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/16386
[SPARK-18352][SQL] Support parsing multiline json files
## What changes were proposed in this pull request?
If a new option `wholeFile` is set to `true`, the JSON reader will parse
each file (instead of each line) as a single value. This is done with Jackson
streaming, so it should be capable of parsing very large documents, assuming
each resulting row will fit in memory.
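To illustrate why a per-file mode is needed, here is a minimal sketch in Python using only the standard `json` module. It is not the Spark or Jackson implementation from this PR, and `json.loads` reads the whole string rather than streaming; the sample document and both function names are hypothetical. It only shows that a pretty-printed document fails line by line but parses as one value when the whole file is treated as the unit.

```python
import json

# Hypothetical sample: one JSON document pretty-printed over several lines.
MULTILINE_DOC = '{\n  "id": 1,\n  "tags": ["a", "b"]\n}\n'


def parse_per_line(text):
    """Treat every non-empty line as its own JSON value (line-delimited mode)."""
    rows, failures = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            # A fragment of a multiline document is not valid JSON on its own.
            failures += 1
    return rows, failures


def parse_whole_file(text):
    """Treat the entire file content as a single JSON value."""
    return json.loads(text)


if __name__ == "__main__":
    print(parse_per_line(MULTILINE_DOC))
    print(parse_whole_file(MULTILINE_DOC))
```

In per-line mode every line of the sample is rejected, while the whole-file call returns the single object.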
Because the file is not buffered in memory, the corrupt record handling is
also slightly different when `wholeFile` is enabled: if there is a parsing
failure, the corrupt column will contain the filename instead of the literal JSON.
It would be easy to extend this to add the parser location (line, column and
byte offsets) to the output if desired.
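The corrupt-record behaviour described above can be sketched as follows, again in plain Python rather than the PR's Scala code. The column name `_corrupt_record`, the `location` field, and the function `read_whole_file` are assumptions made for this example only. Note that Python's decoder reports a character offset, whereas the description above mentions byte offsets.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"  # assumed column name for this sketch


def read_whole_file(filename, text, include_location=False):
    """Parse `text` as one JSON value; on failure, return a corrupt-record row.

    The corrupt row carries the filename rather than the raw JSON text,
    mirroring the whole-file behaviour described in the PR.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        row = {CORRUPT_COLUMN: filename}
        if include_location:
            # Optional extension: attach where the parser gave up.
            # `err.pos` is a character offset, not a byte offset.
            row["location"] = {
                "line": err.lineno,
                "column": err.colno,
                "offset": err.pos,
            }
        return row


if __name__ == "__main__":
    bad = '{"a": 1,\n"b": }'
    print(read_whole_file("part-00000.json", bad, include_location=True))
```

Keeping only the filename avoids holding the document text just to report an error, which is consistent with the file not being buffered.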
I've also included a few other changes that generate slightly better
bytecode and (imo) make it more obvious when and where boxing is occurring in
the parser. These are included as separate commits; let me know if they should
be flattened into this PR or moved to a new one.
## How was this patch tested?
New and existing unit tests. No performance or load tests have been run.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NathanHowell/spark SPARK-18352
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16386.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16386
----
commit 740620210b30ef02e280d161d6b08088d07300fa
Author: Nathan Howell
Date: 2016-12-22T22:16:49Z
[SPARK-18352][SQL] Support parsing multiline json files
commit 7902255a79fc2581214a09ccd38437cebd19d862
Author: Nathan Howell
Date: 2016-12-22T00:27:19Z
JacksonParser.parseJsonToken should be explicit about nulls and boxing
commit 149418647c9831e88af866d44d31496940c02162
Author: Nathan Howell
Date: 2016-12-21T23:49:37Z
Increase type safety of makeRootConverter, remove runtime type tests
commit 7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695
Author: Nathan Howell
Date: 2016-12-23T02:13:59Z
Field converter lookups should be O(1)
----