GitHub user NathanHowell opened a pull request:

    https://github.com/apache/spark/pull/16386

    [SPARK-18352][SQL] Support parsing multiline json files

    ## What changes were proposed in this pull request?
    
    If the new option `wholeFile` is set to `true`, the JSON reader parses 
each file (instead of each line) as a single value. Parsing uses Jackson 
streaming, so it should handle very large documents, provided the resulting 
row fits in memory.
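
To make the per-line versus per-file distinction concrete, here is a minimal plain-Python sketch. It is only an illustration and not the Spark or Jackson code in this PR: the sample document and both helper functions are hypothetical, and `json.loads` buffers its whole input rather than streaming it.

```python
import json

# Hypothetical sample: one JSON object pretty-printed across several lines.
DOCUMENT = '{\n  "id": 1,\n  "tags": ["a", "b"]\n}\n'


def parse_per_line(text):
    """Line-delimited mode: every non-empty line must be a complete JSON value."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            rows.append(None)  # this line is not valid JSON on its own
    return rows


def parse_whole_file(text):
    """Whole-file mode: the entire text is parsed as a single JSON value."""
    return [json.loads(text)]


per_line_rows = parse_per_line(DOCUMENT)      # every line fails to parse
whole_file_rows = parse_whole_file(DOCUMENT)  # one row holding the full object
```

In this sketch the per-line reader produces no usable rows for a pretty-printed document, which is the situation the `wholeFile` option is meant to address.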
    
    Because the file is not buffered in memory, corrupt record handling also 
differs slightly when `wholeFile` is enabled: if parsing fails, the corrupt 
column contains the filename instead of the literal JSON. It would be easy to 
extend this to add the parser location (line, column, and byte offsets) to the 
output if desired.
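
The difference in corrupt record handling can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in this PR: the column name `_corrupt_record`, the helper names, and the file names are all chosen for the example.

```python
import json

# Assumed name of the corrupt column for this sketch.
CORRUPT_COLUMN = "_corrupt_record"


def read_lines(text):
    """Line-delimited mode: a bad line is kept verbatim in the corrupt column."""
    rows = []
    for line in text.splitlines():
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            rows.append({CORRUPT_COLUMN: line})
    return rows


def read_whole_file(filename, text):
    """Whole-file mode: on failure only the filename is recorded,
    since the document itself is not assumed to be held in memory."""
    try:
        return [json.loads(text)]
    except json.JSONDecodeError:
        return [{CORRUPT_COLUMN: filename}]


line_rows = read_lines('{"a": 1}\n{"a": \n')
file_rows = read_whole_file("part-0.json", '{"a": \n')
```

Here `line_rows` keeps the malformed text of the second line, while `file_rows` holds only the hypothetical filename `part-0.json`.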
    
    I've also included a few other changes that generate slightly better 
bytecode and (imo) make it more obvious when and where boxing occurs in the 
parser. These are included as separate commits; let me know if they should be 
flattened into this PR or moved to a new one.
    
    ## How was this patch tested?
    
    New and existing unit tests. No performance or load tests have been run.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NathanHowell/spark SPARK-18352

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16386.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16386
    
----
commit 740620210b30ef02e280d161d6b08088d07300fa
Author: Nathan Howell <[email protected]>
Date:   2016-12-22T22:16:49Z

    [SPARK-18352][SQL] Support parsing multiline json files

commit 7902255a79fc2581214a09ccd38437cebd19d862
Author: Nathan Howell <[email protected]>
Date:   2016-12-22T00:27:19Z

    JacksonParser.parseJsonToken should be explicit about nulls and boxing

commit 149418647c9831e88af866d44d31496940c02162
Author: Nathan Howell <[email protected]>
Date:   2016-12-21T23:49:37Z

    Increase type safety of makeRootConverter, remove runtime type tests

commit 7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695
Author: Nathan Howell <[email protected]>
Date:   2016-12-23T02:13:59Z

    Field converter lookups should be O(1)

----


