[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

NathanHowell Fri, 23 Dec 2016 10:11:19 -0800

Github user NathanHowell commented on the issue:

    https://github.com/apache/spark/pull/16386
  
    @srowen It is functionally the same as what you're suggesting. The question 
is how (or if) it should be first class in the `DataFrameReader` API. If we 
agree that it should be exposed, either via a new `FileFormat` or an option to 
`JsonFileFormat`, some abstraction is necessary to support reading from 
different RDD classes.
    
    This PR just pushes that boundary a little further and lets the inference 
and parser code work over more types, not just `String`. This may make parsing 
more efficient in the line-oriented codepath by avoiding a conversion from 
`Text` and `UTF8String` (in `JsonToStruct`) to `String`, and also lets us parse 
an `InputStream` without requiring all of the data to be in memory. For small 
files it's not likely to have a benefit (if the file is smaller than 4k it will 
be read entirely anyway), but as the file size increases this reduces the 
amount of memory required for parsing, is friendlier (in theory) on the GC, and 
lets us consume files larger than 2GB.
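
    To make the idea concrete, here is a minimal sketch of the kind of 
abstraction described above. It is not the code in this PR: the class name, the 
`FROM_*` adapters and `countTopLevelObjects` are hypothetical, and a simple 
brace counter stands in for the real JSON parser so that the example needs only 
the JDK. The point is that the consumer is written once against a `Reader` and 
each input type supplies its own way of opening one, so an `InputStream` is 
consumed in fixed-size chunks instead of being materialized as a `String`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class StreamingJsonSketch {
    // Hypothetical adapters: one "how to open this input" function per source type.
    static final Function<String, Reader> FROM_STRING = StringReader::new;
    static final Function<byte[], Reader> FROM_BYTES =
        b -> new InputStreamReader(new ByteArrayInputStream(b), StandardCharsets.UTF_8);
    static final Function<InputStream, Reader> FROM_STREAM =
        in -> new InputStreamReader(in, StandardCharsets.UTF_8);

    // Stand-in for the parser (an assumption, not a JSON validator): counts
    // objects whose closing brace returns the brace depth to zero. Braces inside
    // string literals are ignored. Only one buffer is held in memory at a time.
    static <T> int countTopLevelObjects(T source, Function<T, Reader> open) {
        try (Reader reader = open.apply(source)) {
            char[] buf = new char[4096]; // illustrative chunk size
            int depth = 0;
            int count = 0;
            boolean inString = false;
            boolean escaped = false;
            int n;
            while ((n = reader.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buf[i];
                    if (inString) {
                        if (escaped) {
                            escaped = false;      // character after a backslash
                        } else if (c == '\\') {
                            escaped = true;
                        } else if (c == '"') {
                            inString = false;     // end of string literal
                        }
                    } else if (c == '"') {
                        inString = true;
                    } else if (c == '{') {
                        depth++;
                    } else if (c == '}') {
                        depth--;
                        if (depth == 0) {
                            count++;
                        }
                    }
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two objects, the first spread over several lines.
        String doc = "{\n  \"a\": {\"b\": \"}\"}\n}\n{\"c\": 1}";
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        System.out.println(countTopLevelObjects(doc, FROM_STRING));
        System.out.println(countTopLevelObjects(bytes, FROM_BYTES));
        System.out.println(countTopLevelObjects(new ByteArrayInputStream(bytes), FROM_STREAM));
    }
}
```

    The same consumer runs over all three input types. Only the stream variant 
avoids holding the whole document in memory, which is what would matter for 
large multiline files.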




