Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@srowen It is functionally the same as what you're suggesting. The question
is how (or if) it should be first class in the `DataFrameReader` API. If we
agree that it should be exposed, either via a new `FileFormat` or an option to
`JsonFileFormat`, some abstraction is necessary to support reading from
different RDD classes.
This PR just pushes that boundary a little further and lets the inference
and parser code work over more types, not just `String`. This may make parsing
more efficient in the line-oriented code path by avoiding a conversion from
`Text` and `UTF8String` (in `JsonToStruct`) to `String`, and it also lets us
parse an `InputStream` without requiring all of the data to be in memory. For
small files it's not likely to have a benefit (if the file is smaller than 4k
it will be read entirely anyway), but as the file size increases this reduces
the amount of memory required for parsing, is friendlier (in theory) on the GC,
and lets us consume files larger than 2GB.
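To make the idea concrete, here is a rough sketch of the shape of that
abstraction. It is not the code in this PR and not Spark's API: it is plain
Python, and the names `as_stream`, `iter_top_level_values` and `CHUNK_SIZE`
are made up for illustration. One adapter turns several input types into a
byte stream, and the reader pulls 4k at a time, so memory is bounded by the
largest single top-level value rather than by the whole input. It only handles
top-level objects and arrays.

```python
import io
import json
from typing import BinaryIO, Iterator, Union

# Hypothetical chunk size, chosen to match the 4k read mentioned above.
CHUNK_SIZE = 4096

QUOTE = ord('"')
BACKSLASH = ord("\\")
OPENERS = b"{["
CLOSERS = b"}]"


def as_stream(source: Union[str, bytes, bytearray, BinaryIO]) -> BinaryIO:
    """Adapt str, bytes, or an open binary stream to a single interface."""
    if isinstance(source, str):
        return io.BytesIO(source.encode("utf-8"))
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    return source


def iter_top_level_values(source) -> Iterator[object]:
    """Yield each top-level JSON object or array, reading in small chunks.

    Only one top-level value is buffered at a time. Scanning bytes is safe
    for UTF-8 because multi-byte sequences never contain ASCII bytes.
    """
    stream = as_stream(source)
    buf = bytearray()
    depth = 0
    in_string = False
    escaped = False
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        for byte in chunk:
            if depth == 0 and byte not in OPENERS:
                continue  # skip whitespace/separators between values
            buf.append(byte)
            if in_string:
                if escaped:
                    escaped = False
                elif byte == BACKSLASH:
                    escaped = True
                elif byte == QUOTE:
                    in_string = False
            elif byte == QUOTE:
                in_string = True
            elif byte in OPENERS:
                depth += 1
            elif byte in CLOSERS:
                depth -= 1
                if depth == 0:
                    yield json.loads(buf.decode("utf-8"))
                    buf.clear()
```

The same function then accepts a `str`, raw bytes, or an open file, e.g.
`for value in iter_top_level_values(open(path, "rb")): ...` where `path` is
whatever file you want to read. A token-level streaming parser would go
further and avoid buffering even a single large value.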