Joe Halliwell created SPARK-7366:
------------------------------------
Summary: Support multi-line JSON objects via a depth hint
Key: SPARK-7366
URL: https://issues.apache.org/jira/browse/SPARK-7366
Project: Spark
Issue Type: Improvement
Components: Input/Output
Reporter: Joe Halliwell
Priority: Minor
The present object-per-line format for ingesting JSON data has a couple of
deficiencies:
1. It's not itself JSON
2. It's often harder for humans to read
The object-per-file format addresses these, but at the cost of producing many
files, which can be unwieldy.
Since it is feasible to read and write large JSON files via streaming (and many
systems do), it seems reasonable to support them directly as an input format.
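To illustrate the streaming point, here is a minimal Python sketch that reads
multi-line JSON objects one at a time. It is not how Spark does or would
implement this. It assumes the input is a sequence of whitespace-separated
top-level objects (a single top-level array would need an incremental parser
instead), and the name iter_json_objects and the chunk_size parameter are
made up for illustration.

```python
import io
import json


def iter_json_objects(stream, chunk_size=65536):
    """Yield top-level JSON objects from a text stream, one at a time.

    Assumes whitespace-separated top-level objects or arrays, each of
    which may span several lines.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                # raw_decode parses one value from the front of the buffer
                # and returns it with the index where parsing stopped.
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # Most likely an incomplete record: wait for more input.
                break
            yield obj
            buffer = buffer[end:]
        if not chunk:
            # End of stream: anything left over is malformed or truncated.
            if buffer.strip():
                raise ValueError("trailing data that is not a complete JSON value")
            return


# Two pretty-printed objects in a single "file".
sample = io.StringIO('{\n  "a": 1\n}\n{\n  "b": {"c": [1, 2]}\n}\n')
records = list(iter_json_objects(sample, chunk_size=8))
```

This re-parses an incomplete record each time a new chunk arrives, so it is
only meant to show that sequential reading is straightforward. It does nothing
to help with starting from an arbitrary offset.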
The key challenge is finding record boundaries without parsing the file from
the start, i.e. given an arbitrary offset, locating a nearby boundary. In the
general case this is impossible: you can't be sure you've identified the start
of a top-level record without tracing back to the start of the file.
However, if you know something about the format of the file, such as its
maximum object depth, it seems plausible that we can do better.
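One way a depth hint might help is sketched below in Python. This is an
illustration under strong assumptions, not a proposed implementation: braces
never occur inside string values, every top-level record is an object, and
some record in the scanned region actually reaches the hinted depth. The
names find_record_start and max_depth are hypothetical. The idea is to count
braces relative to the unknown starting depth; once the observed range of
relative depths spans max_depth, the starting depth is pinned down and the
lowest level seen must be the top level.

```python
def find_record_start(text, offset, max_depth):
    """Return the index of the first top-level '{' at or after offset.

    Returns None if the scan never observes the full depth range, in
    which case the starting depth stays ambiguous.
    Assumes no '{' or '}' characters inside JSON strings.
    """
    depth = 0         # depth relative to the (unknown) depth at offset
    lo = hi = 0       # lowest and highest relative depth seen so far
    candidate = None  # first '{' opened at the lowest level seen so far
    for i in range(offset, len(text)):
        ch = text[i]
        if ch == "{":
            if depth == lo and candidate is None:
                candidate = i
            depth += 1
            hi = max(hi, depth)
        elif ch == "}":
            depth -= 1
            if depth < lo:
                # We started deeper than previously thought, so any
                # earlier candidate was a nested object, not a record.
                lo = depth
                candidate = None
        if hi - lo == max_depth and candidate is not None:
            # The full range has been observed, so level lo is depth 0.
            return candidate
    return None


text = '{"a": {"b": 1}}\n{"c": {"d": 2}}\n{"e": {"f": 3}}'
second = text.index('{"c"')
# Start scanning from inside the first record's nested object.
start = find_record_start(text, 8, max_depth=2)
```

Here start equals second: the scan skips the remainder of the first record and
lands on the opening brace of the next one. If the hint is only a loose upper
bound (say max_depth=3 for this input), the function returns None, which shows
why the hint would need to be tight for this approach to work.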
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)