[jira] [Commented] (SPARK-7366) Support multi-line JSON objects

Lars Francke (JIRA) Tue, 05 May 2015 07:14:09 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528489#comment-14528489
 ]


Lars Francke commented on SPARK-7366:
-------------------------------------

This would be very helpful. ESRI has a RecordReader that could serve as 
inspiration as well. It does a reasonable job of finding boundaries on a JSON 
subset: 
https://github.com/Esri/spatial-framework-for-hadoop/blob/master/json/src/main/java/com/esri/json/hadoop/UnenclosedJsonRecordReader.java

> Support multi-line JSON objects
> -------------------------------
>
>                 Key: SPARK-7366
>                 URL: https://issues.apache.org/jira/browse/SPARK-7366
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Joe Halliwell
>            Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at the cost of producing many 
> files, which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do), it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from 
> the start, i.e. given an offset, to locate a nearby boundary. In the general 
> case this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more about the structure of the file, i.e. the 
> maximum object depth, it seems plausible that we can do better.
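
To make the boundary problem in the quoted description concrete, here is a minimal Python sketch of the baseline it argues against: finding top-level records by scanning from the very start of the input while tracking brace depth and string state. This is not the depth-hint approach from the issue and not Spark or ESRI code; the function name `record_spans` and the sample data are hypothetical, and the sketch assumes every record is a JSON object (`{...}`).

```python
import json


def record_spans(text):
    """Yield (start, end) offsets of each top-level JSON object in text.

    Hypothetical helper: scans from offset 0 and tracks brace depth and
    string state, so braces inside string literals are ignored.
    """
    depth = 0          # current nesting depth of '{' ... '}'
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    start = None       # offset of the '{' that opened the current record
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield start, i + 1


# Two records, the first spread over several lines.
sample = '''{
  "id": 1,
  "note": "a brace } inside a string"
}
{"id": 2, "child": {"id": 3}}
'''

records = [json.loads(sample[s:e]) for s, e in record_spans(sample)]
```

The scanner is only correct because it starts at offset 0: dropped into the middle of `sample`, it could not tell whether a `{` opens a record, opens the nested `child` object, or sits inside a string. That is the ambiguity a structural hint such as a maximum object depth would have to resolve.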



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

