[ 
https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Halliwell updated SPARK-7366:
---------------------------------
    Description: 
h2. Background

The present object-per-line format for ingesting JSON data has a couple of 
deficiencies:
1. It's not itself JSON
2. It's often harder for humans to read
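
To illustrate the first point, here is a minimal sketch in Python (the sample records and the helper name `parses_as_json` are made up for illustration). It shows that an object-per-line file parses line by line but not as a single JSON document:

```python
import json

# Two records in the object-per-line layout (hypothetical sample data).
lines_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

def parses_as_json(text):
    """Return True if the whole text is one valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The file as a whole is rejected: a second value follows the first.
whole_file_ok = parses_as_json(lines_text)

# Each line on its own is a valid JSON object.
per_line = [json.loads(line) for line in lines_text.splitlines()]
```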

The object-per-file format addresses these, but at the cost of producing many 
files, which can be unwieldy.

Since it is feasible to read and write large JSON files via streaming (and many 
systems do), it seems reasonable to support them directly as an input format.
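
As a rough sketch of what streaming a large JSON file can look like, the Python below reads a top-level array in chunks and yields one record at a time. It assumes the file is a single array whose elements are all objects; the function name `iter_records` and the `chunk_size` parameter are illustrative, and this is not Spark code:

```python
import io
import json

def iter_records(stream, chunk_size=65536):
    """Yield the objects of a top-level JSON array without loading it whole.

    Assumes the document is one array whose elements are all objects.
    """
    decoder = json.JSONDecoder()
    buf = ""
    seen_open = False
    while True:
        # Skip whitespace and, once inside the array, the commas
        # between records (this sketch is lenient about separators).
        buf = buf.lstrip(" \t\r\n," if seen_open else " \t\r\n")
        if buf and not seen_open:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buf, seen_open = buf[1:], True
            continue
        if buf.startswith("]"):
            return
        if buf.startswith("{"):
            try:
                record, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                pass  # treated as incomplete: fall through and read more
            else:
                yield record
                buf = buf[end:]
                continue
        elif buf:
            raise ValueError("expected an object or the end of the array")
        chunk = stream.read(chunk_size)
        if not chunk:
            # Truncated or malformed input both end up here.
            raise ValueError("unexpected end of input")
        buf += chunk

# Usage with an in-memory stream and a deliberately tiny chunk size.
sample = io.StringIO('[{"a": 1},\n {"b": [2, 3]}]')
records = list(iter_records(sample, chunk_size=4))
```

The sketch re-parses a partial record each time a chunk arrives, so it favours brevity over speed. Note that it still reads sequentially from the start of the file.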

h2. Suggested approach

The key challenge is to find record boundaries without parsing the file from 
the start, i.e. given an offset, locate a nearby boundary. In the general case 
this is impossible: you can't be sure you've identified the start of a 
top-level record without tracing back to the start of the file.
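
The difficulty can be sketched in Python. The sample document and the names `record_starts` and `naive_guess` are made up for illustration. A scan from the start of the file that tracks nesting depth and string state finds the true record starts, while a context-free guess from an arbitrary offset can land on a brace inside a string or on a nested object:

```python
def record_starts(text):
    """Offsets of the objects directly inside the top-level array,
    found by scanning from the start of the text."""
    starts = []
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            if ch == "{" and depth == 1:
                starts.append(i)
            depth += 1
        elif ch in "}]":
            depth -= 1
    return starts

def naive_guess(text, offset):
    """Guess a boundary with no context: the first '{' at or after offset."""
    return text.find("{", offset)

doc = '[{"note": "a { b", "child": {"x": 1}}, {"note": "c"}]'
true_starts = record_starts(doc)

# Starting inside the first record, the guess lands on the brace in a string.
guess_in_string = naive_guess(doc, 5)
# Starting a little later, the guess lands on a nested object.
guess_nested = naive_guess(doc, 20)
```

Neither guess is a true record start, and telling them apart requires the depth and string state accumulated from earlier in the file.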

However, if we know something more about the structure of the file, e.g. the 
maximum object depth, it seems plausible that we can do better.

  was:
.h2 Background

The present object-per-line format for ingesting JSON data has a couple of 
deficiencies:
1. It's not itself JSON
2. It's often harder for humans to read

The object-per-file format addresses these, but at a cost of producing many 
files which can be unwieldy.

Since it is feasible to read and write large JSON files via streaming (and many 
systems do) it seems reasonable to support them directly as an input format.

.h2 Suggested approach

The key challenge is to find record boundaries without parsing the file from 
the start i.e. given an offset, locate a nearby boundary. In the general case 
this is impossible: you can't be sure you've identified the start of a 
top-level record without tracing back to the start of the file.

However, if we know something more of the structure of the file i.e. maximum 
object depth it seems plausible that we can do better.


> Support multi-line JSON objects
> -------------------------------
>
>                 Key: SPARK-7366
>                 URL: https://issues.apache.org/jira/browse/SPARK-7366
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Joe Halliwell
>            Priority: Minor
>
> h2. Background
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at the cost of producing many 
> files, which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do), it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach
> The key challenge is to find record boundaries without parsing the file from 
> the start, i.e. given an offset, locate a nearby boundary. In the general case 
> this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more about the structure of the file, e.g. the 
> maximum object depth, it seems plausible that we can do better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
