[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539770#comment-14539770 ]

Joe Halliwell edited comment on SPARK-7366 at 5/12/15 12:36 PM:
----------------------------------------------------------------

Thanks for the link.

I'd planned to solve the "initial state" problem (i.e. are we inside a string or 
not?) by running two JSON lexers in parallel. The lower footprint of ESRI's 
hand-tooled scanner is certainly attractive, but it's less obviously correct, 
and I think a full lexer is probably required for the next step: identifying the 
first complete top-level record.
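
To make the parallel-lexer idea concrete, here's a toy version (names made up, 
and the "lexer" is cut down to just the string/escape state -- which is also why 
a full lexer matters: it can rule out a hypothesis far sooner than this can):

{code:java}
// Toy sketch of the "two lexers in parallel" idea: two hypotheses about the
// split offset ("inside a string" vs "not inside a string") are advanced in
// lockstep until one becomes impossible or both converge on the same state.
final class SplitStateProbe {

    /** Lexical state reduced to the part that makes an arbitrary split offset ambiguous. */
    static final class Hypothesis {
        boolean inString;
        boolean escaped;
        boolean failed;

        Hypothesis(boolean inString) { this.inString = inString; }

        void accept(char c) {
            if (failed) return;
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '\\') {
                failed = true; // a bare backslash never occurs outside a string in valid JSON
            }
        }

        boolean agreesWith(Hypothesis other) {
            return inString == other.inString && escaped == other.escaped;
        }
    }

    final Hypothesis inside = new Hypothesis(true);   // "the split offset is inside a string"
    final Hypothesis outside = new Hypothesis(false); // "the split offset is not inside a string"

    /**
     * Feeds characters from the split offset onward to both hypotheses and returns
     * the index just past the point where the ambiguity is resolved: either one
     * hypothesis hits an impossible character, or both converge on the same state
     * (after which it no longer matters which was true). Returns -1 if the buffer
     * ends first. The surviving (or agreed) hypothesis then carries the true
     * lexical state, which is what the record-boundary search needs.
     */
    int scanUntilResolved(CharSequence buf) {
        for (int i = 0; i < buf.length(); i++) {
            inside.accept(buf.charAt(i));
            outside.accept(buf.charAt(i));
            if (inside.failed || outside.failed || inside.agreesWith(outside)) {
                return i + 1;
            }
        }
        return -1; // undecided within this buffer
    }
}
{code}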

My initial suggestion was to use a "maximum depth" hint to pick out top-level 
records: we could fast-forward the split to the first complete record at the 
specified depth. This is certainly feasible, but the slight flexibility offered 
by a maximum rather than a fixed depth may not be worth the additional 
code/documentation complexity.
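
For the record, here's roughly how I'd imagined the hint being used (a guess at 
a mechanism rather than a design; the names are made up). Scanning forward from 
the split you only see depth changes relative to an unknown starting depth, but 
the hint bounds that unknown from above and the structure itself bounds it from 
below, so it can eventually be pinned down:

{code:java}
// A guess at a mechanism, not a worked-out design. If the scanned region stays
// inside the outermost container, then 1 <= d + rel <= maxDepth at every point,
// where d is the unknown absolute depth at the split offset and rel is the
// observed relative depth. Hence 1 - min(rel) <= d <= maxDepth - max(rel); once
// those bounds meet, d is known and genuine top-level boundaries can be
// recognised from that point on. Assumes the string/escape state at the offset
// has already been resolved (e.g. by the parallel-lexer probe above).
final class DepthHintResolver {

    static final int UNRESOLVED = -1;

    /** Returns the absolute depth at the start of {@code buf}, or UNRESOLVED. */
    static int resolveStartDepth(CharSequence buf, int maxDepth) {
        int rel = 0, min = 0, max = 0;
        boolean inString = false, escaped = false;
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue;
            }
            if (c == '"') { inString = true; continue; }
            if (c == '{' || c == '[') rel++;
            else if (c == '}' || c == ']') rel--;
            else continue;
            min = Math.min(min, rel);
            max = Math.max(max, rel);
            if (1 - min == maxDepth - max) {
                return 1 - min; // bounds have met: depth at the split offset is known
            }
        }
        return UNRESOLVED; // the buffer never exercised the full depth range
    }
}
{code}

The obvious weakness is that the bounds only meet once the scan happens to 
witness the file's full depth range, which adds to the complexity concern above.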

But on reflection, I think the ESRI approach of looking for a (set of) 
"top-level" field(s) is easier to explain, and more useful!

I propose to implement that instead (rough sketch below):
- with user-specified top-level fields
- using a lexer rather than a regex to drive the process
- (probably) with a user-specified maximum object size
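
Roughly what I have in mind (a sketch only: the names are illustrative, it 
assumes the lexical state at the split offset has already been resolved as 
above, and it relies on the user's promise that the field only occurs at the 
top level of a record):

{code:java}
// Illustrative only: scan forward from a split offset with a minimal lexer and
// return the offset of the '{' that opens the object directly containing the
// first occurrence of a user-specified "top-level" field, giving up after
// maxScanChars (the "maximum object size" knob).
import java.util.ArrayDeque;
import java.util.Deque;

final class TopLevelFieldLocator {

    static int findRecordStart(CharSequence buf, String topLevelField, int maxScanChars) {
        Deque<Integer> openBraces = new ArrayDeque<>(); // offsets of currently unclosed '{'
        int limit = Math.min(buf.length(), maxScanChars);
        int i = 0;
        while (i < limit) {
            char c = buf.charAt(i);
            if (c == '{') {
                openBraces.push(i);
                i++;
            } else if (c == '}') {
                if (!openBraces.isEmpty()) openBraces.pop(); // may close a pre-split object
                i++;
            } else if (c == '"') {
                int end = endOfString(buf, i + 1, limit);
                boolean isKey = end < limit && nextNonSpace(buf, end + 1, limit) == ':';
                if (isKey
                        && topLevelField.contentEquals(buf.subSequence(i + 1, end))
                        && !openBraces.isEmpty()) {
                    return openBraces.peek(); // the object that owns the field starts here
                }
                i = end + 1;
            } else {
                i++; // whitespace, commas, numbers, literals: irrelevant to boundaries
            }
        }
        return -1; // no provable record start within the scan window
    }

    private static int endOfString(CharSequence buf, int from, int limit) {
        boolean escaped = false;
        for (int i = from; i < limit; i++) {
            char c = buf.charAt(i);
            if (escaped) escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"') return i;
        }
        return limit; // unterminated within the window
    }

    private static char nextNonSpace(CharSequence buf, int from, int limit) {
        for (int i = from; i < limit; i++) {
            if (!Character.isWhitespace(buf.charAt(i))) return buf.charAt(i);
        }
        return '\0';
    }
}
{code}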

I'm planning to implement this in Java as a Hadoop InputFormat -- at least 
initially -- but I think that stops short of supporting the Spark use cases. 
I'd really welcome some pointers on how best to get this working nicely with 
SparkSQL.
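
For the packaging, something along these lines (class names made up; 
compression, buffering and error handling omitted). It's deliberately 
non-splittable: it just streams one complete top-level object per record, and 
slotting the boundary-finding above into a splittable version is the actual 
work:

{code:java}
// Non-splittable starting point: each file is read by a single task, and each
// balanced top-level JSON object becomes one record (key = byte offset of its
// opening brace). Assumes the file is a concatenation or array of objects.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MultiLineJsonInputFormat extends FileInputFormat<LongWritable, Text> {

    // Non-splittable on purpose: making this splittable is precisely where the
    // boundary-finding sketched above would go.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new MultiLineJsonRecordReader();
    }

    static class MultiLineJsonRecordReader extends RecordReader<LongWritable, Text> {
        private FSDataInputStream in;
        private long pos;
        private long length;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext ctx) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            length = split.getLength();
            FileSystem fs = split.getPath().getFileSystem(ctx.getConfiguration());
            in = fs.open(split.getPath());
            pos = 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Skip whitespace, commas and array brackets until the next record's '{'.
            int b;
            while ((b = in.read()) != -1 && b != '{') {
                pos++;
            }
            if (b == -1) {
                return false;
            }
            key.set(pos);
            pos++;
            // Collect bytes until the braces balance, ignoring braces inside strings.
            // Unbuffered, byte-at-a-time I/O is kept only for brevity.
            ByteArrayOutputStream record = new ByteArrayOutputStream();
            record.write('{');
            int depth = 1;
            boolean inString = false, escaped = false;
            while (depth > 0 && (b = in.read()) != -1) {
                pos++;
                record.write(b);
                if (inString) {
                    if (escaped) escaped = false;
                    else if (b == '\\') escaped = true;
                    else if (b == '"') inString = false;
                } else if (b == '"') inString = true;
                else if (b == '{') depth++;
                else if (b == '}') depth--;
            }
            value.set(record.toByteArray());
            return depth == 0; // a truncated trailing record is dropped
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return length == 0 ? 1.0f : Math.min(1.0f, pos / (float) length); }
        @Override public void close() throws IOException { if (in != null) in.close(); }
    }
}
{code}

On the Spark side I'd presumably point newAPIHadoopFile at this and feed the 
resulting JSON strings to something like SQLContext.jsonRDD for schema 
inference -- which is the part I'd most like guidance on.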






> Support multi-line JSON objects
> -------------------------------
>
>                 Key: SPARK-7366
>                 URL: https://issues.apache.org/jira/browse/SPARK-7366
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Joe Halliwell
>            Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at a cost of producing many 
> files which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do) it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from 
> the start, i.e. given an offset, locate a nearby record boundary. In the general 
> case this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more about the structure of the file, e.g. the 
> maximum object depth, it seems plausible that we can do better.


