[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359610#comment-15359610 ]

Max Moroz commented on SPARK-7366:
----------------------------------

In any realistic situation, the user will study a reduced version of the file 
before they try to parse the entire huge file. It's easy to figure out the 
top-level fields from such a reduced file, so I think it's perfectly 
reasonable to require them as an argument to the parser. OTOH, the maximum 
depth and maximum object size observed in the reduced file may be unreliable 
(the complete file may have larger values than the tiny fragment), so ideally 
I'd avoid maximum object size as an argument; is it really that important for 
performance?
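
For concreteness, here is a minimal sketch of what "top-level fields as an 
argument" could look like, assuming the Spark 2.0 DataFrameReader API; the 
"multiline" option name is purely hypothetical, a placeholder for whatever 
the proposed format ends up being called:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("multiline-json").getOrCreate()

// Top-level fields worked out by eyeballing the reduced file.
val topLevel = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)))

val df = spark.read
  .schema(topLevel)            // required up front, not inferred
  .option("multiline", "true") // hypothetical option for the proposed format
  .json("/data/huge.json")
{code}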

Another issue I can think of is what happens if the file is malformed. 
Obviously, Spark can just raise an exception, but that's unfortunate if the 
file is huge and the user actually doesn't mind skipping some incorrect 
parts. If at all possible, it would be nice to offer the user a choice 
between "raise an exception" and "recover from errors by skipping the 
invalid parts".

By the way, XML presents a very similar problem. Is there any proposal for it?

> Support multi-line JSON objects
> -------------------------------
>
>                 Key: SPARK-7366
>                 URL: https://issues.apache.org/jira/browse/SPARK-7366
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Joe Halliwell
>            Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at a cost of producing many 
> files which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do) it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file 
> from the start, i.e. given an offset, locate a nearby boundary. In the 
> general case this is impossible: you can't be sure you've identified the 
> start of a top-level record without tracing back to the start of the file.
> However, if we know something more about the structure of the file, e.g. 
> the maximum object depth, it seems plausible that we can do better.
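
To make the depth hint concrete, here is a minimal sketch of the kind of 
boundary search it might enable. Everything here is hypothetical (the names, 
the signatures, the heuristic itself), not an actual Spark implementation, 
and it assumes the scan does not begin inside a string literal, which is 
exactly the case the issue notes cannot be ruled out in general:

{code:scala}
// Given a split offset and the depth hint, find the next '{' from which
// a complete JSON value balances without nesting deeper than the hint.
def nextRecordStart(buf: Array[Byte], offset: Int, maxDepth: Int): Option[Int] = {
  var i = offset
  while (i < buf.length) {
    if (buf(i) == '{' && balancesWithin(buf, i, maxDepth)) return Some(i)
    i += 1
  }
  None
}

// Scan forward from a '{', tracking string/escape state and nesting.
// Succeeds if the value closes cleanly; gives up as soon as nesting
// exceeds the hint, which bounds the work spent on any one candidate.
def balancesWithin(buf: Array[Byte], start: Int, maxDepth: Int): Boolean = {
  var depth = 0; var i = start
  var inString = false; var escaped = false
  while (i < buf.length) {
    val c = buf(i).toChar
    if (inString) {
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' => inString = true
      case '{' | '[' =>
        depth += 1
        if (depth > maxDepth) return false
      case '}' | ']' =>
        depth -= 1
        if (depth == 0) return true
      case _ => // other characters don't affect nesting
    }
    i += 1
  }
  false // ran off the end of the buffer without balancing
}
{code}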


