Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@HyukjinKwon I agree that overloading the corrupt record column is
undesirable and `F.input_file_name` is a better way to fetch the filename. It
would be nice to extend this concept further and provide new functions (like
`F.json_exception`) to retrieve exceptions and their locations, and this would
work for the base case (parsing a string) as well as `wholeFile`. Plumbing this
type of change through appears to require thread-local storage (unfortunately),
but otherwise doesn't look too bad.
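
For reference, a minimal PySpark sketch of the `F.input_file_name` approach; the input path and the `source_file` column name below are placeholders for illustration, not anything from this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read JSON and record which file each row came from, rather than
# overloading the corrupt record column with the filename.
df = (spark.read
      .json("/path/to/json/")                        # placeholder path
      .withColumn("source_file", F.input_file_name()))

df.select("source_file").distinct().show(truncate=False)
```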
The question then is what to put in the corrupt record column, if one is
defined, when in `wholeFile` mode. To retain consistency with the string paths
we should really put the entire file in the column. This is problematic for
large files (>2GB) since Spark SQL doesn't have blob support... so the
allocations will fail (along with the task) and there is no way for the end
user to work around this limitation. Functions like `substr` are applied to
byte arrays and not file streams. Perhaps it's good enough to issue a warning
(along the lines of "don't define a corrupt record column in `wholeFile` mode")
and hope for the best?
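
To make the scenario concrete, here is a hedged sketch of reading in `wholeFile` mode with a corrupt record column defined; the path, schema, and column name are placeholders, and `wholeFile` is the option name as proposed in this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("_corrupt_record", StringType()),  # corrupt record column
])

df = (spark.read
      .schema(schema)
      .option("wholeFile", "true")                            # one JSON document per file
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/path/to/large/json/files/"))                    # placeholder path

# If a file fails to parse, the entire file body would land in
# _corrupt_record, which cannot work once the file exceeds the ~2GB
# byte array limit.
df.filter(df["_corrupt_record"].isNotNull()).show(1)
```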