Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@HyukjinKwon I agree that overloading the corrupt record column is
undesirable and `F.input_file_name` is a better way to fetch the filename. It
would be nice to extend this concept further and provide new functions (like
`F.json_exception`) to retrieve exceptions and their locations, and this would
work for the base case (parsing a string) as well as `wholeFile`. Plumbing this
type of change through appears to require thread-local storage (unfortunately),
but otherwise doesn't look too bad.
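
For reference, a minimal PySpark sketch of the `F.input_file_name` approach; the input path and the `source_file` column name below are placeholders for illustration, not anything from this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read JSON and record which file each row came from, rather than
# overloading the corrupt record column with the filename.
df = (spark.read
      .json("/path/to/json/")                        # placeholder path
      .withColumn("source_file", F.input_file_name()))

df.select("source_file").distinct().show(truncate=False)
```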
The question then is what to put in the corrupt record column, if one is
defined, when in `wholeFile` mode. To retain consistency with the string paths
we should really put the entire file in the column. This is problematic for
large files (>2GB) since Spark SQL doesn't have blob support... so the
allocations will fail (along with the task) and there is no way for the end
user to work around this limitation. Functions like `substr` are applied to
byte arrays and not file streams. Perhaps it's good enough to issue a warning
(along the lines of "don't define a corrupt record column in `wholeFile` mode")
and hope for the best?
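
To make the scenario concrete, here is a hedged sketch of reading in `wholeFile` mode with a corrupt record column defined; the path, schema, and column name are placeholders, and `wholeFile` is the option name as proposed in this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("_corrupt_record", StringType()),  # corrupt record column
])

df = (spark.read
      .schema(schema)
      .option("wholeFile", "true")                            # one JSON document per file
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/path/to/large/json/files/"))                    # placeholder path

# If a file fails to parse, the entire file body would land in
# _corrupt_record, which cannot work once the file exceeds the ~2GB
# byte array limit.
df.filter(df["_corrupt_record"].isNotNull()).show(1)
```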