Github user tigerquoll commented on the pull request:
https://github.com/apache/spark/pull/5250#issuecomment-87650287
Hi Sean,
Thanks for your input - your views have helped me refine my thinking on the
matter. I agree that, from a purist's point of view, the source of the problem
is (likely) the data producer, and that it should be fixed at the data
producer's end.
The point is that this problem is affecting many Spark users right now, and
many of them are not in control of the source system of the data they are
analysing and are forced to 'make do' with what they have. You call this
solution a band-aid, but many ETL solutions are band-aids, and providing this
functionality is useful and serves a purpose for the end user.
Are you concerned that swallowing an exception could leave the Hadoop input
libraries in an inconsistent state, causing more data corruption? This will
not happen, because swallowing the exception immediately ends the file-reading
task, and the task reads no more data.
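
To illustrate the behaviour described above, here is a minimal Python sketch.
It is not the code in this pull request: `read_records`, `ignore_corrupt` and
`FakeReader` are hypothetical names, and `FakeReader` only stands in for a
Hadoop record reader. The point it shows is that once the exception is
swallowed, the generator returns and the reader is never called again.

```python
from typing import Iterator, List, Optional


class FakeReader:
    """Hypothetical stand-in for a record reader that hits corrupt data."""

    def __init__(self, records: List[str], corrupt_at: int) -> None:
        self.records = records
        self.corrupt_at = corrupt_at  # index at which the read fails
        self.pos = 0
        self.calls = 0  # how many times the reader was asked for data

    def next_record(self) -> Optional[str]:
        self.calls += 1
        if self.pos == self.corrupt_at:
            raise OSError("corrupt block")
        if self.pos >= len(self.records):
            return None  # normal end of input
        value = self.records[self.pos]
        self.pos += 1
        return value


def read_records(reader: FakeReader, ignore_corrupt: bool) -> Iterator[str]:
    """Yield records until the input ends or a read fails."""
    while True:
        try:
            record = reader.next_record()
        except OSError:
            if not ignore_corrupt:
                raise  # option disabled: the task fails as it does today
            return  # option enabled: swallow the error and stop reading
        if record is None:
            return
        yield record


reader = FakeReader(["a", "b", "c", "d"], corrupt_at=2)
good = list(read_records(reader, ignore_corrupt=True))
print(good, reader.calls)
```

With the option enabled, the records read before the failure are kept and the
rest of the input is skipped; with it disabled, the `OSError` propagates and
the task fails.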
Are you concerned that swallowing an exception indicates that something has
potentially gone wrong earlier in the Hadoop input read, and that previously
read data could have been corrupted? The user already knows this is
potentially the case, because running the application without this option
enabled caused the application to terminate in the first place.
The fact that we are being more permissive of potentially corrupt data is a
show-stopper for making this the default behaviour - but I'm not proposing
that. I'm proposing a last-ditch option that an advanced user can knowingly
enable when attempting to deal with corrupted data, with the understanding
that their data could be made worse, but that most likely the corrupt data
will simply be omitted.
The alternative is to tell them that their data is not suitable for loading
into Spark, and that they should either use another tool or tell the data
system owner to fix the data feeds and get back to them with another data set
some time in the future. I know which option I would prefer if given the
choice - don't let perfect be the enemy of good.