Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/5250#issuecomment-87653690
I don't know if corrupted gzip files are such a common problem, but I'm not
sure that would change the logic about where to fix things. It is a problem
with the preceding ETL process, yes. Something else needs to explicitly check
and/or fix the input first if this is a problem.
I suppose my point too is that this change does not just address the
proposed problem with gzip files. It treats any error as recoverable.
It has nothing to do with inconsistent state. The issue is presenting a
successful result that is actually silently missing input, which might not even
be deterministic. This seems far more problematic than reliably failing fast
and, yes, making you fix your upstream process.
Hiding this behind a flag only goes so far. The flag has to be documented (or
else how many people does it help?). It becomes a code path that has to be
supported for a long time. It is presented to users as a fine thing to do when I don't believe
it is. It's not the good being the enemy of the perfect, but the dangerous
being the enemy of the good.
This is nothing to do with telling people they can't use Spark, or have to
fix an unfixable upstream process. This is about appropriately dealing with bad
upstream data in the right place, and this is not how to do it.
Specifically: why not write a process that just opens a stream on each
input file in turn and tries to read a handful of bytes? If it fails, delete
the file or do what you like with it. This is maybe 10 lines of code in your
driver.
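
A minimal sketch of that kind of pre-check, in Python for brevity. It assumes
the inputs are gzip files on a local filesystem; the function names and the
`probe_bytes` parameter are made up for illustration, and files on HDFS would
need to be opened through the Hadoop FileSystem API instead. Reading only a
few bytes catches a bad header or a badly damaged start of the stream, not
corruption further into the file.

```python
import gzip
import zlib


def is_readable_gzip(path, probe_bytes=16):
    """Return True if the first few decompressed bytes of `path` can be read.

    Hypothetical helper: it only probes the start of the stream, so damage
    later in the file will not be detected here.
    """
    try:
        with gzip.open(path, "rb") as stream:
            stream.read(probe_bytes)
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError; EOFError covers
        # truncated files; zlib.error covers a damaged deflate stream.
        return False


def split_inputs(paths, probe_bytes=16):
    """Partition `paths` into (readable, unreadable) lists, keeping order."""
    good, bad = [], []
    for path in paths:
        if is_readable_gzip(path, probe_bytes):
            good.append(path)
        else:
            bad.append(path)
    return good, bad
```

The driver would then pass only the readable list to the job and delete,
quarantine, or report the rest, so the job itself still fails fast on anything
unexpected.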