Github user tigerquoll commented on a diff in the pull request:
https://github.com/apache/spark/pull/5250#discussion_r27377569
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -246,6 +249,15 @@ class HadoopRDD[K, V](
} catch {
case eof: EOFException =>
finished = true
+ case e: Exception =>
--- End diff --
Having been on the receiving end of this, I know that the gzip module
throws an IOException, but unfortunately I don't know which exceptions
the Hadoop input modules throw, or whether they propagate exceptions up
from other third-party libraries. Catching such a broad exception is
mitigated by the fact that this particular option defaults to off and
should only be enabled when you are trying to parse files that you know
are corrupt. In that situation, once the option is turned on, we should
really try to finish processing the files to the best of our ability, so
I think catching 'Exception' is appropriate in this case.
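
For concreteness, here is a minimal sketch of the pattern I mean. This
is not the actual HadoopRDD patch; the names 'ignoreCorruptFiles' and
'CorruptRecordReader' are illustrative stand-ins for the proposed
option and the surrounding read loop:

    import java.io.EOFException

    // Sketch of the discussed catch pattern, not the real HadoopRDD code.
    // 'ignoreCorruptFiles' stands in for the proposed option (default off).
    object CorruptRecordReader {
      def readNext[T](ignoreCorruptFiles: Boolean)(read: => T): Option[T] = {
        try {
          Some(read)
        } catch {
          case _: EOFException =>
            // Truncated stream: stop cleanly, as the existing code does.
            None
          case e: Exception if ignoreCorruptFiles =>
            // Deliberately broad: gzip throws IOException, but other input
            // formats may surface different or wrapped exceptions. Only
            // swallow them when the user has explicitly opted in.
            System.err.println(s"Skipping corrupt record: ${e.getMessage}")
            None
        }
      }
    }

With the guard on the second case, leaving the option off preserves the
current behaviour (the exception still propagates and fails the task),
while turning it on lets the task drain what it can from a corrupt file.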