This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0ecc71b  [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
0ecc71b is described below

commit 0ecc71bbf979f13e7260af93c4bffa8c133dc9ea
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Fri Oct 8 09:04:13 2021 +0900

    [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the test failure:
    
    ```
    Running tests...
    ----------------------------------------------------------------------
    test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR (12.050s)
    
    ======================================================================
    ERROR [12.050s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", line 35, in test_read_images
    self.assertEqual(df.count(), 4)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 507, in count
    return int(self._jdf.count())
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
    answer, self.gateway_client, self.target_id, self.name)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, amp-jenkins-worker-05.amp, executor driver): javax.imageio.IIOException: Unsupported Image Type
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
    at javax.imageio.ImageIO.read(ImageIO.java:1448)
    at javax.imageio.ImageIO.read(ImageIO.java:1352)
    ```
    
    This exception apparently occurs when malformed images are read with the `dropInvalid` option set: `ImageIO.read` throws `javax.imageio.IIOException`, which is not a `RuntimeException`, so the existing handler does not catch it.
    
    In fact, the bytes are already in memory, so no real I/O exception is expected during `ImageIO.read`. Therefore, this PR proposes catching all exceptions when reading an image so that malformed images are handled properly.
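    
    As a minimal sketch of the point above (not the actual Spark code), the snippet below checks that `javax.imageio.IIOException` is an `IOException` rather than a `RuntimeException`, and then decodes in-memory bytes with a catch-all handler. `DecodeSketch` and `decodeOrNull` are hypothetical names used only for illustration:
    
    ```scala
    import java.awt.image.BufferedImage
    import java.io.{ByteArrayInputStream, IOException}
    import javax.imageio.{IIOException, ImageIO}
    
    // Hypothetical illustration; not part of Spark's ImageSchema.
    object DecodeSketch {
      // Returns the decoded image, or null when the bytes cannot be decoded.
      // Mirrors the catch-all handler proposed in this PR.
      def decodeOrNull(bytes: Array[Byte]): BufferedImage =
        try {
          ImageIO.read(new ByteArrayInputStream(bytes))
        } catch {
          case _: Throwable => null
        }
    
      def main(args: Array[String]): Unit = {
        val e: Throwable = new IIOException("Unsupported Image Type")
        // IIOException extends IOException, so a handler that only matches
        // RuntimeException lets it propagate.
        println(e.isInstanceOf[IOException])      // true
        println(e.isInstanceOf[RuntimeException]) // false
        // Bytes that are not a valid image are reported as null.
        println(decodeOrNull(Array[Byte](1, 2, 3)) == null)
      }
    }
    ```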
    
    I am not yet sure why the test is flaky rather than consistently failing. However, the fix should be correct.
    
    ### Why are the changes needed?
    
    To fix the flaky tests; see https://github.com/apache/spark/runs/3802639160 for an example.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users would be able to read malformed data even when `javax.imageio.IIOException` (or another unexpected non-runtime exception) is thrown, provided the `dropInvalid` option is enabled.
    
    ### How was this patch tested?
    
    Existing unit tests. We should track whether the tests are still flaky.
    
    Closes #34187 from HyukjinKwon/SPARK-29871.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index 37b7159..242496f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -133,9 +133,12 @@ object ImageSchema {
     val img = try {
       ImageIO.read(new ByteArrayInputStream(bytes))
     } catch {
-      // Catch runtime exception because `ImageIO` may throw unexpected `RuntimeException`.
-      // But do not catch the declared `IOException` (regarded as FileSystem failure)
-      case _: RuntimeException => null
+      // Note that:
+      // - At this point, the files are already read from the files as bytes. Therefore,
+      //   no real I/O exceptions are expected.
+      // - `ImageIO.read` can throw `javax.imageio.IIOException` that is technically
+      //   a runtime exception but it inherits IOException.
+      case _: Throwable => null
     }
 
     if (img == null) {

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
