[ 
https://issues.apache.org/jira/browse/BEAM-12857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477130#comment-17477130
 ] 

Sam Whittle commented on BEAM-12857:
------------------------------------

From looking at the code, it does seem that such an exception could be 
encountered with the following:
ignoreMissingSrc was false
skipExistingDest was true
matchSrcResults.get(i).status() was NOT_FOUND and metadata was empty/null
matchDestResults.get(i).status().equals(Status.OK)

This would trigger if the src was missing and the destination already existed.  
This could happen if the file rename stage ran multiple times due to a retry.
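
The combination above can be modeled with a minimal sketch. The class and 
method names below (RenameRetrySketch, FakeMatchResult, shouldSkip) are 
hypothetical stand-ins and are not the real Beam classes; the comparison 
logic is simplified and only illustrates how an unguarded get(0) on an empty 
metadata list fails under these flag values.

```java
import java.util.Collections;
import java.util.List;

/** Simplified model of the failure condition; NOT the actual FileSystems code. */
public class RenameRetrySketch {
  enum Status { OK, NOT_FOUND }

  /** Hypothetical stand-in for a match result: a status plus a metadata list. */
  static final class FakeMatchResult {
    final Status status;
    final List<String> metadata;

    FakeMatchResult(Status status, List<String> metadata) {
      this.status = status;
      this.metadata = metadata;
    }
  }

  /** Returns true if the rename of this file would be skipped. */
  static boolean shouldSkip(
      boolean ignoreMissingSrc,
      boolean skipExistingDest,
      FakeMatchResult src,
      FakeMatchResult dest) {
    if (ignoreMissingSrc && src.status == Status.NOT_FOUND) {
      return true; // a missing source is tolerated only when this flag is set
    }
    if (skipExistingDest && dest.status == Status.OK) {
      // Unguarded access: throws IndexOutOfBoundsException if src.metadata is empty.
      String srcMeta = src.metadata.get(0);
      String destMeta = dest.metadata.get(0);
      return srcMeta.equals(destMeta);
    }
    return false;
  }

  public static void main(String[] args) {
    // State after a retried rename: the source is gone, the destination exists.
    FakeMatchResult missingSrc =
        new FakeMatchResult(Status.NOT_FOUND, Collections.<String>emptyList());
    FakeMatchResult existingDest =
        new FakeMatchResult(Status.OK, Collections.singletonList("final-file"));
    try {
      shouldSkip(false, true, missingSrc, existingDest);
      System.out.println("no exception");
    } catch (IndexOutOfBoundsException e) {
      System.out.println("IndexOutOfBoundsException");
    }
  }
}
```

With ignoreMissingSrc set to true in the same model, the early return is taken 
and the metadata list is never indexed.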

https://github.com/apache/beam/pull/15301 changed this so that the source 
files are only matched if ignoreMissingSrc is true, so this bug itself 
wouldn't occur around the metadata.  However, there is a similar bug still 
present if ignoreMissingSrc is false and skipExistingDest is true, since that 
path examines the src results without populating them.
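
One possible shape for a guard is sketched below. This is an illustration 
only, not the change made in the PR or a proposed patch: GuardedSkipSketch 
and safeToSkip are hypothetical names, and metadata is modeled as plain 
strings. The idea is simply to check that the src results were populated, and 
that both metadata lists are non-empty, before indexing into them.

```java
import java.util.Collections;
import java.util.List;

/** Illustrative guard; NOT the actual FileSystems code or the fix in the PR. */
public class GuardedSkipSketch {

  /**
   * Returns true only when both sides have metadata for index i and it matches.
   * Never indexes into a list that was left unpopulated or empty.
   */
  static boolean safeToSkip(
      boolean skipExistingDest,
      List<List<String>> srcMetadata,
      List<List<String>> destMetadata,
      int i) {
    if (!skipExistingDest) {
      return false;
    }
    // The src results may never have been populated (the case described above).
    if (i >= srcMetadata.size() || i >= destMetadata.size()) {
      return false;
    }
    List<String> src = srcMetadata.get(i);
    List<String> dest = destMetadata.get(i);
    if (src.isEmpty() || dest.isEmpty()) {
      return false; // nothing to compare, so do not skip
    }
    return src.get(0).equals(dest.get(0));
  }

  public static void main(String[] args) {
    List<List<String>> unpopulatedSrc = Collections.emptyList();
    List<String> destEntry = Collections.singletonList("final-file");
    List<List<String>> dest = Collections.singletonList(destEntry);
    // Falls through to "do not skip" instead of throwing.
    System.out.println(safeToSkip(true, unpopulatedSrc, dest, 0));
  }
}
```

Whether "do not skip" is the right behavior when the source cannot be 
examined is a separate design question; the sketch only shows the bounds 
checks.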

> Unable to write to GCS due to IndexOutOfBoundsException in FileSystems
> ----------------------------------------------------------------------
>
>                 Key: BEAM-12857
>                 URL: https://issues.apache.org/jira/browse/BEAM-12857
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.31.0, 2.32.0
>         Environment: Beam 2.31.0/2.32.0, Java 11, GCP Dataflow
>            Reporter: Patrick Lucas
>            Priority: P2
>
> I have a simple batch job, running on Dataflow, that reads from a GCS bucket, 
> filters the data, and windows and writes the matching data back to a 
> different path in the same bucket.
> The job seems to succeed in reading and filtering the data, as well as 
> writing temporary files to GCS, but appears to fail when trying to rename the 
> temporary files to their final destination.
> The IndexOutOfBoundsException is thrown from 
> [FileSystems.java:429|https://github.com/apache/beam/blob/v2.32.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L429]
>  (in 2.32.0), when the code calls {{.get(0)}} on the list returned by a call 
> to {{MatchResult#metadata()}}.
> The javadoc for 
> [{{MatchResult#metadata()}}|https://github.com/apache/beam/blob/v2.32.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/MatchResult.java#L75-L80]
>  says,
> {code:java}
>   /**
>    * {@link Metadata} of matched files. Note that if {@link #status()} is {@link Status#NOT_FOUND},
>    * this may either throw a {@link java.io.FileNotFoundException} or return an empty list,
>    * depending on the {@link EmptyMatchTreatment} used in the {@link FileSystems#match} call.
>    */
> {code}
> So possibly GCS is not returning any metadata for the (missing) destination 
> object? That seems unlikely, as I would expect many others would have already 
> run into this, but I don't see how this could be caused by my user code.
> I have tested this on 2.31.0 and 2.32.0 getting the same error, but it's 
> worth noting that the logic in FileSystems.java changed a decent amount 
> recently in [#15301|https://github.com/apache/beam/pull/15301], maybe having 
> an effect on this, but I haven't been able to test it since I'm working in a 
> closed environment and can only easily use released versions of Beam. Once a 
> version containing this change is released, I will upgrade and try again.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
