[ 
https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver resolved BEAM-7131.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.13.0

> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
>                 Key: BEAM-7131
>                 URL: https://issues.apache.org/jira/browse/BEAM-7131
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>             Fix For: 2.13.0
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to