[ 
https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver updated BEAM-7131:
------------------------------
    Issue Type: Bug  (was: Improvement)

> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
>                 Key: BEAM-7131
>                 URL: https://issues.apache.org/jira/browse/BEAM-7131
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark
> portable runner. TFDV works fine, but the preprocess step
> (preprocess_flink.sh [2]) fails with the following error:
>
> RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']
>
> The copy tree operation in transform_fn_io.py [3] appears to run twice.
> The error goes away when that code is modified to allow overwriting
> existing files, but that is only a shallow fix. The deeper problem seems
> to be that the Spark runner is repeating work for some reason.
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]
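
The failure mode described in the issue can be illustrated with a minimal sketch. It assumes nothing about the actual code in transform_fn_io.py: the standard library's shutil.copytree stands in for the copy tree operation, and the names copy_tree, run_twice, and data.txt are hypothetical. The point is only that a copy which refuses to overwrite fails the second time it runs, while one that allows overwriting hides the repeated work. It needs Python 3.8 or later for the dirs_exist_ok parameter.

```python
import os
import shutil
import tempfile


def copy_tree(source, destination, overwrite=False):
    """Copy a directory tree (hypothetical stand-in, not the TFT code).

    With overwrite=False, shutil.copytree raises FileExistsError if the
    destination directory already exists.
    """
    shutil.copytree(source, destination, dirs_exist_ok=overwrite)


def run_twice(overwrite):
    """Run the same copy twice, as a runner that repeats work would.

    Returns the number of attempts that succeeded.
    """
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "source")
        destination = os.path.join(root, "destination")
        os.makedirs(source)
        # Placeholder file so that the tree has something to copy.
        with open(os.path.join(source, "data.txt"), "w") as f:
            f.write("placeholder")

        succeeded = 0
        for _ in range(2):
            try:
                copy_tree(source, destination, overwrite=overwrite)
                succeeded += 1
            except FileExistsError:
                # The second attempt lands here when overwriting is not allowed.
                pass
        return succeeded


if __name__ == "__main__":
    print("without overwrite:", run_twice(overwrite=False))
    print("with overwrite:", run_twice(overwrite=True))
```

Allowing overwrites makes both attempts succeed, which matches the shallow fix mentioned in the issue, but the copy still runs twice. The sketch does not explain why the runner would repeat the step.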



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)