[ 
https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver updated BEAM-7131:
------------------------------
    Description: 
I've been trying to run the TFX Chicago taxi example [1] on the Spark portable 
runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2]) 
fails with the following error:

RuntimeError: AlreadyExistsError: file already exists [while running 
'WriteTransformFn/WriteTransformFn']

Assets are being written multiple times to different temp directories, which is 
okay, but the error occurs when they are copied to the same permanent output 
directory. Specifically, the copy tree operation in transform_fn_io.py [3] is 
run twice with the same output directory. The error doesn't occur when that 
code is modified to allow overwriting existing files, but that only masks the 
symptom. The underlying problem appears to be that the Spark runner executes 
this step more than once; the cause is not yet known.
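
The failure mode described above can be illustrated with a minimal, 
self-contained sketch. This is not the code in transform_fn_io.py; copy_tree, 
simulate_repeated_write, the overwrite flag, and the file names are 
hypothetical stand-ins built on the Python standard library. It only assumes 
that the same assets are written to two different temp directories and then 
copied into one shared output directory:

```python
import os
import shutil
import tempfile


def copy_tree(source_dir, target_dir, overwrite=False):
    """Recursively copy source_dir into target_dir.

    Hypothetical stand-in for the copy tree step, not the tf.Transform code.
    With overwrite=False, an existing destination file is an error.
    """
    os.makedirs(target_dir, exist_ok=True)
    for name in os.listdir(source_dir):
        src = os.path.join(source_dir, name)
        dst = os.path.join(target_dir, name)
        if os.path.isdir(src):
            copy_tree(src, dst, overwrite=overwrite)
        else:
            if os.path.exists(dst) and not overwrite:
                raise FileExistsError("file already exists: " + dst)
            shutil.copyfile(src, dst)


def simulate_repeated_write(overwrite):
    """Write the same asset to two temp dirs, then copy both to one output dir."""
    with tempfile.TemporaryDirectory() as root:
        output_dir = os.path.join(root, "output")
        for attempt in range(2):
            # Each attempt gets its own temp directory, so writing is safe.
            temp_dir = os.path.join(root, "tmp-%d" % attempt)
            os.makedirs(os.path.join(temp_dir, "assets"))
            with open(os.path.join(temp_dir, "assets", "vocab.txt"), "w") as f:
                f.write("a\nb\n")
            # Both attempts target the same output directory; the second
            # copy collides unless overwriting is allowed.
            copy_tree(temp_dir, output_dir, overwrite=overwrite)
        return sorted(os.listdir(os.path.join(output_dir, "assets")))
```

Calling simulate_repeated_write(overwrite=False) raises FileExistsError on the 
second copy, while simulate_repeated_write(overwrite=True) completes. As noted 
above, allowing the overwrite hides the collision without explaining why the 
step runs twice.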

[1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]

[2] 
[https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]

[3] 
[https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]

  was:
I've been trying to run the TFX Chicago taxi example [1] on the Spark portable 
runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2]) 
fails with the following error:

RuntimeError: AlreadyExistsError: file already exists [while running 
'WriteTransformFn/WriteTransformFn']

The copy tree operation in transform_fn_io.py [3] is seemingly being run twice. 
This problem doesn't occur when that code is modified to allow overwriting 
existing files, but that's only a shallow fix. The deeper problem here seems to 
be that the Spark runner is repeating work for some reason.

[1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]

[2] 
[https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]

[3] 
[https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]


> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
>                 Key: BEAM-7131
>                 URL: https://issues.apache.org/jira/browse/BEAM-7131
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that only 
> masks the symptom. The underlying problem appears to be that the Spark runner 
> executes this step more than once; the cause is not yet known.
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
