[ https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kyle Weaver updated BEAM-7131:
------------------------------
Issue Type: Bug (was: Improvement)
> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Kyle Weaver
> Assignee: Kyle Weaver
> Priority: Major
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark
> portable runner. TFDV works fine, but the preprocess step
> (preprocess_flink.sh [2]) fails with the following error:
>
>   RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']
>
> The copy-tree operation in transform_fn_io.py [3] appears to run twice.
> The error goes away when that code is modified to allow overwriting
> existing files, but that only masks the symptom. The underlying problem
> seems to be that the Spark runner is repeating work for some reason.
>
> [1] https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi
> [2] https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh
> [3] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45
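
The failure mode described above can be reproduced in miniature with the Python standard library. This is only an illustrative sketch, not the code in transform_fn_io.py: it uses shutil.copytree as a stand-in for the copy-tree operation, and the function names, directory names, and file name are hypothetical. It shows why a non-idempotent copy fails when the same step runs twice, and why allowing overwrites hides the repetition rather than preventing it.

```python
import os
import shutil
import tempfile


def copy_tree_strict(src, dst):
    """Copy src to dst; raises FileExistsError if dst already exists."""
    shutil.copytree(src, dst)


def copy_tree_overwrite(src, dst):
    """Copy src to dst, tolerating an existing dst (the 'shallow fix').

    dirs_exist_ok requires Python 3.8 or later.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)


def run_twice(copy_fn):
    """Simulate a runner executing the same write step twice.

    Returns the name of the exception raised, or None if both runs succeed.
    """
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: a source tree and an output location.
        src = os.path.join(root, "transform_fn")
        dst = os.path.join(root, "output", "transform_fn")
        os.makedirs(src)
        with open(os.path.join(src, "saved_model.pb"), "w") as f:
            f.write("placeholder")
        try:
            copy_fn(src, dst)  # first execution: creates dst
            copy_fn(src, dst)  # repeated execution: dst already exists
        except FileExistsError as e:
            return type(e).__name__
    return None


if __name__ == "__main__":
    print("strict copy, run twice:   ", run_twice(copy_tree_strict))
    print("overwrite copy, run twice:", run_twice(copy_tree_overwrite))
```

The overwrite variant succeeds on the second run, but the work is still done twice, which is the behavior this issue is about.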