Kyle Weaver created BEAM-7131:
---------------------------------
Summary: Spark portable runner appears to be repeating work (in TFX example)
Key: BEAM-7131
URL: https://issues.apache.org/jira/browse/BEAM-7131
Project: Beam
Issue Type: Improvement
Components: runner-spark
Reporter: Kyle Weaver
Assignee: Kyle Weaver
I've been trying to run the TFX Chicago taxi example [1] on the Spark portable
runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2])
fails with the following error:
RuntimeError: AlreadyExistsError: file already exists [while running
'WriteTransformFn/WriteTransformFn']
The copy tree operation in transform_fn_io.py [3] appears to be running twice.
The error goes away when that code is modified to allow overwriting existing
files, but that is only a shallow fix. The deeper problem seems to be that the
Spark runner is repeating work, for reasons not yet identified.
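As a rough illustration of the failure mode, here is a minimal sketch that uses
Python's standard shutil.copytree as a stand-in for the copy in
transform_fn_io.py. It is not the actual tensorflow_transform code. The helper
names (copy_tree, run_twice) and the file and directory names are hypothetical,
and dirs_exist_ok requires Python 3.8 or later. Running the same copy step
twice fails unless overwriting is allowed:

```python
import os
import shutil
import tempfile


def copy_tree(src, dst, overwrite=False):
    """Copy src into dst; fail if dst exists unless overwrite is True."""
    # shutil.copytree raises FileExistsError when dst already exists,
    # unless dirs_exist_ok=True is passed (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=overwrite)


def run_twice(overwrite):
    """Simulate a runner that executes the same write step twice."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical paths, used only for this illustration.
        src = os.path.join(root, "transform_fn_tmp")
        dst = os.path.join(root, "transform_fn")
        os.makedirs(src)
        with open(os.path.join(src, "saved_model.pb"), "w") as f:
            f.write("placeholder")

        copy_tree(src, dst, overwrite)      # first execution succeeds
        try:
            copy_tree(src, dst, overwrite)  # repeated execution
        except FileExistsError:
            return "failed"
        return "ok"
```

Allowing overwrites makes the step tolerate being repeated, but it only hides
the repetition; it does not explain why the step ran twice.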
[1] https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi
[2] https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh
[3] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45