[
https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835962#comment-16835962
]
Kyle Weaver commented on BEAM-7131:
-----------------------------------
[~robertwb] I narrowed it down to the simplest pipeline that exhibits this
behavior.
If PCollections B and C both depend on A, the Spark portable runner computes
A B C A B C, whereas the Flink and legacy Spark runners compute each only
once (A B C).
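
To make the shape concrete, here is a toy model in plain Python. It is not
Beam or Spark code, and the Node class, its cache flag, and run_diamond are
hypothetical names used only for illustration. It shows how a shared upstream
step can end up evaluated more than once when nothing memoizes it. It does not
try to reproduce the exact A B C A B C order seen on the portable runner, and
it makes no claim about the actual cause.

```python
class Node:
    """A lazily evaluated step that records each time it actually runs."""

    def __init__(self, name, fn, parents=(), cache=False, log=None):
        self.name = name
        self.fn = fn
        self.parents = tuple(parents)
        self.cache = cache
        self.log = log if log is not None else []
        self._value = None
        self._done = False

    def compute(self):
        # With caching enabled, reuse the previously computed value.
        if self.cache and self._done:
            return self._value
        inputs = [p.compute() for p in self.parents]
        self.log.append(self.name)  # record that this step really executed
        self._value = self.fn(*inputs)
        self._done = True
        return self._value


def run_diamond(cache):
    """Build A -> B and A -> C, evaluate both outputs, return execution order."""
    log = []
    a = Node("A", lambda: [1, 2, 3], cache=cache, log=log)
    b = Node("B", lambda xs: [x + 1 for x in xs], [a], cache=cache, log=log)
    c = Node("C", lambda xs: [x * 2 for x in xs], [a], cache=cache, log=log)
    b.compute()
    c.compute()
    return log


if __name__ == "__main__":
    print(run_diamond(cache=False))  # A runs once per consumer
    print(run_diamond(cache=True))   # A runs once in total
```

Comparing the two logs is the same check I used on the real pipeline: count
how many times each step actually executes.
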
> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Kyle Weaver
> Assignee: Kyle Weaver
> Priority: Major
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark
> portable runner. TFDV works fine, but the preprocess step
> (preprocess_flink.sh [2]) fails with the following error:
>
> RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']
>
> Assets are being written multiple times to different temp directories, which
> is okay, but the error occurs when they are copied to the same permanent
> output directory. Specifically, the copy tree operation in transform_fn_io.py
> [3] is run twice with the same output directory. The error doesn't occur when
> that code is modified to allow overwriting existing files, but that's only a
> shallow fix. While the TF transform should probably be made idempotent, this
> is also an issue with the Spark runner, which shouldn't be repeating work
> like this regularly (in the absence of a failure condition).
>
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]
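
For reference, the "allow overwriting" workaround mentioned in the description
above can be sketched with the standard library. This is only a stand-in: the
real code in transform_fn_io.py goes through TensorFlow's file utilities, and
copy_tree_idempotent, the directory names, and the file name below are
hypothetical. As the description says, this hides the symptom and does not
address the repeated work in the runner.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tree_idempotent(src, dst):
    """Copy src into dst, tolerating a dst that already exists.

    Local-filesystem stand-in for the copy step (an assumption, not the
    tf.Transform implementation). Requires Python 3.8+ for dirs_exist_ok.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "temp_assets"
        dst = Path(tmp) / "permanent_output"
        src.mkdir()
        (src / "vocab.txt").write_text("a\nb\n")
        copy_tree_idempotent(src, dst)
        copy_tree_idempotent(src, dst)  # second run must not raise
        print(sorted(p.name for p in dst.iterdir()))
```

Running the copy twice against the same destination mirrors what the portable
runner does to the write step, without the AlreadyExistsError.
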
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)