Robert Bradshaw created BEAM-6243:
-------------------------------------
Summary: TFX pipelines experience a huge blowup in intermediate
data size
Key: BEAM-6243
URL: https://issues.apache.org/jira/browse/BEAM-6243
Project: Beam
Issue Type: Sub-task
Components: runner-flink
Reporter: Robert Bradshaw
The elements in TFX intermediate collections are dictionaries of (typically
single-element) numpy arrays, which are (relatively) expensive to serialize
(e.g. using pickle for the numpy wrapper of a primitive int/float, repeating
the column names in every element).
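
A rough way to see the overhead described above is to pickle one such element and compare it with the raw numeric payload. This is only a sketch: the column names, dtypes, and row count are made up for illustration, and plain `pickle` stands in for whatever coder the pipeline actually uses.

```python
import pickle

import numpy as np

# Hypothetical element: a dict of single-element numpy arrays.
# Column names and values are invented for this illustration.
row = {
    "age": np.array([42], dtype=np.int64),
    "income": np.array([1234.5], dtype=np.float32),
}


def pickled_size(obj):
    """Return the number of bytes pickle needs to encode obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Raw numeric payload: one 8-byte int plus one 4-byte float.
raw_payload_bytes = sum(arr.nbytes for arr in row.values())

# Per-element encoding: every element carries its own column names
# and its own numpy array metadata.
per_row_bytes = pickled_size(row)

# Columnar alternative for comparison: names and metadata appear once.
num_rows = 1000
columns = {name: np.repeat(arr, num_rows) for name, arr in row.items()}
columnar_bytes = pickled_size(columns)

print("raw payload per element:", raw_payload_bytes)
print("pickled size per element:", per_row_bytes)
print("pickled per element x", num_rows, ":", per_row_bytes * num_rows)
print("pickled columnar for", num_rows, "rows:", columnar_bytes)
```

The exact byte counts depend on the numpy and pickle versions, so none are quoted here; the point is only the ratio between the per-element and columnar totals.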
Though it'd be good to use a better intermediate representation, this is
exacerbated because the fusion algorithm does not pack as much as possible
into executable stages.
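
To illustrate why fusion matters here, the toy model below counts how many producer-to-consumer edges cross a stage boundary, on the assumption that elements are serialized only when they cross one. The transform names and stage assignments are hypothetical, and this is not the runner's actual fusion algorithm.

```python
def count_serialized_edges(edges, stage_of):
    """Count producer -> consumer edges whose endpoints sit in different stages."""
    return sum(
        1 for producer, consumer in edges
        if stage_of[producer] != stage_of[consumer]
    )


# Hypothetical linear pipeline; names are for illustration only.
edges = [
    ("read", "parse"),
    ("parse", "analyze"),
    ("analyze", "transform"),
    ("transform", "write"),
]

# Every transform in its own stage: each edge materializes elements.
poorly_fused = {"read": 0, "parse": 1, "analyze": 2, "transform": 3, "write": 4}

# Everything packed into one stage: no edge crosses a boundary.
well_fused = {name: 0 for name in poorly_fused}

print(count_serialized_edges(edges, poorly_fused))  # 4
print(count_serialized_edges(edges, well_fused))    # 0
```

Under this model, each extra stage boundary multiplies the cost of the expensive element encoding, which is why the two problems compound.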