Robert Bradshaw created BEAM-6243:
-------------------------------------

             Summary: TFX pipelines experience a huge blowup in intermediate data size
                 Key: BEAM-6243
                 URL: https://issues.apache.org/jira/browse/BEAM-6243
             Project: Beam
          Issue Type: Sub-task
          Components: runner-flink
            Reporter: Robert Bradshaw


The elements in TFX intermediate collections are dictionaries of (typically 
single-element) numpy arrays, which are (relatively) expensive to serialize 
(e.g. using pickle for the numpy wrapper of a primitive int/float, repeating 
the column names in every element).
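
A rough way to see the overhead described above is to compare the raw numeric 
payload with the pickled size of per-element dictionaries. This is only a 
sketch: the feature names ("user_id", "score"), the row count, and the 
columnar comparison are assumptions made for illustration, and plain pickle 
stands in for whatever coder the pipeline actually uses.

```python
import pickle

import numpy as np


def make_rows(n):
    """Build n elements shaped like the ones described in this issue:
    a dict of single-element numpy arrays (feature names are hypothetical)."""
    return [
        {
            "user_id": np.array([i], dtype=np.int64),
            "score": np.array([i * 0.5], dtype=np.float64),
        }
        for i in range(n)
    ]


def raw_payload_size(rows):
    """Bytes of actual numeric data, ignoring all serialization overhead."""
    return sum(arr.nbytes for row in rows for arr in row.values())


def per_element_size(rows):
    """Total bytes when every element is pickled on its own, so the column
    names and the numpy wrapper metadata are repeated for each element."""
    return sum(
        len(pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL)) for row in rows
    )


def columnar_size(rows):
    """Bytes when the same data is pickled once as one array per column."""
    columns = {
        name: np.concatenate([row[name] for row in rows]) for name in rows[0]
    }
    return len(pickle.dumps(columns, protocol=pickle.HIGHEST_PROTOCOL))


rows = make_rows(1000)
raw = raw_payload_size(rows)
per_elem = per_element_size(rows)
columnar = columnar_size(rows)

print("raw payload bytes:       ", raw)
print("pickled per element:     ", per_elem)
print("pickled as column arrays:", columnar)
print("per-element blowup: %.1fx" % (per_elem / raw))
```

The exact ratio depends on the numpy and pickle versions, but the per-element 
total should come out many times larger than the raw payload, while the 
columnar form stays close to it.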

Though it'd be good to use a better intermediate representation, the blowup 
is exacerbated because the fusion algorithm does not pack as much as possible 
into executable stages, so more of these collections have to be serialized 
between stages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
