[ https://issues.apache.org/jira/browse/BEAM-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Bradshaw updated BEAM-6243:
----------------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent:     (was: BEAM-6015)

> TFX pipelines experience a huge blowup in intermediate data size
> ----------------------------------------------------------------
>
>                 Key: BEAM-6243
>                 URL: https://issues.apache.org/jira/browse/BEAM-6243
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Robert Bradshaw
>            Priority: Major
>
> The elements in TFX intermediate collections are dictionaries of (typically
> single-element) numpy arrays, which are (relatively) expensive to serialize
> (e.g. using pickle for the numpy wrapper of a primitive int/float, repeating
> the column names in every element).
> Though it'd be good to use a better intermediate representation, this is
> exacerbated because the fusion algorithm does not pack as much as possible
> into executable stages.
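
The serialization overhead described in the issue can be illustrated with a rough sketch. It is not taken from TFX or Beam: the column names, the row count, and the `make_rows`, `per_element_bytes`, and `columnar_bytes` helpers are all hypothetical, and plain single-element lists stand in for the single-element numpy arrays so that the script runs with the standard library alone. It compares pickling every dictionary element on its own (column names repeated each time) against writing the column names once and packing the values as fixed-width binary.

```python
import pickle
import struct

# Hypothetical column names; real TFX feature names will differ.
COLUMNS = ("age", "income", "clicks")


def make_rows(n, wrap=lambda v: [v]):
    """Build n dict-shaped elements whose values are single-element containers.

    `wrap` is an assumption of this sketch: by default it produces a
    one-item list as a stand-in for a single-element numpy array.
    """
    return [
        {"age": wrap(i % 90), "income": wrap(1000.0 + i), "clicks": wrap(i % 7)}
        for i in range(n)
    ]


def per_element_bytes(rows):
    """Total size when each element is pickled independently."""
    return sum(
        len(pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL)) for row in rows
    )


def columnar_bytes(rows):
    """Total size when names are stored once and values are packed raw."""
    header = pickle.dumps(COLUMNS, protocol=pickle.HIGHEST_PROTOCOL)
    # "<qdq" = little-endian int64, float64, int64: 24 bytes per row.
    body = b"".join(
        struct.pack(
            "<qdq",
            int(row["age"][0]),
            float(row["income"][0]),
            int(row["clicks"][0]),
        )
        for row in rows
    )
    return len(header) + len(body)


rows = make_rows(1000)
per_elem = per_element_bytes(rows)
packed = columnar_bytes(rows)

print(f"per-element pickle: {per_elem} bytes")
print(f"columnar packing:   {packed} bytes")
print(f"blowup factor:      {per_elem / packed:.1f}x")
```

If numpy is installed, calling `make_rows(1000, wrap=lambda v: numpy.array([v]))` should show the additional cost of pickling the array wrapper itself, which is the case the issue singles out. The sketch measures encoded size only; it says nothing about the fusion behaviour that determines how often such elements are actually materialized between stages.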