[ https://issues.apache.org/jira/browse/BEAM-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Beam JIRA Bot updated BEAM-6243:
--------------------------------
    Labels: stale-P2  (was: )

> TFX pipelines experience a huge blowup in intermediate data size
> ----------------------------------------------------------------
>
>                 Key: BEAM-6243
>                 URL: https://issues.apache.org/jira/browse/BEAM-6243
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Robert Bradshaw
>            Priority: P2
>              Labels: stale-P2
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The elements in TFX intermediate collections are dictionaries of (typically
> single-element) numpy arrays, which are (relatively) expensive to serialize
> (e.g. using pickle for the numpy wrapper of a primitive int/float, repeating
> the column names in every element).
> Though it'd be good to use a better intermediate representation, this is
> exacerbated because the fusion algorithm does not pack as much as possible
> into executable stages.
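
The per-element overhead described in the issue can be illustrated with a rough sketch. It is not TFX or Beam code: the column names, row count, and helper functions below are hypothetical, and plain Python lists stand in for the single-element numpy arrays, so the extra cost of pickling the numpy wrapper is not modeled. It only shows how repeating the column names in every element inflates the serialized size compared with pickling the same values once per column.

```python
import pickle

# Hypothetical column names and row count, chosen only for illustration.
COLUMNS = ["user_id", "session_length", "click_count"]
NUM_ROWS = 1000


def make_rows(num_rows):
    """Build one dict per element, keyed by column name, with single-element lists."""
    return [
        {name: [i + j] for j, name in enumerate(COLUMNS)}
        for i in range(num_rows)
    ]


def per_element_size(rows):
    """Total bytes when every element is pickled on its own."""
    return sum(len(pickle.dumps(row)) for row in rows)


def columnar_size(rows):
    """Total bytes when the same values are pickled once, column by column."""
    columns = {name: [row[name][0] for row in rows] for name in COLUMNS}
    return len(pickle.dumps(columns))


rows = make_rows(NUM_ROWS)
row_bytes = per_element_size(rows)
col_bytes = columnar_size(rows)
print(f"per-element: {row_bytes} bytes, columnar: {col_bytes} bytes")
```

In the per-element case every pickled dict carries its own copy of the column names plus the pickle framing, which is the kind of repetition the issue points to. Fewer materialized stage boundaries would mean fewer points at which elements have to be serialized this way.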