Brian Hulette created BEAM-13335:
------------------------------------
Summary: DataFrame sources produce excessively large index
Key: BEAM-13335
URL: https://issues.apache.org/jira/browse/BEAM-13335
Project: Beam
Issue Type: Improvement
Components: dsl-dataframe
Reporter: Brian Hulette
Assignee: Robert Bradshaw
DataFrame reads attempt to match user expectations by giving every element
across all shards a unique index. This is done by embedding the file path
itself in the index, but this results in the (often quite long) path
being duplicated for every element, so the index is sometimes larger than
the data itself.
We should instead generate a guaranteed unique _numeric_ index.
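
One possible way to build such an index, as a rough sketch only: give each
shard a small integer ordinal and pack it together with the row offset into
a single integer. The names below (numeric_index, assign_indices, ROW_BITS)
and the 40-bit row budget are hypothetical and are not part of Beam's API;
the sketch also assumes shard ordinals can be agreed on up front, for
example by sorting the matched file paths.

```python
ROW_BITS = 40  # assumption: every shard holds fewer than 2**40 rows


def numeric_index(shard_ordinal, row_offset, row_bits=ROW_BITS):
    """Pack (shard ordinal, row offset) into one unique integer."""
    if shard_ordinal < 0:
        raise ValueError("shard_ordinal must be non-negative")
    if not 0 <= row_offset < (1 << row_bits):
        raise ValueError("row_offset does not fit in row_bits")
    # High bits identify the shard, low bits identify the row within it.
    return (shard_ordinal << row_bits) | row_offset


def assign_indices(row_counts_by_path):
    """Map each file path to the list of numeric indices for its rows.

    Ordinals come from the sorted order of the paths, so the path string
    is used once per shard instead of once per element.
    """
    result = {}
    for ordinal, path in enumerate(sorted(row_counts_by_path)):
        count = row_counts_by_path[path]
        result[path] = [numeric_index(ordinal, row) for row in range(count)]
    return result


if __name__ == "__main__":
    shards = {"gs://bucket/some/long/path/a.csv": 3,
              "gs://bucket/some/long/path/b.csv": 2}
    for path, indices in assign_indices(shards).items():
        print(path, indices)
```

With this layout each element carries one fixed-width integer rather than a
copy of the path, and two elements can only collide if a shard exceeds the
assumed row budget, which numeric_index rejects.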
--
This message was sent by Atlassian Jira
(v8.20.1#820001)