Brian Hulette created BEAM-13133:
------------------------------------
Summary: sample() imposes partitioning by index unnecessarily
Key: BEAM-13133
URL: https://issues.apache.org/jira/browse/BEAM-13133
Project: Beam
Issue Type: Task
Components: dsl-dataframe
Reporter: Brian Hulette
Assignee: Brian Hulette
I noticed that sample() requires data to be repartitioned when it's used at the
beginning of a series of dataframe commands. In practice we should be able to
sample within arbitrary partitions before combining the partitions to produce
the final result.
It looks like the root cause is that our sample expressions require
partitioning by index, rather than arbitrary partitioning.
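
To illustrate the idea, here is a minimal sketch in plain pandas (not Beam's
actual expression code) of sampling n rows without replacement from arbitrarily
partitioned data. It assumes one common approach: tag each row with a random
key, keep the n largest keys in each partition, then keep the n largest keys
across the combined candidates. The function names and the temporary "_key"
column are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd


def sample_partition(part: pd.DataFrame, n: int, rng: np.random.Generator) -> pd.DataFrame:
    """Tag each row with a random key and keep the n rows with the largest keys.

    Works on any partition; no particular index partitioning is needed.
    "_key" is a hypothetical temporary column and must not already exist.
    """
    keyed = part.assign(_key=rng.random(len(part)))
    return keyed.nlargest(n, "_key")


def combine_samples(candidates, n: int) -> pd.DataFrame:
    """Merge per-partition candidates and keep the n largest keys overall."""
    merged = pd.concat(candidates)
    return merged.nlargest(n, "_key").drop(columns="_key")


def partitioned_sample(partitions, n: int, seed=None) -> pd.DataFrame:
    """Sample n rows without replacement from a list of DataFrame partitions."""
    rng = np.random.default_rng(seed)
    candidates = [sample_partition(p, n, rng) for p in partitions]
    return combine_samples(candidates, n)
```

Because every row gets an independent random key, the n rows with the largest
keys overall are always among the per-partition top n, so the combine step
never needs rows that a partition discarded. If fewer than n rows exist in
total, the sketch simply returns all of them.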
--
This message was sent by Atlassian Jira
(v8.3.4#803005)