[
https://issues.apache.org/jira/browse/BEAM-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian Hulette updated BEAM-11881:
---------------------------------
Component/s: dsl-dataframe
> DataFrame subpartitioning order is incorrect
> --------------------------------------------
>
> Key: BEAM-11881
> URL: https://issues.apache.org/jira/browse/BEAM-11881
> Project: Beam
> Issue Type: Bug
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: P2
> Labels: dataframe-api
> Fix For: 2.29.0
>
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> Currently we've defined
> Nothing() < Index([i]) < Index([i,j]) < .. < Index() < Singleton()
> s.t. Singleton is a subpartitioning of Index(), which is a subpartitioning of
> Index([i,j]), but this is incorrect. The order should be
> Singleton() < Index([i]) < Index([i,j]) < .. < Index() < Nothing()
> s.t. every other partitioning is a subpartitioning of Singleton. This is
> logical, since Singleton will collect the largest amount of data on a single
> node, partitioning by a single index will be a little more distributed, and
> partitioning by the full Index() will be the most distributed.
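
The corrected order can be illustrated with a minimal, self-contained sketch.
This is not the Beam implementation: the classes below are hypothetical
stand-ins that reuse the names from the description, and the method name
is_subpartitioning_of and the levels argument are assumptions made for
illustration. The sketch also assumes that "A < B" in the order above means
"B is a subpartitioning of A".

```python
class Partitioning:
    """Hypothetical base class; not the Beam API."""

    def is_subpartitioning_of(self, other):
        raise NotImplementedError


class Singleton(Partitioning):
    """All data on a single node: the bottom of the corrected order."""

    def is_subpartitioning_of(self, other):
        # Singleton only refines itself.
        return isinstance(other, Singleton)


class Index(Partitioning):
    """Partitioned by some index levels; levels=None means the full Index()."""

    def __init__(self, levels=None):
        self.levels = None if levels is None else frozenset(levels)

    def is_subpartitioning_of(self, other):
        if isinstance(other, Singleton):
            # Every partitioning is a subpartitioning of Singleton.
            return True
        if isinstance(other, Index):
            if self.levels is None:
                # The full Index() refines any partitioning by index levels.
                return True
            if other.levels is None:
                return False
            # Index([i, j]) refines Index([i]): it uses a superset of levels.
            return other.levels <= self.levels
        return False


class Nothing(Partitioning):
    """No guarantee about data placement: the top of the corrected order."""

    def is_subpartitioning_of(self, other):
        return True


# Walk the chain Singleton() < Index([i]) < Index([i,j]) < Index() < Nothing().
chain = [Singleton(), Index(["i"]), Index(["i", "j"]), Index(), Nothing()]
for lower, higher in zip(chain, chain[1:]):
    assert higher.is_subpartitioning_of(lower)
    assert not lower.is_subpartitioning_of(higher)
```

Under these assumptions the relation is only a partial order: Index(["i"])
and Index(["j"]) are not comparable, so the chain in the description is one
path through it.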