gitmodimo commented on PR #44083:
URL: https://github.com/apache/arrow/pull/44083#issuecomment-2403347394
Let me pitch in. Disclaimer: I am working with @mroz45 on the same project,
which uses Arrow.
```
-bool require_sequenced_output = false)
+bool require_sequenced_output = true)
```
> Changing this default would be a breaking change and I'm not certain it's
warranted.
Without it, the Python tests fail.
> Should this only be `Ordering::Implicit` if `require_sequenced_output` is
set?
It requires deeper, breaking changes, which I think are necessary.
`require_sequenced_output` means the source should assign implicit ordering
to the batches it produces. It should therefore be moved from
`ScanNodeOptions` to `ScanOptions` so that the option can be passed to
`ScannerBuilder`, which is what Python uses. In fact, I think there should be a
unified way to assert implicit ordering in all source nodes. Alternatively, the
_need_ for implicit ordering could propagate from the nodes that require
ordering (asof_join, fetch, etc.) down the line to the source nodes (and perhaps
fail if a source node cannot provide it). There are a few related issues:
[no standardized sorting
information](https://github.com/apache/arrow/issues/34451)
[add ordering information to exec
batches](https://github.com/apache/arrow/issues/32991)
[Add AsofJoin Ordering
Assertion](https://github.com/apache/arrow/issues/20353)
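To make the propagation idea concrete, here is a minimal sketch. It is a toy model, not Acero's API: `Node`, `Ordering`, and `PropagateOrderingNeed` are hypothetical names, and it assumes every intermediate node preserves the order of its input.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model for discussion only; none of these names exist in Acero.
enum class Ordering { kUnordered, kImplicit };

struct Node {
  std::string name;
  bool is_source = false;
  bool can_provide_implicit_order = false;  // only meaningful for sources
  bool requires_ordered_input = false;      // e.g. asof_join, fetch
  std::vector<Node*> inputs;
  Ordering output_ordering = Ordering::kUnordered;
};

// Walks from `node` towards its sources. `output_must_be_ordered` is true when
// some downstream consumer needs this node's output in order.
// Returns false and fills `error` if a source cannot provide implicit ordering.
bool PropagateOrderingNeed(Node* node, bool output_must_be_ordered,
                           std::string* error) {
  assert(node != nullptr && error != nullptr);
  if (node->is_source && output_must_be_ordered &&
      !node->can_provide_implicit_order) {
    *error = "source '" + node->name + "' cannot provide implicit ordering";
    return false;
  }
  if (output_must_be_ordered) {
    node->output_ordering = Ordering::kImplicit;
  }
  // Assumption: a node that must emit ordered output needs ordered input too.
  const bool inputs_must_be_ordered =
      output_must_be_ordered || node->requires_ordered_input;
  for (Node* input : node->inputs) {
    if (!PropagateOrderingNeed(input, inputs_must_be_ordered, error)) {
      return false;
    }
  }
  return true;
}
```

Calling `PropagateOrderingNeed(sink, false, &error)` on the sink of a plan would mark every node upstream of a fetch-like node as `kImplicit`, and would report an error as soon as it reaches a source that cannot provide ordering.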
This [issue](https://github.com/apache/arrow/issues/27651) gave me the idea
that implicit ordering should be asserted by default, with an additional source
node or option to assert no ordering, which would enable performance
optimizations for "don't care" ordering cases. This would fix the following
issues:
[asof_join node not working
properly](https://github.com/apache/arrow/issues/41706)
[order is unstable](https://github.com/apache/arrow/issues/15144)
[Preserve order when writing
dataset](https://github.com/apache/arrow/issues/26818)
[ordering is weird](https://github.com/apache/arrow/issues/37542)
[dataset not preserving
ordering](https://github.com/apache/arrow/issues/39030)
[scan node not asserting
ordering](https://github.com/apache/arrow/issues/34698)
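As a rough illustration of "ordered by default, opt out for speed", the sketch below buffers out-of-order batches and releases them by index unless the caller explicitly asserts that ordering does not matter. All names (`ToyBatch`, `ToySourceOptions`, `ToySequencer`, `assert_no_ordering`) are hypothetical and only stand in for whatever the real option would be called; this is not how Acero implements sequencing.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical names for illustration only; not part of Arrow or Acero.
struct ToyBatch {
  int64_t index;  // position of the batch in the source's natural order
  int value;      // stand-in for the batch payload
};

struct ToySourceOptions {
  // Ordering is asserted unless the caller explicitly opts out.
  bool assert_no_ordering = false;
};

class ToySequencer {
 public:
  explicit ToySequencer(ToySourceOptions options) : options_(options) {}

  // Accepts a batch that may arrive out of order and appends to `out`
  // every batch that can be released at this point.
  void Push(const ToyBatch& batch, std::vector<ToyBatch>* out) {
    assert(out != nullptr);
    if (options_.assert_no_ordering) {
      out->push_back(batch);  // "don't care" mode: pass through, no buffering
      return;
    }
    pending_.emplace(batch.index, batch);
    // Release the run of consecutive batches starting at next_index_.
    for (auto it = pending_.find(next_index_); it != pending_.end();
         it = pending_.find(next_index_)) {
      out->push_back(it->second);
      pending_.erase(it);
      ++next_index_;
    }
  }

 private:
  ToySourceOptions options_;
  int64_t next_index_ = 0;
  std::map<int64_t, ToyBatch> pending_;
};
```

With default options, batches pushed with indices 2, 0, 1 come out as 0, 1, 2. With `assert_no_ordering = true` they come out in arrival order and nothing is buffered, which is where the performance gain for "don't care" cases would come from.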
We are willing to contribute to fixing the ordering issues within Acero, but we
have next to no experience with Python/Cython. Also, the scope of the problem
seems to grow with every little change. I think ordering in Acero is a bigger
topic that needs discussion.