Hi all,

We ingest our data into DataFrames with multiple naturally co-sorted
columns.
The redundant sort that precedes large SortMergeJoin operations takes
substantial time that we'd like to avoid -- since the data is already
sorted, a plain merge should be sufficient.
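
For concreteness, a minimal sketch of the pattern (paths and column names
are just placeholders):

    // Both inputs arrive already sorted by event_time, but Spark still
    // plans Exchange + Sort under the SortMergeJoin.
    val left  = spark.read.parquet("/data/left")   // pre-sorted by event_time
    val right = spark.read.parquet("/data/right")  // pre-sorted by event_time
    left.join(right, Seq("event_time")).explain()  // Sort nodes feed the join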

Is there a mechanism to avoid these sorts in general?
Do we need to persist all our frames as tables with sortBy+bucketBy to get
this optimisation?
If we use sorted tables, does the "sorted by" metadata persist past the
first join, or do we need to re-write each intermediate result to a
(possibly re-sorted?) table to keep the metadata in the catalog?
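
To be concrete about the sortBy+bucketBy approach we're asking about, this
is roughly what we'd be doing (bucket count and table names are arbitrary):

    // Persist both sides bucketed and sorted on the join key, then join the
    // resulting tables; our understanding is that the bucketing metadata lets
    // Spark skip the Exchange, and in some cases the Sort, for this join.
    left.write.bucketBy(256, "event_time").sortBy("event_time")
      .saveAsTable("left_bucketed")
    right.write.bucketBy(256, "event_time").sortBy("event_time")
      .saveAsTable("right_bucketed")
    spark.table("left_bucketed")
      .join(spark.table("right_bucketed"), Seq("event_time"))
      .explain()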

Our data actually has multiple distinct monotonically-increasing columns.
The catalog doesn't seem to be able to capture this, so a re-sort is forced
whenever we join along a different-but-still-sorted dimension.
We've prototyped hacking the external catalog to intercept cases where we
want to assert that the outputOrdering already satisfies the
requiredOrdering (see SortOrder.orderingSatisfies), but this feels like an
abuse.
The table information has been lost by that point too, so we have to infer
sortedness by comparing raw column names extracted from the SortOrder
expressions.
That breaks in cases where our processing has caused the data to *lose* its
sortedness.
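
In case it clarifies what we mean, the prototype has roughly the following
shape (heavily simplified, and expressed here as a physical-plan rule rather
than the catalog hook we actually used; knownSortedBy and the way the rule
gets wired in are our own hacks, not Spark APIs):

    import org.apache.spark.sql.catalyst.expressions.SortOrder
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.{SortExec, SparkPlan}

    // Drop a SortExec when the child's reported ordering already satisfies
    // the required ordering, or when we can assert sortedness out-of-band.
    object StripRedundantSorts extends Rule[SparkPlan] {
      override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
        case SortExec(required, _, child, _)
            if SortOrder.orderingSatisfies(child.outputOrdering, required) ||
               knownSortedBy(child, required) =>
          child
      }

      // Our hack: compare raw column names pulled out of the SortOrder
      // expressions against sortedness we track ourselves; this is the part
      // that breaks once processing has destroyed the ordering.
      private def knownSortedBy(child: SparkPlan,
                                required: Seq[SortOrder]): Boolean =
        false // placeholder
    }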

Have we missed something simple, or is our use-case more exotic than most
users'?

Thanks!
Tim


