I think pushing down a sort (or, more to the point, the case where the data is already naturally returned in sorted order on some column) could make a big difference. Probably the simplest argument that a lot of time is spent sorting (in some use cases) is the fact that it's still one of the standard benchmarks.
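The potential win here is that if every partition is already sorted, a global sort degenerates into a k-way merge: O(n log k) comparisons instead of re-sorting everything at O(n log n). A minimal sketch of that idea in plain Python (not the Spark API; the partition contents are made-up sample data):

```python
import heapq

# Each "partition" is already sorted on the sort column (here, plain ints),
# as the data source in the question below would guarantee.
partitions = [
    [1, 4, 9],
    [2, 3, 10],
    [5, 6, 7, 8],
]

# heapq.merge performs a lazy k-way merge of the pre-sorted runs,
# which is the core of the "distributed merge sort" idea.
globally_sorted = list(heapq.merge(*partitions))
print(globally_sorted)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In a distributed setting each executor would hold one sorted run and only the merge step needs cross-partition coordination.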
On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> I do not think that the data source API exposes such a thing. You can,
> however, propose that it be included in data source API v2.
>
> However, there are some caveats, because "sorted" can mean two different
> things (weak vs. strict order).
>
> Then, is a lot of time really lost because of sorting? The best thing is
> to not read data that is not needed at all (see min/max indexes in
> ORC/Parquet, or bloom filters in ORC). What is not read does not need to
> be sorted. See also predicate pushdown.
>
>> On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>
>> Cross-posting from @user.
>>
>> Hello, guys!
>>
>> I work on the implementation of a custom DataSource for the Spark
>> Data Frame API and have a question:
>>
>> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
>> sort the data inside a partition in my data source.
>>
>> Do I have a built-in option to tell Spark that the data from each
>> partition is already sorted?
>>
>> It seems that Spark could benefit from the use of already sorted
>> partitions, for example by using a distributed merge sort algorithm.
>>
>> Does it make sense to you?
>>
>> 28.11.2017 18:42, Michael Artz wrote:
>>> I'm not sure, other than retrieving from a Hive table that is already
>>> sorted. This sounds cool though; I would be interested to know this
>>> as well.
>>> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov....@gmail.com>
>>> wrote:
>>> Hello, guys!
>>> I work on the implementation of a custom DataSource for the Spark
>>> Data Frame API and have a question:
>>> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
>>> sort the data inside a partition in my data source.
>>> Do I have a built-in option to tell Spark that the data from each
>>> partition is already sorted?
>>> It seems that Spark could benefit from the use of already sorted
>>> partitions, for example by using a distributed merge sort algorithm.
>>> Does it make sense to you?

--
Twitter: https://twitter.com/holdenkarau