Sorry, s/ordered distributed/ordered distribution/g

On Mon, Dec 4, 2017 at 10:37 AM, Li Jin <ice.xell...@gmail.com> wrote:

> Just to give another data point: most of the data we use with Spark is
> sorted on disk. Having a way to allow a data source to pass an ordered
> distributed to DataFrames would be really useful for us.
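A minimal sketch of what reporting a per-partition sort order from a data source could look like. The connector interfaces used below (`SupportsReportOrdering`, `Expressions.sort`) come from DataSource V2 revisions newer than this thread and should be verified against the Spark release in use; the scan class and column name are invented for illustration.

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportOrdering}
import org.apache.spark.sql.types.StructType

// Hypothetical scan for a source whose partitions are already sorted by
// `some_column`. By reporting that ordering, Spark can skip the
// per-partition sort for a query such as
// `SELECT * FROM table1 ORDER BY some_column`.
class AlreadySortedScan(schema: StructType) extends Scan with SupportsReportOrdering {

  override def readSchema(): StructType = schema

  // Declare that every partition produced by this scan is sorted
  // ascending by `some_column`.
  override def outputOrdering(): Array[SortOrder] = Array(
    Expressions.sort(Expressions.column("some_column"), SortDirection.ASCENDING)
  )

  // toBatch(), planInputPartitions(), etc. are omitted; a real scan must
  // still describe how the partitions are actually read.
}
```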
On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Hello, guys.
>
> Thank you for the answers!
>
> > I think pushing down a sort ... could make a big difference.
> > You can, however, propose it for inclusion in the data source API v2.
>
> Jörn, are you talking about this JIRA issue? -
> https://issues.apache.org/jira/browse/SPARK-15689
> Is there any additional documentation I have to read before making a
> proposal?

On 04.12.2017 14:05, Holden Karau wrote:

> I think pushing down a sort (or, really, more the case where the data is
> already naturally returned in sorted order on some column) could make a
> big difference. Probably the simplest argument that a lot of time is
> spent sorting (in some use cases) is the fact that it is still one of
> the standard benchmarks.

On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> I do not think that the data source API exposes such a thing. You can,
> however, propose it for inclusion in the data source API v2.
>
> There are some caveats, though, because "sorted" can mean two different
> things (weak vs. strict order).
>
> Then again, is a lot of time really lost because of sorting? The best
> thing is to not read data that is not needed at all (see min/max indexes
> in ORC/Parquet or bloom filters in ORC). What is not read does not need
> to be sorted. See also predicate pushdown.

On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Cross-posting from @user.
>
> Hello, guys!
>
> I am working on an implementation of a custom DataSource for the Spark
> DataFrame API and have a question:
>
> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort
> the data inside each partition in my data source.
>
> Is there a built-in option to tell Spark that the data in each partition
> is already sorted?
>
> It seems that Spark could benefit from already sorted partitions, for
> example by using a distributed merge-sort algorithm.
>
> Does it make sense to you?

On 28.11.2017 18:42, Michael Artz wrote:

> I'm not sure, other than retrieving from a Hive table that is already
> sorted. This sounds cool though; I would be interested to know this as
> well.
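Michael's workaround of reading from a table that is already sorted can be set up from Spark itself by persisting a bucketed, sorted table. A rough sketch, with invented paths, table name, and column name:

```scala
import org.apache.spark.sql.SparkSession

object BucketedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-table-sketch")
      .master("local[*]")
      .getOrCreate()

    // Persist the data bucketed and sorted by `some_column`. The layout is
    // recorded in the catalog, so later reads know about it.
    spark.read.parquet("/tmp/table1_raw")
      .write
      .bucketBy(8, "some_column")
      .sortBy("some_column")
      .mode("overwrite")
      .saveAsTable("table1_bucketed")

    // Joins or aggregations on `some_column` between identically bucketed
    // tables can avoid the shuffle (and, in some cases, the sort); check
    // the physical plan for the absence of an Exchange.
    val left  = spark.table("table1_bucketed")
    val right = spark.table("table1_bucketed")
    left.join(right, "some_column").explain()
  }
}
```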
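To illustrate Jörn's point that the cheapest data is the data you never read, here is a small sketch (paths and column names invented): the filter is pushed down to the Parquet reader, and writing the files sorted within partitions keeps each row group's min/max statistics narrow so more row groups can be skipped.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pushdown-sketch")
      .master("local[*]")
      .getOrCreate()

    // Sorting within partitions before writing keeps each Parquet row
    // group's min/max range for `some_column` narrow.
    spark.read.parquet("/tmp/table1_raw")
      .sortWithinPartitions("some_column")
      .write
      .mode("overwrite")
      .parquet("/tmp/table1_sorted")

    // The filter is pushed down to the Parquet reader (it shows up as
    // PushedFilters in the physical plan), so row groups whose statistics
    // cannot match are never read, let alone sorted.
    spark.read.parquet("/tmp/table1_sorted")
      .filter("some_column >= 100 AND some_column < 200")
      .explain()
  }
}
```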