Hello, guys.

Thank you for the answers!

> I think pushing down a sort ... could make a big difference.
> You can, however, propose it for inclusion in Data Source API 2.

Jörn, are you talking about this JIRA issue?
https://issues.apache.org/jira/browse/SPARK-15689
Is there any additional documentation I should read before making a proposal?



04.12.2017 14:05, Holden Karau wrote:
I think pushing down a sort (or, really, handling the case where the data is already naturally returned in sorted order on some column) could make a big difference. Probably the simplest argument that a lot of time is spent sorting (in some use cases) is the fact that it is still one of the standard benchmarks.
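To illustrate the idea in plain Python (a sketch only, not Spark API code; the partition contents below are made up): if each partition is already sorted on the ordering column, a global ordering can be produced with a streaming k-way merge instead of a full re-sort.

```python
import heapq

# Hypothetical per-partition values of the ordering column,
# each partition already sorted by the data source.
partitions = [
    [1, 4, 7],
    [2, 3, 9],
    [5, 6, 8],
]

# A k-way merge touches each element once and never re-sorts:
# O(n log k) work instead of the O(n log n) of a full sort.
globally_sorted = list(heapq.merge(*partitions))

print(globally_sorted)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This is roughly the saving a "partitions are already sorted" hint would unlock: the expensive per-partition sort becomes a cheap merge at the final exchange.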

On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:

    I do not think that the Data Source API exposes such a thing. You can,
however, propose it for inclusion in Data Source API 2.

    However, there are some caveats, because "sorted" can mean two different
things (weak vs. strict order).

    Also, is a lot of time really lost to sorting? The best thing is not to
read data that is not needed at all (see min/max indexes in ORC/Parquet, or
bloom filters in ORC). What is not read
    does not need to be sorted. See also predicate pushdown.
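Jörn's point about min/max indexes can be sketched in plain Python (the stripe statistics below are hypothetical, not the actual ORC/Parquet reader API): for a predicate like `x > 100`, the reader can skip any stripe whose recorded max is at or below the bound, and the skipped rows then never need to be sorted either.

```python
# Hypothetical stripe-level statistics, of the kind ORC/Parquet
# footers store for each column.
stripes = [
    {"name": "stripe-0", "min": 1,   "max": 50},
    {"name": "stripe-1", "min": 40,  "max": 120},
    {"name": "stripe-2", "min": 130, "max": 200},
]

def stripes_to_read(stripes, lower_bound):
    """Keep only stripes that *may* contain rows with value > lower_bound."""
    return [s["name"] for s in stripes if s["max"] > lower_bound]

print(stripes_to_read(stripes, 100))  # → ['stripe-1', 'stripe-2']
```

stripe-0 is skipped entirely because its max (50) rules out any match, which is exactly the "what is not read does not need to be sorted" saving.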

     > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
     >
     > Cross-posting from @user.
     >
     > Hello, guys!
     >
     > I am working on an implementation of a custom DataSource for the Spark
DataFrame API and have a question:
     >
     > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
sort the data inside each partition in my data source.
     >
     > Do I have a built-in option to tell Spark that the data from each
partition is already sorted?
     >
     > It seems that Spark could benefit from already-sorted partitions,
     > for example by using a distributed merge-sort algorithm.
     >
     > Does it make sense to you?
     >
     >
     > 28.11.2017 18:42, Michael Artz wrote:
     >> I'm not sure, other than retrieving from a Hive table that is already
sorted. This sounds cool, though; I would be interested to know this as well.
     >
     > ---------------------------------------------------------------------
     > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
     >





--
Twitter: https://twitter.com/holdenkarau

