Hello, guys.

Thank you for the answers!

> I think pushing down a sort ... could make a big difference.
> You can, however, propose it for inclusion in Data Source API 2.

Jörn, are you talking about this JIRA issue?
https://issues.apache.org/jira/browse/SPARK-15689
Is there any additional documentation I should read before making a proposal?



04.12.2017 14:05, Holden Karau wrote:
I think pushing down a sort (or, really, handling the case where the data is already naturally returned in sorted order on some column) could make a big difference. Probably the simplest argument that a lot of time is spent sorting (in some use cases) is the fact that it is still one of the standard benchmarks.
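To illustrate the idea in plain Python (a sketch only, not Spark API code; the partition contents below are made up): if each partition is already sorted on the ordering column, a global ordering can be produced with a streaming k-way merge instead of a full re-sort.

```python
import heapq

# Hypothetical per-partition values of the ordering column,
# each partition already sorted by the data source.
partitions = [
    [1, 4, 7],
    [2, 3, 9],
    [5, 6, 8],
]

# A k-way merge touches each element once and never re-sorts:
# O(n log k) work instead of the O(n log n) of a full sort.
globally_sorted = list(heapq.merge(*partitions))

print(globally_sorted)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This is roughly the saving a "partitions are already sorted" hint would unlock: the expensive per-partition sort becomes a cheap merge at the final exchange.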

On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:

    I do not think that the Data Source API exposes such a thing. You can,
however, propose it for inclusion in Data Source API 2.

    However, there are some caveats, because "sorted" can mean two different
things (weak vs. strict order).

    Also, is a lot of time really lost to sorting? The best thing is not to
read data that is not needed at all (see min/max indexes in ORC/Parquet, or
bloom filters in ORC). What is not read
    does not need to be sorted. See also predicate pushdown.
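Jörn's point about min/max indexes can be sketched in plain Python (the stripe statistics below are hypothetical, not the actual ORC/Parquet reader API): for a predicate like `x > 100`, the reader can skip any stripe whose recorded max is at or below the bound, and the skipped rows then never need to be sorted either.

```python
# Hypothetical stripe-level statistics, of the kind ORC/Parquet
# footers store for each column.
stripes = [
    {"name": "stripe-0", "min": 1,   "max": 50},
    {"name": "stripe-1", "min": 40,  "max": 120},
    {"name": "stripe-2", "min": 130, "max": 200},
]

def stripes_to_read(stripes, lower_bound):
    """Keep only stripes that *may* contain rows with value > lower_bound."""
    return [s["name"] for s in stripes if s["max"] > lower_bound]

print(stripes_to_read(stripes, 100))  # → ['stripe-1', 'stripe-2']
```

stripe-0 is skipped entirely because its max (50) rules out any match, which is exactly the "what is not read does not need to be sorted" saving.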

     > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
     >
     > Cross-posting from @user.
     >
     > Hello, guys!
     >
     > I am working on an implementation of a custom DataSource for the Spark
DataFrame API and have a question:
     >
     > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
sort the data inside each partition in my data source.
     >
     > Do I have a built-in option to tell Spark that the data from each
partition is already sorted?
     >
     > It seems that Spark could benefit from already-sorted partitions,
     > for example by using a distributed merge-sort algorithm.
     >
     > Does it make sense to you?
     >
     >
     > 28.11.2017 18:42, Michael Artz wrote:
     >> I'm not sure, other than retrieving from a Hive table that is already
sorted. This sounds cool, though; I would be interested to know this as well.
     >
     > ---------------------------------------------------------------------
     > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
     >





--
Twitter: https://twitter.com/holdenkarau

