Re: Help explaining explain() after DataFrame join reordering

2018-06-05 Thread Matteo Cossu
Hello,

As explained here, the join order can be changed by the optimizer. The
difference introduced in Spark 2.2 is that the reordering is based on
statistics instead of heuristics, which can appear "random" and in some
cases decrease performance.
If you want more control over the join order, you can define your own
Rule; there is an example here.
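
For illustration, a minimal sketch of how such a rule can be plugged in
through spark.experimental.extraOptimizations. The rule name and its no-op
body are placeholders; a real rule would pattern-match on Join nodes and
rebuild them in the desired order:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: visits every Join node and returns the plan unchanged.
// A real implementation would rebuild the Join tree in the wanted order.
object KeepMyJoinOrder extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case j: Join =>
      logInfo(s"Join visited: ${j.simpleString}")
      j
  }
}

val spark = SparkSession.builder().appName("custom-rule").getOrCreate()
// User-provided rules run in their own batch after the built-in ones.
spark.experimental.extraOptimizations = Seq(KeepMyJoinOrder)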


Best,

Matteo




Help explaining explain() after DataFrame join reordering

2018-06-01 Thread Mohamed Nadjib MAMI
Dear Sparkers,

I'm loading data from 5 sources into DataFrames (using official
connectors): Parquet, MongoDB, Cassandra, MySQL and CSV. I'm then joining
those DataFrames in two different orders:
- mongo * cassandra * jdbc * parquet * csv (random order).
- parquet * csv * cassandra * jdbc * mongodb (optimized order).
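
For concreteness, the two chains look roughly like this (the DataFrame
variables and the join key "id" stand in for the real ones):

// Each source loaded into its own DataFrame beforehand; equi-joins on "id".
val randomOrder = mongoDF
  .join(cassandraDF, "id")
  .join(jdbcDF, "id")
  .join(parquetDF, "id")
  .join(csvDF, "id")

val optimizedOrder = parquetDF
  .join(csvDF, "id")
  .join(cassandraDF, "id")
  .join(jdbcDF, "id")
  .join(mongoDF, "id")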

The first follows a random order, whereas the second I decided based on
some optimization techniques (I can provide details for interested readers
if needed).

After evaluating on increasing data sizes, the optimization techniques I
developed didn't improve performance very noticeably. I inspected the
logical/physical plans of the final joined DataFrame (using
`explain(true)`). The 1st order was respected; the 2nd, it turned out,
wasn't, and MongoDB was queried first.

However, that is only how it seemed to me; I'm not quite confident reading
the plans returned by `explain(true)`. Could someone help explain the
`explain(true)` output? (pasted in this gist). Is there a way we could
enforce the given order?
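
For reference, this is how I'm printing the plans; the four sections appear
in this order, and I compared the join tree between the analyzed and
optimized logical plans to spot the reordering:

//   == Parsed Logical Plan ==    the query as written
//   == Analyzed Logical Plan ==  columns/types resolved, join order still as written
//   == Optimized Logical Plan == after optimizer rules (reordering shows up here)
//   == Physical Plan ==          the operators that actually run
optimizedOrder.explain(true)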

I'm using Spark 2.1, so I think it doesn't include the new cost-based
optimizations (introduced in Spark 2.2).
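
(For completeness: on Spark 2.2+ the cost-based optimizer must be enabled
explicitly and needs collected statistics; a sketch, with a placeholder
table name:)

spark.conf.set("spark.sql.cbo.enabled", "true")             // Spark 2.2+ only
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true") // statistics-based join reordering
// Statistics are gathered per catalog table, for example:
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS id")
// Note: these statistics live in the catalog, so DataFrames read straight
// from external connectors (MongoDB, Cassandra, ...) won't have them unless
// first saved as tables.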

Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
Mohamed Nadjib Mami
Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University
About me!
LinkedIn