Example to handle data skewness

2018-01-29 Thread Sejal Chauhan
Hi Dev community,

A large data skew is causing memory problems in my cluster. I was
wondering whether anyone has tackled this with a custom hash function and
got it working on the same cluster configuration.

Thanks,
Sejal
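
One common way to approach the question above is key salting rather than a new hash function as such. The following is a minimal sketch in plain Python (no Spark), assuming a single hot key dominates the data. The partition count, the salt count, the CRC32-based hash and all function names are hypothetical choices made for illustration; none of them come from the original message.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of partitions
NUM_SALTS = 8       # hypothetical number of salt values per key


def stable_hash(value):
    """Deterministic hash (CRC32), so results do not vary between runs."""
    return zlib.crc32(str(value).encode("utf-8"))


def plain_partition(key):
    """Ordinary hash partitioning: all records of one key land together."""
    return stable_hash(key) % NUM_PARTITIONS


def salted_partition(key, rng):
    """Append a random salt to the key before hashing, so the records of
    one key are spread over up to NUM_SALTS different partitions."""
    salt = rng.randrange(NUM_SALTS)
    return stable_hash(f"{key}#{salt}") % NUM_PARTITIONS


def partition_sizes(keys, partitioner):
    """Count how many records each partition receives."""
    return Counter(partitioner(k) for k in keys)


# Skewed input: one hot key accounts for 90% of the records.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

rng = random.Random(42)
plain = partition_sizes(keys, plain_partition)
salted = partition_sizes(keys, lambda k: salted_partition(k, rng))

print("largest partition, plain hash :", max(plain.values()))
print("largest partition, salted keys:", max(salted.values()))
```

The price of salting is a second step: an aggregation has to be done once per salted key and then again per original key, and in a join the other side has to be replicated once per salt value. Whether that trade-off pays off on a given cluster is something to measure rather than assume.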


BroadcastHashJoinExec cleanup

2018-01-29 Thread Marco Gaido
Hello,

Looking at BroadcastHashJoinExec, it seems to me that it never destroys its
broadcast variables, and I think this can cause problems like SPARK-22575.

However, when I tried to add a "cleanup" step to destroy the variable, I saw
some test failures because something was trying to access the destroyed
broadcast variable.

I think the reason for this lies in BroadcastExchangeExec, which can hand
out the same broadcast relation if it is invoked two or more times.

So my questions are: first of all, am I right, or am I missing something?
And if I am right, in which cases can a BroadcastExchangeExec be used more
than once (I can't think of any)?

Thanks,
Marco
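
The failure mode described above can be reproduced with a toy model. This is a sketch in plain Python, not Spark's code: the classes ToyBroadcast, ToyBroadcastExchange and ToyBroadcastHashJoin are hypothetical stand-ins, and the sketch simply assumes that two join operators end up sharing one exchange instance that caches its broadcast relation. Whether and when that happens in Spark is exactly the open question in the message.

```python
class DestroyedError(RuntimeError):
    """Raised when a destroyed broadcast value is accessed."""


class ToyBroadcast:
    """Hypothetical stand-in for a broadcast variable."""

    def __init__(self, value):
        self._value = value
        self._destroyed = False

    @property
    def value(self):
        if self._destroyed:
            raise DestroyedError("broadcast was already destroyed")
        return self._value

    def destroy(self):
        self._destroyed = True
        self._value = None


class ToyBroadcastExchange:
    """Builds the broadcast relation once and returns the same object
    on every later call (the assumption this sketch rests on)."""

    def __init__(self, build_rows):
        self._build_rows = build_rows
        self._cached = None

    def execute_broadcast(self):
        if self._cached is None:
            self._cached = ToyBroadcast(dict(self._build_rows))
        return self._cached


class ToyBroadcastHashJoin:
    """Joins streamed rows against the broadcast relation and, if asked,
    destroys the relation afterwards."""

    def __init__(self, exchange, cleanup):
        self._exchange = exchange
        self._cleanup = cleanup

    def run(self, stream_rows):
        relation = self._exchange.execute_broadcast()
        try:
            table = relation.value
            return [(k, v, table[k]) for k, v in stream_rows if k in table]
        finally:
            if self._cleanup:
                relation.destroy()


# Two joins share one exchange; the first one cleans up after itself.
exchange = ToyBroadcastExchange([(1, "a"), (2, "b")])
first = ToyBroadcastHashJoin(exchange, cleanup=True)
second = ToyBroadcastHashJoin(exchange, cleanup=True)

first_result = first.run([(1, "x"), (3, "y")])
try:
    second.run([(2, "z")])
    second_failed = False
except DestroyedError:
    second_failed = True

print(first_result, second_failed)
```

In this model the cleanup is only safe if the last consumer performs it, which would need something like reference counting or a single owner for the relation. That is a design observation about the toy model, not a statement about how Spark should be changed.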


Nondeterministic Catalyst expressions -- trait and property?!

2018-01-29 Thread Jacek Laskowski
Hi,

Why does Spark SQL need both the Nondeterministic trait [1] and the
deterministic property [2]? That must be confusing for others too, not only
for me, right?

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L299

[2]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala?utf8=%E2%9C%93#L83

Given the nearly identical names, I suspect the Nondeterministic trait does
more than its name says (and more than the property alone could express).
Are there any plans to "fix" this (e.g. by renaming the trait)?

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
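
One way to picture the distinction asked about above is the following sketch in plain Python. It is a simplified model written for illustration, not the Catalyst source: the class and method names mirror the Scala ones only loosely, and the Rand and Add classes here are hypothetical toy versions. The assumption being illustrated is that the property answers "is this whole expression tree deterministic?", while the trait marks an expression that is itself a source of nondeterminism and needs per-partition initialization before it can be evaluated.

```python
import random


class Expression:
    """Toy expression node with a derived 'deterministic' property."""

    def __init__(self, *children):
        self.children = list(children)

    @property
    def deterministic(self):
        # Derived: a tree is deterministic only if all children are.
        return all(child.deterministic for child in self.children)

    def eval(self, row=None):
        raise NotImplementedError


class Literal(Expression):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def eval(self, row=None):
        return self.value


class Add(Expression):
    def eval(self, row=None):
        return sum(child.eval(row) for child in self.children)


class Nondeterministic(Expression):
    """Toy version of the trait: pins the property to False AND adds an
    initialize-before-eval lifecycle."""

    _initialized = False

    @property
    def deterministic(self):
        return False

    def initialize(self, partition_index):
        self._initialize_internal(partition_index)
        self._initialized = True

    def eval(self, row=None):
        if not self._initialized:
            raise RuntimeError("expression must be initialized before eval")
        return self._eval_internal(row)


class Rand(Nondeterministic):
    def __init__(self, seed):
        super().__init__()
        self.seed = seed

    def _initialize_internal(self, partition_index):
        # State depends on the partition, hence the explicit lifecycle.
        self._rng = random.Random(self.seed + partition_index)

    def _eval_internal(self, row):
        return self._rng.random()


rand = Rand(seed=7)
expr = Add(Literal(1), rand)

# Add is not a Nondeterministic node, yet its property is False
# because one of its children is nondeterministic.
print(isinstance(expr, Nondeterministic), expr.deterministic)

rand.initialize(partition_index=0)
print(expr.eval())
```

Read this way, the trait does carry more than the property: the property propagates up the tree, while the trait attaches state and an initialization step to the node where the nondeterminism originates.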