Hi John,
Did you also set spark.sql.planner.externalSort to true? With that
configuration you probably won't see executors being lost. For now, maybe
you can manually split the query into two parts, one for the skewed keys
and one for the remaining records, and then union the results of the two
parts together.
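Something along these lines, as a rough sketch (the fact/dim tables, the
d.name column, and 42 as the skewed key are all made up for illustration):

  // Handle the skewed key and the remaining keys as two separate joins.
  val skewed = sqlContext.sql(
    "SELECT f.*, d.name FROM fact f JOIN dim d ON f.key = d.key WHERE f.key = 42")
  val rest = sqlContext.sql(
    "SELECT f.*, d.name FROM fact f JOIN dim d ON f.key = d.key WHERE f.key <> 42")

  // Union the two parts back together.
  val result = skewed.unionAll(rest)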
Thanks,
Could it be composed, maybe? A general version, and then a SQL version
that exploits the additional info/abilities available there and uses the
general version internally...
I assume the SQL version can benefit from the logical-phase optimization
to pick join details. Or is there more?
On Tue, Jun
>
> this would be a great addition to spark, and ideally it belongs in spark
> core not sql.
>
I agree that this would be a great addition, but we would likely want a
specialized SQL implementation for performance reasons.
A skew join (where the dominant key is spread across multiple executors)
is pretty standard in other frameworks; see, for example, Scalding:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala
This would be a great addition to Spark, and ideally it belongs in Spark
core, not SQL.
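The core of the trick, as a sketch in plain Spark core (toy data and a
hard-coded replication factor; a real implementation would sample the key
distribution first, which is roughly what Scalding does):

  import scala.util.Random
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("skew-join-sketch").setMaster("local[*]"))

  // Toy data: the left side is heavily skewed on key "a"; the right side is small.
  val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
  val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))

  val replication = 4

  // Spread each skewed-side record across `replication` salted keys.
  val salted = left.map { case (k, v) => ((k, Random.nextInt(replication)), v) }

  // Duplicate the small side once per salt so every salted key finds its match.
  val replicated = right.flatMap { case (k, w) =>
    (0 until replication).map(i => ((k, i), w))
  }

  // Join on the salted key, then drop the salt to recover plain (key, pair) rows.
  val joined = salted.join(replicated).map { case ((k, _), vw) => (k, vw) }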
On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust wrote:
>> 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use
>> 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is
>> available, would any of the JOIN enhancements help this situation?
>>
>
> I would try Spark 1.4 after running "SET spark.sql.planner.sortMergeJoin=true".
>
> 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use
> 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is
> available, would any of the JOIN enhancements help this situation?
>
I would try Spark 1.4 after running "SET spark.sql.planner.sortMergeJoin=true".
Greetings,
I am trying to implement a classic star-schema ETL pipeline using Spark
SQL 1.2.1. I am running into problems with shuffle joins for those
dimension tables that have skewed keys and are too large for Spark to
broadcast.
I have a few questions:
1. Can I split my queries so a un