Issues with Beam SQL on Spark

Kai Jiang Mon, 23 Jul 2018 23:19:06 -0700

Hi all,

I met an issue when I ran Beam SQL on Spark. I want to check and see if
anyone has same issue with me. I believe let beam sql running on spark is
important. If you encountered same problem, it will be really helpful if
you could give some inputs.

Context:
I setup TPC framework to run sql on spark. Code
<https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
is simple which just ingests csv data and apply Sql on that. Gradle
<https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
setting
includes `runner-spark` and necessary libraries. Exception Stack trace
<https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a> shows
some details. However, same code can running on Flink and Dataflow
successfully.

Investigations:
BEAM-3386 <https://issues.apache.org/jira/browse/BEAM-3386> also describes
the similar issue I have. It took me some time on investigating it. I
guess there should be a version conflict between Calcite library in Spark
and Beam SQL repackaged Calcite. The version of Calcite library Spark ( * -
2.3.1) used is very old (1.2.0-incubating).

After packaging fat jar and submitting it to Spark, Spark registered both
old version's calcite jdbc driver and Beam's repackaged jdbc driver in

registeredDrivers(DriverManager.java#L294
<https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
Jdbc's DriverManager always connects to old version calcite's jdbc in
spark instead of beam's repackaged calcite.

Looking into Line DriverManager.java#L556
<https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
and insert a breakpoint, aClass =
Class.forName(driver.getClass().getName(), true, classLoader);

driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
classLoader only has class 'org.apache.beam.**' and
'org.apache.beam.repackaged.beam_***'. (There is no path of class
'org.apache.calcite.*')

Oddly, aClass is assigned with Class "org.apache.calcite.jdbc.Driver". I
think it should raise an exception and be skipped. Actually, It did not. So
this spark's calcite jdbc driver has been connected. All logic afterwards
goes to spark's calcite classpath. I believe that's pivot point.

Potentially solutions:
*1.* Figure out why DriverManager.java#L556
<https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
does
not throw exception.

I guess it is the best option.

2. Upgrade Spark' calcite.

It is not a good option because old calcite version affects many spark
versions.

3. Not using repackage for calcite library.

I tried. I built fat jar with non-repackaged calcite. But, Spark is still
using its own calcite.

Plus, I am curious if there is any specific reason we need to use
repackage strategy for Calcite. @Mingmin Xu <[email protected]>

Thanks for reading!

Best,
Kai
ᐧ

Issues with Beam SQL on Spark

Reply via email to