I don't think this is something that calls for changes to DriverManager.
Beam is causing the problem by relocating Calcite's packages without also
updating the global state Calcite creates.
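
To make the "global state" concrete, here is a minimal sketch (the driver
class names in the comments are the ones observed in this thread; everything
else is illustrative). With both Calcites on the classpath, two different
driver classes claim the same URL prefix in DriverManager's single,
JVM-wide registry:

    import java.sql.Driver;
    import java.sql.DriverManager;
    import java.util.Enumeration;

    public class CalcitePrefixClash {
      public static void main(String[] args) throws Exception {
        // Relocation moves the driver class, but the "jdbc:calcite:" URL
        // prefix it answers to is shared DriverManager state that shading
        // does not rewrite.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
          Driver d = drivers.nextElement();
          // Expected: both org.apache.calcite.jdbc.Driver (Spark's) and the
          // relocated org.apache.beam.repackaged.* driver print "true".
          System.out.println(d.getClass().getName()
              + " accepts jdbc:calcite: -> " + d.acceptsURL("jdbc:calcite:"));
        }
      }
    }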

Andrew

On Tue, Jul 24, 2018 at 12:03 PM Kai Jiang <[email protected]> wrote:

> Thanks Andrew! That's really helpful. I'll try shading calcite with the
> "jdbc:calcite" string rewritten.
> I also had a look at the DriverManager docs. Do you think setting the
> repackaged jdbc driver property, like below, would help?
>  jdbc.drivers=org.apache.beam.repackaged.beam.
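>
> For reference, a rough sketch of how that property is consumed (JDK 8
> behavior; the driver class name below is a made-up placeholder, since the
> repackaged name above is truncated):
>
>     // jdbc.drivers is a colon-separated list of driver class names that
>     // DriverManager loads once, during its static initialization, via the
>     // system class loader. It must be set before DriverManager is first
>     // touched, e.g. with -Djdbc.drivers=... on the command line, or very
>     // early in main():
>     System.setProperty("jdbc.drivers",
>         "org.apache.beam.repackaged.PLACEHOLDER.calcite.jdbc.Driver");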
>
> Best,
> Kai
>
> On 2018/07/24 16:56:50, Andrew Pilloud <[email protected]> wrote:
> > Looks like calcite isn't easily repackageable. This issue can be fixed
> > either in our shading (by also rewriting the "jdbc:calcite:" string when
> > we shade calcite) or in calcite (by not using the driver manager to
> > connect between calcite modules).
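> >
> > Concretely, what rewriting the string would mean (the relocated prefix
> > below is only illustrative, not something our shading produces today):
> >
> >     // Calcite's driver declares the URL prefix it answers to, roughly:
> >     //   protected String getConnectStringPrefix() { return "jdbc:calcite:"; }
> >     // and Calcite modules reconnect to each other through DriverManager
> >     // using that same literal. Relocating the classes alone leaves the
> >     // shaded and unshaded drivers both claiming "jdbc:calcite:", so the
> >     // shade step would also need to rewrite the string constant, so that
> >     // internal reconnects become e.g.:
> >     DriverManager.getConnection("jdbc:beam-calcite:", new Properties());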
> >
> > Andrew
> >
> > On Mon, Jul 23, 2018 at 11:18 PM Kai Jiang <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I ran into an issue running Beam SQL on Spark and want to check whether
> > > anyone has hit the same thing. I believe getting Beam SQL running on
> > > Spark is important, so if you have encountered the same problem, any
> > > input would be really helpful.
> > >
> > > Context:
> > > I set up a TPC framework to run SQL on Spark. The code
> > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
> > > is simple: it just ingests CSV data and applies SQL to it. The Gradle
> > > setup
> > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
> > > includes `runner-spark` and the necessary libraries. The exception
> > > stack trace
> > > <https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a>
> > > shows the details. However, the same code runs successfully on Flink
> > > and Dataflow.
> > >
> > > Investigations:
> > > BEAM-3386 <https://issues.apache.org/jira/browse/BEAM-3386> describes a
> > > similar issue to mine. It took me some time to investigate. I suspect a
> > > version conflict between the Calcite library in Spark and the Calcite
> > > that Beam SQL repackages. The Calcite version that Spark (up through
> > > 2.3.1) uses is very old (1.2.0-incubating).
> > >
> > > After packaging the fat jar and submitting it to Spark, Spark registered
> > > both the old Calcite jdbc driver and Beam's repackaged jdbc driver in
> > > registeredDrivers (DriverManager.java#L294
> > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
> > > JDBC's DriverManager always connects to Spark's old Calcite jdbc driver
> > > instead of Beam's repackaged Calcite.
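> > >
> > > A sketch of why both end up registered (JDK 8 behavior; the relocated
> > > Beam class name is elided above, so it stays elided here):
> > >
> > >     public class WhyTwoDrivers {
> > >       public static void main(String[] args) throws Exception {
> > >         // Loading a driver class runs its static initializer, which
> > >         // self-registers via DriverManager.registerDriver(new Driver()).
> > >         // registeredDrivers is a single JVM-wide list, so with both
> > >         // Calcites on the classpath it holds two drivers that answer to
> > >         // the same "jdbc:calcite:" prefix.
> > >         Class.forName("org.apache.calcite.jdbc.Driver");   // Spark's copy
> > >         // Class.forName("org.apache.beam.repackaged...");  // Beam's copy
> > >       }
> > >     }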
> > >
> > >
> > > Looking at DriverManager.java#L556
> > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>,
> > > I inserted a breakpoint at
> > > aClass = Class.forName(driver.getClass().getName(), true, classLoader);
> > > and observed:
> > >
> > > driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
> > > classLoader only contains classes under 'org.apache.beam.**' and
> > > 'org.apache.beam.repackaged.beam_***'. (There is no
> > > 'org.apache.calcite.*' class on its path.)
> > >
> > > Oddly, aClass is still assigned the class "org.apache.calcite.jdbc.Driver".
> > > I expected Class.forName to raise an exception here so this driver would
> > > be skipped, but it did not. So Spark's Calcite jdbc driver gets
> > > connected, and all logic afterwards goes through Spark's Calcite
> > > classpath. I believe that is the pivot point.
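> > >
> > > For context, the check around L556 is essentially the following
> > > (paraphrased from the JDK 8 source linked above). My guess as to why no
> > > exception is thrown: Class.forName consults the given loader *and* its
> > > parents through standard delegation, and Spark's Calcite sits on a
> > > parent loader:
> > >
> > >     // paraphrase of java.sql.DriverManager.isDriverAllowed (JDK 8)
> > >     Class<?> aClass = null;
> > >     try {
> > >       // resolves via parent delegation even though classLoader itself
> > >       // only sees org.apache.beam.* classes
> > >       aClass = Class.forName(driver.getClass().getName(), true, classLoader);
> > >     } catch (Exception ex) {
> > >       // reached only when no loader in the chain has the class
> > >     }
> > >     boolean allowed = (aClass == driver.getClass());
> > >     // true here, so DriverManager hands back Spark's old Calcite driver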
> > >
> > > Potential solutions:
> > > *1.* Figure out why DriverManager.java#L556
> > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
> > > does not throw an exception.
> > >
> > > I think this is the best option.
> > >
> > > 2. Upgrade Spark's Calcite.
> > >
> > > This is not a good option, because the old Calcite version affects many
> > > Spark versions.
> > >
> > > 3. Stop repackaging the Calcite library.
> > >
> > > I tried this: I built a fat jar with non-repackaged Calcite, but Spark
> > > still used its own Calcite.
> > >
> > > Also, I am curious whether there is a specific reason we need the
> > > repackaging strategy for Calcite. @Mingmin Xu <[email protected]>
> > >
> > >
> > > Thanks for reading!
> > >
> > > Best,
> > > Kai
> > >
> >
>
