That sounds great. I'm subscribed to that list as well, so I'll keep an eye
out for your email.

Andrew

On Sun, Jul 29, 2018 at 9:07 PM Kai Jiang <[email protected]> wrote:

> Hi Andrew,
>
> I tried replacing "jdbc:calcite" with "jdbc:beam" in Calcite and
> re-shading. After that, Beam SQL can run on Spark.
> However, I didn't find a way to rewrite that string while shading the
> Calcite library, so I think the second method you mentioned is the
> feasible one.
> I'll forward this thread to dev@calcite to see if we can connect
> between Calcite modules without using the DriverManager.
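>
> As a rough sketch of that second approach (my assumption, not actual
> Calcite code; the shaded class name below is a hypothetical
> placeholder), connecting through a driver instance directly avoids the
> DriverManager lookup entirely:
>
>     import java.sql.Connection;
>     import java.sql.Driver;
>     import java.util.Properties;
>
>     // Instantiate the (relocated) driver class directly and call
>     // connect() on it, so DriverManager never gets the chance to hand
>     // back Spark's copy of the Calcite driver. The class name is a
>     // hypothetical placeholder for the shaded driver.
>     static Connection connectDirectly() throws Exception {
>       Driver driver = (Driver) Class
>           .forName("org.apache.beam.repackaged.beam_sdks_java.org.apache.calcite.jdbc.Driver")
>           .getDeclaredConstructor()
>           .newInstance();
>       return driver.connect("jdbc:calcite:", new Properties());
>     }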
>
> Best,
> Kai
>
> On Tue, Jul 24, 2018 at 1:04 PM Kai Jiang <[email protected]> wrote:
>
>> Thank you, Andrew! I will look into whether it is feasible to rewrite
>> "jdbc:calcite:" in Beam's repackaged Calcite.
>>
>> Best,
>> Kai
>>
>> On 2018/07/24 19:08:17, Andrew Pilloud <[email protected]> wrote:
>> > I don't really think this is something that involves changes to
>> > DriverManager. Beam is causing the problem by relocating Calcite's
>> > path but not also modifying the global state it creates.
>> >
>> > Andrew
>> >
>> > On Tue, Jul 24, 2018 at 12:03 PM Kai Jiang <[email protected]> wrote:
>> >
>> > > Thanks Andrew! That's really helpful. I'll try shading Calcite
>> > > while also rewriting the "jdbc:calcite" string.
>> > > I also had a look at the DriverManager docs. Do you think setting a
>> > > property that lists the repackaged JDBC driver, like the one below,
>> > > would help?
>> > >  jdbc.drivers=org.apache.beam.repackaged.beam.
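>> > >
>> > > As a rough sketch (the relocated driver class name below is a
>> > > hypothetical placeholder, not the real shaded name), the property
>> > > has to be in place before DriverManager is first used, because it
>> > > is only read during DriverManager's initialization:
>> > >
>> > >     // jdbc.drivers is read once, when DriverManager initializes,
>> > >     // so set it before the first JDBC call. The class name here is
>> > >     // a hypothetical placeholder for the shaded driver.
>> > >     System.setProperty("jdbc.drivers",
>> > >         "org.apache.beam.repackaged.beam_sdks_java.org.apache.calcite.jdbc.Driver");
>> > >
>> > >     // Equivalently, on the JVM command line:
>> > >     //   -Djdbc.drivers=<shaded driver class name>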
>> > >
>> > > Best,
>> > > Kai
>> > >
>> > > On 2018/07/24 16:56:50, Andrew Pilloud <[email protected]> wrote:
>> > > > Looks like Calcite isn't easily repackageable. This issue can be
>> > > > fixed either in our shading (by also rewriting the "jdbc:calcite:"
>> > > > string when we shade Calcite) or in Calcite (by not using the
>> > > > DriverManager to connect between Calcite modules).
>> > > >
>> > > > Andrew
>> > > >
>> > > > On Mon, Jul 23, 2018 at 11:18 PM Kai Jiang <[email protected]> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I hit an issue when running Beam SQL on Spark, and I want to
>> > > > > check whether anyone has seen the same thing. I believe getting
>> > > > > Beam SQL to run on Spark is important, so if you have
>> > > > > encountered the same problem, any input would be really helpful.
>> > > > >
>> > > > > Context:
>> > > > > I set up a TPC framework to run SQL on Spark. The code
>> > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
>> > > > > is simple: it just ingests CSV data and applies SQL to it. The
>> > > > > Gradle setting
>> > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
>> > > > > includes `runner-spark` and the necessary libraries. The
>> > > > > exception stack trace
>> > > > > <https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a>
>> > > > > shows the details. The same code runs successfully on Flink and
>> > > > > Dataflow.
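>> > > > >
>> > > > > As a hypothetical sketch of the pipeline shape (field names,
>> > > > > file path, and query are placeholders, and I'm assuming the
>> > > > > SqlTransform API; the linked code above is the real version):
>> > > > >
>> > > > >     import org.apache.beam.sdk.Pipeline;
>> > > > >     import org.apache.beam.sdk.extensions.sql.SqlTransform;
>> > > > >     import org.apache.beam.sdk.io.TextIO;
>> > > > >     import org.apache.beam.sdk.schemas.Schema;
>> > > > >     import org.apache.beam.sdk.transforms.MapElements;
>> > > > >     import org.apache.beam.sdk.values.Row;
>> > > > >     import org.apache.beam.sdk.values.TypeDescriptor;
>> > > > >
>> > > > >     // Placeholder schema and query; the real TPC code is linked above.
>> > > > >     Schema schema = Schema.builder()
>> > > > >         .addStringField("name")
>> > > > >         .addInt32Field("quantity")
>> > > > >         .build();
>> > > > >
>> > > > >     Pipeline p = Pipeline.create();
>> > > > >     p.apply(TextIO.read().from("/path/to/lineitem.csv"))
>> > > > >         .apply(MapElements.into(TypeDescriptor.of(Row.class))
>> > > > >             .via(line -> {
>> > > > >               String[] f = line.split(",");
>> > > > >               return Row.withSchema(schema)
>> > > > >                   .addValues(f[0], Integer.parseInt(f[1]))
>> > > > >                   .build();
>> > > > >             }))
>> > > > >         .setRowSchema(schema)
>> > > > >         .apply(SqlTransform.query(
>> > > > >             "SELECT name, SUM(quantity) FROM PCOLLECTION GROUP BY name"));
>> > > > >     p.run().waitUntilFinish();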
>> > > > >
>> > > > > Investigations:
>> > > > > BEAM-3386 <https://issues.apache.org/jira/browse/BEAM-3386> also
>> > > > > describes the issue I'm seeing. It took me some time to
>> > > > > investigate. I suspect a version conflict between the Calcite
>> > > > > library in Spark and Beam SQL's repackaged Calcite: the Calcite
>> > > > > version Spark (up through 2.3.1) uses is very old
>> > > > > (1.2.0-incubating).
>> > > > >
>> > > > > After packaging the fat jar and submitting it to Spark, Spark
>> > > > > registered both the old Calcite JDBC driver and Beam's
>> > > > > repackaged JDBC driver in registeredDrivers (DriverManager.java#L294
>> > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
>> > > > > JDBC's DriverManager always connects to Spark's old Calcite
>> > > > > driver instead of Beam's repackaged Calcite.
>> > > > >
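>> > > > > As a diagnostic sketch (standard JDBC API only, nothing
>> > > > > Beam-specific), the registered drivers can be listed to confirm
>> > > > > that both Calcite copies are there:
>> > > > >
>> > > > >     import java.sql.Driver;
>> > > > >     import java.sql.DriverManager;
>> > > > >     import java.util.Enumeration;
>> > > > >
>> > > > >     // Print every driver DriverManager knows about, along with
>> > > > >     // the classloader that loaded it, to tell the two Calcite
>> > > > >     // drivers apart.
>> > > > >     Enumeration<Driver> drivers = DriverManager.getDrivers();
>> > > > >     while (drivers.hasMoreElements()) {
>> > > > >       Driver d = drivers.nextElement();
>> > > > >       System.out.println(d.getClass().getName() + " @ "
>> > > > >           + d.getClass().getClassLoader());
>> > > > >     }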
>> > > > >
>> > > > > Looking into DriverManager.java#L556
>> > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
>> > > > > and inserting a breakpoint at
>> > > > >
>> > > > >     aClass = Class.forName(driver.getClass().getName(), true, classLoader);
>> > > > >
>> > > > > I see:
>> > > > >
>> > > > > driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
>> > > > > classLoader only contains classes under 'org.apache.beam.**' and
>> > > > > 'org.apache.beam.repackaged.beam_***'. (There is no
>> > > > > 'org.apache.calcite.*' class on its path.)
>> > > > >
>> > > > > Oddly, aClass is still assigned the class
>> > > > > "org.apache.calcite.jdbc.Driver". I thought this should raise an
>> > > > > exception so the driver would be skipped, but it did not. So
>> > > > > Spark's Calcite JDBC driver gets connected, and all logic
>> > > > > afterwards goes through Spark's Calcite classpath. I believe
>> > > > > that's the pivot point.
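>> > > > >
>> > > > > As a standalone sketch of that check (my assumption about why it
>> > > > > succeeds: the classloader delegates to a parent that can still
>> > > > > see Spark's Calcite classes, so Class.forName resolves there):
>> > > > >
>> > > > >     // Reproduce the DriverManager.isDriverAllowed() lookup. If
>> > > > >     // the loader delegates to a parent that has Spark's Calcite
>> > > > >     // on its classpath, forName succeeds even though the loader
>> > > > >     // itself contains no org.apache.calcite.* classes.
>> > > > >     ClassLoader classLoader =
>> > > > >         Thread.currentThread().getContextClassLoader();
>> > > > >     try {
>> > > > >       Class<?> aClass = Class.forName(
>> > > > >           "org.apache.calcite.jdbc.Driver", true, classLoader);
>> > > > >       System.out.println("resolved by " + aClass.getClassLoader());
>> > > > >     } catch (ClassNotFoundException e) {
>> > > > >       System.out.println("not visible to this loader");
>> > > > >     }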
>> > > > >
>> > > > > Potential solutions:
>> > > > > *1.* Figure out why DriverManager.java#L556
>> > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
>> > > > > does not throw an exception there.
>> > > > >
>> > > > > I think this is the best option.
>> > > > >
>> > > > > 2. Upgrade Spark's Calcite.
>> > > > >
>> > > > > This is not a good option, because the old Calcite version
>> > > > > affects many Spark versions.
>> > > > >
>> > > > > 3. Don't repackage the Calcite library.
>> > > > >
>> > > > > I tried this: I built the fat jar with non-repackaged Calcite,
>> > > > > but Spark still uses its own Calcite.
>> > > > >
>> > > > > Also, I am curious whether there is any specific reason we need
>> > > > > the repackaging strategy for Calcite. @Mingmin Xu <[email protected]>
>> > > > >
>> > > > >
>> > > > > Thanks for reading!
>> > > > >
>> > > > > Best,
>> > > > > Kai
>> > > > >
>> > > >
>> > >
>> >
>>
>
