hvanhovell opened a new pull request, #43195:
URL: https://github.com/apache/spark/pull/43195

   ### What changes were proposed in this pull request?
   This PR fixes shading for the Spark Connect Scala Client Maven build. The following things are addressed:
   - Guava and protobuf are now included in the shaded jar. They were missing, causing users to see `ClassNotFoundException`s.
   - Fixed duplicate shading of Guava. We now use the relocation defined in the parent pom.
   - Fixed the duplicate Netty dependency (one shaded, one transitive). One copy was used by gRPC and the other was needed by Arrow. This is fixed by pulling Arrow into the shaded jar.
   - Use the same target package for shading as the one defined in the parent pom (a minimal sketch of such a relocation follows below).
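   
   For illustration, a minimal sketch of the kind of `maven-shade-plugin` setup this describes. The artifact list and relocation pattern below are placeholders, not necessarily the values this PR uses; the real configuration lives in the client's `pom.xml` and the parent pom:
   ```xml
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <artifactSet>
         <includes>
           <!-- Bundle the dependencies that were missing from the shaded jar. -->
           <include>com.google.guava:*</include>
           <include>com.google.protobuf:*</include>
           <include>org.apache.arrow:*</include>
         </includes>
       </artifactSet>
       <relocations>
         <!-- Relocate to the same prefix the parent pom defines; the prefix here is a placeholder. -->
         <relocation>
           <pattern>com.google.common</pattern>
           <shadedPattern>org.sparkproject.guava</shadedPattern>
         </relocation>
       </relocations>
     </configuration>
     <executions>
       <execution>
         <phase>package</phase>
         <goals><goal>shade</goal></goals>
       </execution>
     </executions>
   </plugin>
   ```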
   
   ### Why are the changes needed?
   The Maven artifacts for the Spark Connect Scala Client are currently broken.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Manual tests.
   #### Step 1: Build the new shaded library and install it in the local Maven repository
   `build/mvn clean install -pl connector/connect/client/jvm -am -DskipTests`
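   
   Optionally (not part of the original test steps), the shaded jar can be sanity-checked for the previously missing classes. The path below assumes the standard local-repository layout and the Scala 2.12 build:
   ```bash
   # Look for bundled (relocated) Guava and protobuf classes in the shaded jar.
   JAR=~/.m2/repository/org/apache/spark/spark-connect-client-jvm_2.12/3.5.1-SNAPSHOT/spark-connect-client-jvm_2.12-3.5.1-SNAPSHOT.jar
   jar tf "$JAR" | grep -i -e guava -e protobuf | head
   ```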
   #### Step 2: Start Connect Server
   `connector/connect/bin/spark-connect`
   #### Step 3: Launch the REPL using the newly created library
   This step requires [coursier](https://get-coursier.io/) to be installed.
   `cs launch --scala 2.12.17 -r m2Local com.lihaoyi:::ammonite:2.5.8 org.apache.spark::spark-connect-client-jvm:3.5.1-SNAPSHOT -M org.apache.spark.sql.application.ConnectRepl`
   #### Step 4: Run a bunch of commands:
   ```scala
   // Check version
   spark.version
   
   // Run a simple query
   spark.range(1, 10000, 1)
     .select($"id", $"id" % 5 as "group", rand(1).as("v1"), rand(2).as("v2"))
     .groupBy($"group")
     .agg(
       avg($"v1").as("v1_avg"),
       avg($"v2").as("v2_avg"))
     .show()
   
   // Run a streaming query
   import org.apache.spark.sql.execution.streaming.ProcessingTimeTrigger
   val query_name = "simple_streaming"
   val stream = spark.readStream
     .format("rate")
     .option("numPartitions", "1")
     .option("rowsPerSecond", "10")
     .load()
     .withWatermark("timestamp", "10 milliseconds")
     .groupBy(window(col("timestamp"), "10 milliseconds"))
     .count()
     .selectExpr("window.start as timestamp", "count as num_events")
     .writeStream
     .format("memory")
     .queryName(query_name)
     .trigger(ProcessingTimeTrigger.create("10 milliseconds"))
   // run for 20 seconds
   val query = stream.start()
   val start = System.currentTimeMillis()
   val end = System.currentTimeMillis() + 20 * 1000
   while (System.currentTimeMillis() < end) {
     println(s"time: ${System.currentTimeMillis() - start} ms")
     println(query.status)
     spark.sql(s"select * from ${query_name}").show()
     Thread.sleep(500)
   }
   query.stop()
   
   ```

