[spark] branch branch-3.0 updated: [SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly

wenchen Sun, 26 Apr 2020 21:29:46 -0700

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new a48872a  [SPARK-31388][SQL][TESTS] 
org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly
a48872a is described below

commit a48872a0ac0d7fbacea06f9a85e9a630597ee544
Author: Juliusz Sompolski <[email protected]>
AuthorDate: Mon Apr 27 04:27:55 2020 +0000

    [SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite 
doesn't match results correctly
    
    ### What changes were proposed in this pull request?
    
    `CliSuite.runCliWithin` was not matching for expected results correctly. It 
was matching for expected lines anywhere in stdout or stderr.
    On the example of `Single command with --database` test:
    In
    ```
        runCliWithin(2.minute)(
          "CREATE DATABASE hive_db_test;"
            -> "",
          "USE hive_test;"
            -> "",
          "CREATE TABLE hive_test(key INT, val STRING);"
            -> "",
          "SHOW TABLES;"
            -> "hive_test"
        )
    ```
    It was looking for lines containing "", "", "" and then "hive_test".
    However, the string "hive_test" was contained in "hive_test_db", and hence:
    ```
    2020-04-08 17:53:12,752 INFO  CliSuite - 2020-04-08 17:53:12.752 - stderr> 
Spark master: local, Application Id: local-1586368384172
    2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 
0: ""
    2020-04-08 17:53:12,765 INFO  CliSuite - 2020-04-08 17:53:12.765 - stdout> 
spark-sql> CREATE DATABASE hive_db_test;
    2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 
1: ""
    2020-04-08 17:53:17,688 INFO  CliSuite - 2020-04-08 17:53:17.688 - stderr> 
chgrp: changing ownership of 
'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group 
of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
    2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 
2: ""
    2020-04-08 17:53:18,069 INFO  CliSuite - 2020-04-08 17:53:18.069 - stderr> 
Time taken: 5.265 seconds
    2020-04-08 17:53:18,087 INFO  CliSuite - 2020-04-08 17:53:18.087 - stdout> 
spark-sql> USE hive_test;
    2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 
3: "hive_test"
    2020-04-08 17:53:21,742 INFO  CliSuite - Found all expected output.
    ```
    Because of that, it could kill the CLI process without really even creating 
the table. This was not expected. The test could be flaky depending on whether 
process.destroy() in the finally block managed to kill it before it actually 
creates the table.
    
    I make the output checking more robust to not match on unexpected output, 
by making it check the echo of query output on the CLI. Also, wait for the CLI 
process to finish gracefully (triggered by closing its stdin), instead of 
killing it forcibly.
    
    ### Why are the changes needed?
    
    org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test 
outputs as expected.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests in CLISuite. Tested several times with no flakiness. Was 
getting flaky results almost on every run before.
    ```
    [info] CliSuite:
    [info] - load warehouse dir from hive-site.xml (12 seconds, 568 
milliseconds)
    [info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
    [info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 
653 milliseconds)
    [info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 
milliseconds)
    [info] - Simple commands (16 seconds, 238 milliseconds)
    [info] - Single command with -e (9 seconds, 967 milliseconds)
    [info] - Single command with --database (21 seconds, 205 milliseconds)
    [info] - Commands using SerDe provided in --jars (15 seconds, 51 
milliseconds)
    [info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path 
(14 seconds, 625 milliseconds)
    [info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
    [info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 
seconds, 424 milliseconds)
    [info] - list jars (9 seconds, 520 milliseconds)
    [info] - list jar <jarfile> (9 seconds, 277 milliseconds)
    [info] - list files (9 seconds, 828 milliseconds)
    [info] - list file <filepath> (9 seconds, 646 milliseconds)
    [info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
    [info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
    [info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
    [info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 
seconds, 49 milliseconds)
    [info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 
seconds, 210 milliseconds)
    [info] - SPARK-26321 Should not split semicolon within quoted string 
literals (12 seconds, 729 milliseconds)
    [info] - Pad Decimal numbers with trailing zeros to the scale of the column 
(10 seconds, 381 milliseconds)
    [info] - SPARK-30049 Should not complain for quotes in commented lines (10 
seconds, 935 milliseconds)
    [info] - SPARK-30049 Should not complain for quotes in commented with 
multi-lines (20 seconds, 731 milliseconds)
    ```
    
    Closes #28156 from juliuszsompolski/SPARK-31388.
    
    Authored-by: Juliusz Sompolski <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    (cherry picked from commit 560bd5401f65b7813b0ac4856cf9934d2ed24594)
    Signed-off-by: Wenchen Fan <[email protected]>
---
 .../spark/sql/hive/thriftserver/CliSuite.scala     | 70 +++++++++++++++-------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
index c393054..abefb46 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
+++ 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
@@ -87,9 +87,25 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll 
with BeforeAndAfterE
       useExternalHiveFile: Boolean = false)(
       queriesAndExpectedAnswers: (String, String)*): Unit = {
 
-    val (queries, expectedAnswers) = queriesAndExpectedAnswers.unzip
     // Explicitly adds ENTER for each statement to make sure they are actually 
entered into the CLI.
-    val queriesString = queries.map(_ + "\n").mkString
+    val queriesString = queriesAndExpectedAnswers.map(_._1 + "\n").mkString
+    // spark-sql echoes the queries on STDOUT, expect first an echo of the 
query, then the answer.
+    val expectedAnswers = queriesAndExpectedAnswers.flatMap {
+      case (query, answer) =>
+        if (query == "") {
+          // empty query means a command launched with -e
+          Seq(answer)
+        } else {
+          // spark-sql echoes the submitted queries
+          val queryEcho = query.split("\n").toList match {
+            case firstLine :: tail =>
+              s"spark-sql> $firstLine" :: tail.map(l => s"         > $l")
+          }
+          // longer lines sometimes get split in the output,
+          // match the first 60 characters of each query line
+          queryEcho.map(_.take(60)) :+ answer
+        }
+    }
 
     val extraHive = if (useExternalHiveFile) {
       s"--driver-class-path 
${System.getProperty("user.dir")}/src/test/noclasspath"
@@ -122,10 +138,13 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
     def captureOutput(source: String)(line: String): Unit = lock.synchronized {
       // This test suite sometimes gets extremely slow out of unknown reason 
on Jenkins.  Here we
       // add a timestamp to provide more diagnosis information.
-      buffer += s"${new Timestamp(new Date().getTime)} - $source> $line"
+      val newLine = s"${new Timestamp(new Date().getTime)} - $source> $line"
+      log.info(newLine)
+      buffer += newLine
 
       // If we haven't found all expected answers and another expected answer 
comes up...
       if (next < expectedAnswers.size && line.contains(expectedAnswers(next))) 
{
+        log.info(s"$source> found expected output line $next: 
'${expectedAnswers(next)}'")
         next += 1
         // If all expected answers have been found...
         if (next == expectedAnswers.size) {
@@ -153,6 +172,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll 
with BeforeAndAfterE
 
     try {
       ThreadUtils.awaitResult(foundAllExpectedAnswers.future, timeout)
+      log.info("Found all expected output.")
     } catch { case cause: Throwable =>
       val message =
         s"""
@@ -161,8 +181,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll 
with BeforeAndAfterE
            |=======================
            |Spark SQL CLI command line: ${command.mkString(" ")}
            |Exception: $cause
-           |Executed query $next "${queries(next)}",
-           |But failed to capture expected output "${expectedAnswers(next)}" 
within $timeout.
+           |Failed to capture next expected output "${expectedAnswers(next)}" 
within $timeout.
            |
            |${buffer.mkString("\n")}
            |===========================
@@ -172,7 +191,13 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
       logError(message, cause)
       fail(message, cause)
     } finally {
-      process.destroy()
+      if (!process.waitFor(1, MINUTES)) {
+        try {
+          fail("spark-sql did not exit gracefully.")
+        } finally {
+          process.destroy()
+        }
+      }
     }
   }
 
@@ -233,7 +258,8 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll 
with BeforeAndAfterE
         -> "",
       "SHOW TABLES;"
         -> "hive_test",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE hive_test;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE hive_test;""".stripMargin
         -> "",
       "CACHE TABLE hive_test;"
         -> "",
@@ -250,18 +276,18 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
 
   test("Single command with --database") {
     runCliWithin(2.minute)(
-      "CREATE DATABASE hive_test_db;"
+      "CREATE DATABASE hive_db_test;"
         -> "",
-      "USE hive_test_db;"
+      "USE hive_db_test;"
         -> "",
-      "CREATE TABLE hive_test(key INT, val STRING);"
+      "CREATE TABLE hive_table_test(key INT, val STRING);"
         -> "",
       "SHOW TABLES;"
-        -> "hive_test"
+        -> "hive_table_test"
     )
 
-    runCliWithin(2.minute, Seq("--database", "hive_test_db", "-e", "SHOW 
TABLES;"))(
-      "" -> "hive_test"
+    runCliWithin(2.minute, Seq("--database", "hive_db_test", "-e", "SHOW 
TABLES;"))(
+      "" -> "hive_table_test"
     )
   }
 
@@ -273,12 +299,12 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
 
     runCliWithin(3.minute, Seq("--jars", s"$jarFile"))(
       """CREATE TABLE t1(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTable (key INT, val STRING) USING hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE 
sourceTable;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTable;""".stripMargin
         -> "",
       "INSERT INTO TABLE t1 SELECT key, val FROM sourceTable;"
         -> "",
@@ -299,12 +325,12 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
       3.minute,
       Seq("--conf", s"spark.hadoop.${ConfVars.HIVEAUXJARS}=$hiveContribJar"))(
       """CREATE TABLE addJarWithHiveAux(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTableForWithHiveAux (key INT, val STRING) USING 
hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE 
sourceTableForWithHiveAux;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTableForWithHiveAux;""".stripMargin
         -> "",
       "INSERT INTO TABLE addJarWithHiveAux SELECT key, val FROM 
sourceTableForWithHiveAux;"
         -> "",
@@ -419,12 +445,12 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with BeforeAndAfterE
       3.minute)(
       s"ADD JAR ${hiveContribJar};" -> "",
       """CREATE TABLE addJarWithSQL(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTableForWithSQL(key INT, val STRING) USING hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE 
sourceTableForWithSQL;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTableForWithSQL;""".stripMargin
         -> "",
       "INSERT INTO TABLE addJarWithSQL SELECT key, val FROM 
sourceTableForWithSQL;"
         -> "",


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch branch-3.0 updated: [SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly

Reply via email to