fwc commented on code in PR #56190:
URL: https://github.com/apache/spark/pull/56190#discussion_r3359763783


##########
sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:
##########
@@ -1211,7 +1211,7 @@ object QueryTest extends Assertions {
 
 }
 
-class QueryTestSuite extends test.SharedSparkSession {
+class QueryTestSuite extends QueryTest with SparkSessionBinder {

Review Comment:
   For scala tests in `sql/core`, I envision a test style (and corresponding 
infra) that encourages:
   
   1. Testing at the [Spark SQL/DataFrame 
API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/index.html)
 level (i.e. discourages/prevents accessing internals)
   1. Targeting both classic _and connect_
   1. A rather [DAMP than 
DRY](https://abseil.io/resources/swe-book/html/ch12.html#tests_and_code_sharing_dampcomma_not_dr)
 style (i.e. tests as `spark` and `df` ops + checks with little/no 
'abstraction')
   
   While _DAMP, not DRY_ would mostly be enshrined in guidelines like 
`AGENTS.md`, a testing style guide, or a `sql/core/src/test/README.md`, I think 
that _testing the API using classic and connect_ can be made easier by further 
tweaking and extending the testing APIs / traits.
   
   In some sense, this 'proposal' isn't proposing something 'new': Many (most?) 
suites are already more _DAMP_ than _DRY_ and with the introduction of 
`sql.SparkSessionProvider` and its usage in `QueryTest`, writing 
classic/connect-agnostic tests is already much easier than before. But (a) 
afaics _DAMP NOT DRY_ is not (yet) codified anywhere and (b) I think the 
testing traits can be further improved to make 'doing the right thing' even 
'more easier'.
   
   So, I propose the following:
   
   - __Introduce a fully classic/connect-agnostic `SparkSessionTest` as the 
default SQL/DataFrame API testing trait__
   - __Encourage the addition of 'connect variants'__ by e.g. adding a linter 
that warns/fails if suite `X` implements `sql.SparkSessionTest` but there is no 
corresponding `class Y extends X with connect.SparkSessionTest` or (if 
possible), automagically generating the connect variant.
   - __Discourage access to internals for newly added suites__ by e.g. adding 
them in subpackage of `o.a.s.sql.test` or maybe even in a separate compilation 
unit that only has access to the scala API and test helpers.
   
   
   ## A prototypical Example
   
   This example aims to demonstrate how a suite that adheres to this style and 
uses `SparkSessionTest` could look like.
   This example is inspired by https://github.com/apache/spark/pull/55571.
   
   ### The main suite
   
   A prototypical suite extends `SparkSessionTest`, which provides `spark` and 
useful helpers like `checkAnswer`.
   
   The class defines both the necessary setup (e.g. creating tables, setting 
confs) and corresponding test cases. The test cases generally consist of Spark 
SQL/DataFrame operations and assertions (either literal `assert`s or e.g. 
`checkAnswer`).
   
   The example suite is _not_ located in the same package as the code under 
test to discourage accessing internals. In this example, internals are accessed 
using a helper object from the package of the code under test.
   
   ```scala
   // located in a different package than the code that is tested
   package org.apache.spark.sql.test.tablecache
   
   import org.apache.spark.sql.tablecache.TableCacheHelper
   // ...
   
   class TableCacheSuite extends SparkSessionTest {
     override def beforeAll = {
       spark.sql(s"CREATE TABLE $testTable (id INT, salary INT) USING 
mockedDSv2")
       spark.sql(s"INSERT INTO $testTable VALUES (1, 100), (10, 1000)")
     }
     override def afterAll  = {
       spark.sql(s"DROP TABLE $testTable")
     }
   
     override def sparkConf: SparkConf =
       super.sparkConf.set( /* set necessary confs */ )
   
   
     // the test mostly consists of DataFrame operations,
     // util calls like `checkAnswer`, and asserts
   
     test("SPARK-54022: cached table pinned against external data write") {
       spark.table(testTable).cache()
       assert(spark.catalog.isCached(testTable))
       checkAnswer(spark.table(testTable), Seq(Row(1, 100)))
   
       // object that accesses internals
       TableCacheHelper.externalAppend(testTable, Row(2, 200))
   
       checkAnswer(spark.table(testTable), Seq(Row(1, 100)))
   
       spark.sql(s"REFRESH TABLE $testTable")
       checkAnswer(spark.table(testTable), Seq(Row(1, 100), Row(2, 200)))
   
       assert(spark.catalog.isCached(testTable))
     }
   
     test("connector w/ cache: temp view stale after external column removal") {
       withView("v") {
         spark.table(testTable).filter("salary < 
999").createOrReplaceTempView("v")
         checkAnswer(spark.table("v"), Seq(Row(1, 100)))
   
         TableCacheHelper.externalDropCol(testTable, "salary")
   
         checkAnswer(spark.table("v"), Seq(Row(1, 100)))
   
         spark.sql(s"REFRESH TABLE $testTable").collect()
         checkError(
           exception = intercept[AnalysisException] { 
spark.table("v").collect() },
           condition = 
"INCOMPATIBLE_COLUMN_CHANGES_AFTER_VIEW_WITH_PLAN_CREATION",
         )
       }
     }
   }
   ```
   
   ### The 'connect variant'
   
   Besides the main suite, there is a 'connect variant', which runs the same 
test via Spark Connect:
   
   
   ```scala
   package org.apache.spark.sql.connect.test.tablecache
   
   class TableCacheConnectSuite extends TableCacheSuite with SparkSessionTest
   ```
   
   Ideally, this variant would be auto-generated, so that 
classic/connect-agnostic testing is the default (i.e. opt-out), rather than 
something that the developer actively has to strive for.
   
   ### The `SparkSessionTest` helper trait
   
   The `SparkSessionTest` used would be the default util trait for 
SparkSession/DataFrame-level tests, providing a `sql.SparkSession` and utils 
like `checkAnswer` and `withTable`.
   
   The `SparkSessionTest` trait is designed to be 'classic/connect-agnostic': 
it only provides utils/APIs that can be used/sensibly overriden in Spark 
Connect, so that we can have a `connect.SparkSessionTest` sibling trait to 
facilitate the implementation of the 'connect variant'.
   
   ```scala
   package org.apache.spark.sql
   
   trait SparkSessionTest extends SparkFunSuite {
     def spark: SparkSession = ...
     def sql: ...
   
     def checkAnswer(df: DataFrame, exp: Seq[Row]) = ...
     def checkError(...) = ...
   
     def withTable(...) = ...
     def withView(...) = ...
     def withTempDir(...) = ...
   
     // TODO can checks on the plan be classic/connect-agnostic?
   }
   ```
   
   This trait aims to supercede `SharedSparkSession` and `QueryTest`. They 
shall be deprecated with the suggestion to migrate towards 
`sql.SparkSessionTest`.
   
   A `classic.SparkSessionTest` can also be added for classic-only tests.
   
   ## What's wrong with `SharedSparkSession` and `QueryTest`?
   
   I want to deprecate `SharedSparkSession` because
   
   - It cannot be overriden with a `connect.SharedSparkSession` so 
classic/connect-agnostic tests would require that testcases are declared in 
traits/abstract classes.
   - It provides a `classic.SparkSession` while currently being the most widely 
used test trait and the easiest thing to reach for. Changing it would break 
downstream (e.g. tests in Delta Lake), so I think its usage needs to be 
actively discouraged.
   - It is widely used so the deprecation note will be widely read and can be 
used to advertise the 'new way'.
   
   I want to deprecate `QueryTest` because 
   
   - it is not fully connect/classic-agnostic.
   - it provides too many peculiar/specific/rarely used methods.
   - it provides implicits.
   
   ## On defaults, simplicity, and friction 
   
   I propose `SparkSessionTest` as _the default testing trait_ to simplify 
things a bit: smaller than `QueryTest` but still a one-stop-shop.
   
   Adding a new Suite should be as simple as possible: `class X extends 
SparkSessionTest`, some setup and _damp_ testcases. No need to declare 
testcases in an abstract class or trait.
   
   ## Structuring test code with inheritance: _variant_ extends _suite_ extends 
_base_
   
   I suggest the following hierarchy for test classes/traits:
   
   1. __Base__: provides generic utils for its suites, contains no testcases 
(here: `SparkSessionTest`)
   2. __Suite__: contains test cases (here: `TableCacheSuite`)
   3. __Variants__: contains overrides (here: `TableCacheConnectSuite`, which 
overrides `spark`)
   
   While this proposal focuses on 'connect variants', I think the idea can be 
generalized to e.g. cover different configuration values like e.g. 'codegen 
on/off'.
   
   ---
   
   <details><summary>Appendix: patterns that this proposal discourages</summary>
   <p>
   
   The following are exaggerations/caricatures of existing patterns that I 
believe
   to be 'suboptimal'. This proposal aims to discourage such patterns.
   
   ### Too much `classic` API usage
   
   `SharedSparkSession` is by far the most widely used test trait 
(extended/mixed
   in ~500 times, while `QueryTest` is used only ~150 times).
   
   The problem: it provides a `classic.SparkSession`, which makes
   'classic/connect-agnostic' testing difficult in two ways:
   First, `def spark` cannot be overriden to use connect.
   Second, usage of classic-only (read: connect-incompatible) APIs is not
   discouraged.
   
   ```
   class FooSuite extends SharedSparkSession {
     test("...") {
       // classic-only, should use `spark.catalog.setCurrentDatabase(db)`
       spark.sessionState.catalogManager.setCurrentNamespace(Array(db))
     }
   }
   ```
   
   With this proposal, `SparkSessionTest` hides the used `classic.SparkSession`
   behind the `sql` interface so accessing `spark.sessionState.catalogManager` 
   would give a "Cannot resolve symbol catalogmanager" error in the IDE.
   
   ### DRY to the bone
   
   Generally, I do not like concisely abstracted tests like the caricature:
   
   ```
   abstract class InsertSuite extends QueryTest with InsertMetricCheck {
     // creates table t in beforeEach, deletes in afterEach
   
     test("insert 3 rows") {
       val data = Seq((1L, "a"), (2L, "b"), (3L, "c"))
       doInsert(t, data)
       checkInsertMetrics(t, numInsertedRows = 3)
       verifyTable(t, data.toDf())
     }
   }
   
   // potentially in another file:
   class InsertSQLSuite extends InsertSuite {
     def doInsert(t, d) = sql(s"INSERT INTO $t VALUES ${d.mkString}")
   }
   class InsertDfwSuite extends InsertSuite {
     def doInsert(t, d) = d.toDf().insert.write.insertInto(t)
   }
   ```
   
   I find such testcases unnecessarily hard to read: the logic is scattered over
   across functions and files.
   
   While I see the appeal in abstracting over the SQL- and DataFrame-APIs, I 
think
   tests should not hide the concrete SQL- and DataFrame-operations.
   (c.f. [Tests and Code Sharing: Damp, Not 
Dry](https://abseil.io/resources/swe-book/html/ch12.html#tests_and_code_sharing_dampcomma_not_dr))
   
   So I think the strawman should be refactored to the following:
   
   ```
   class InsertSuite extends SparkSessionTest {
     test("insert 3 rows via SQL") {
       withTable("foo") {
         spark.sql("CREATE TABLE foo (id INT, name STRING)")
         spark.sql("INSERT INTO foo VALUES (1, 'a'), (2, 'b'), (3, 'c')")
         TableInternalsHelper.getCommits("foo").last match {
           case Insert(numRowsInserted) => assert(numRowsInserted === 3)
           case c => fail(s"expected Insert commit after inserting, but got $c")
         }
         checkAnswer(
           spark.table("foo"),
           Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")),
         )
       }
     }
   
     test("insert 3 rows via dataframe writer API") {
       withTable("foo") {
         spark.sql("CREATE TABLE foo (id INT, name STRING)")
         spark.createDataFrame(Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")))
           .write
           .insertInto("foo")
         TableInternalsHelper.getCommits("foo").last match {
           case Insert(numRowsInserted) => assert(numRowsInserted === 3)
           case c => fail(s"expected Insert commit after inserting, but got $c")
         }
         checkAnswer(
           spark.table("foo"),
           Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")),
         )
       }
     }
   }
   ```
   
   ### Long inheritance chains / convoluted matrix tests
   
   When Inspecting the Type Hierarchy of e.g. SharedSparkSession, there are 
quite
   a few _deep_ inheritance chains.
   
   The following shows the chain of `DeltaBasedMergeIntoTableSuiteBase`. The 
`...Base` classes are abstract.
   
   ```
   RowLevelOperationSuiteBase                   0 cases  setup and helpers
   .MergeIntoTableSuiteBase                    71 cases
   ..DeltaBasedMergeIntoTableSuiteBase         +7 cases  overrides expected 
metrics
   ...DeltaBasedMergeIntoTableSuite            +8 cases  sets 'supports-deltas'
   ...DeltaBasedMergeIntoTableUpdeAsDelete...  +1 case   sets 
'supports-deltas', 'split-updates'
   ....DeltaBasedMergeIntoTableNoCodegenSuite  +0 cases  disables codegen via 
sparkConf
   ```
   
   As an ideal, I think we should strive for 3 levels (Base, Suite, Variants), 
like e.g.:
   
   ```
   RowLevelOperationSuiteBase
   .MergeIntoTableSuite
   ..MergeIntoTableUpdeAsDeleteAndInsertSuite
   ..MergeIntoTableNoCodegenSuite
   ..MergeIntoTableDeltaBasedSuite
   ..MergeIntoTableDeltaBasedUpdeAsDeleteA...
   ..MergeIntoTableDeltaBasedNoCodegenSuite
   ```
   
   ### Awkward classic/connect-agnostic tests
   
   Some tests introdoce new helper constructs to achieve 
classic/connect-agnosticity.
   
   ```
     test("foo") {
       withTestSession { session =>
         session.sql(s"CREATE TABLE $testTable (id INT, salary INT) USING 
foo").collect()
         session.sql(s"INSERT INTO $testTable VALUES (1, 100)").collect()
         checkRows(session.sql(s"SELECT * FROM $testTable"), Seq(Row(1, 100)))
         checkRows(session.sql(s"SELECT * FROM $testTable"), Seq(Row(1, 100), 
Row(2, 200)))
       }
     }
   ```
   
   [This 
diff](https://github.com/apache/spark/pull/56190/changes/6b5dba77015378c074f5bfbd6c26304a1675abff..6f509d8a2c5dfaeb60df8e8485a3e1e7083efc24)
 shows that these new helpers are not necessary.
   
   </p>
   </details> 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to