fwc commented on code in PR #56190:
URL: https://github.com/apache/spark/pull/56190#discussion_r3359763783
##########
sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:
##########
@@ -1211,7 +1211,7 @@ object QueryTest extends Assertions {
}
-class QueryTestSuite extends test.SharedSparkSession {
+class QueryTestSuite extends QueryTest with SparkSessionBinder {
Review Comment:
For scala tests in `sql/core`, I envision a test style (and corresponding
infra) that encourages:
1. Testing at the [Spark SQL/DataFrame
API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/index.html)
level (i.e. discourages/prevents accessing internals)
1. Targeting both classic _and connect_
1. A rather [DAMP than
DRY](https://abseil.io/resources/swe-book/html/ch12.html#tests_and_code_sharing_dampcomma_not_dr)
style (i.e. tests as `spark` and `df` ops + checks with little/no
'abstraction')
While _DAMP, not DRY_ would mostly be enshrined in guidelines like
`AGENTS.md`, a testing style guide, or a `sql/core/src/test/README.md`, I think
that _testing the API using classic and connect_ can be made easier by further
tweaking and extending the testing APIs / traits.
In some sense, this 'proposal' isn't proposing something 'new': Many (most?)
suites are already more _DAMP_ than _DRY_ and with the introduction of
`sql.SparkSessionProvider` and its usage in `QueryTest`, writing
classic/connect-agnostic tests is already much easier than before. But (a)
afaics _DAMP NOT DRY_ is not (yet) codified anywhere and (b) I think the
testing traits can be further improved to make 'doing the right thing' even
'more easier'.
So, I propose the following:
- __Introduce a fully classic/connect-agnostic `SparkSessionTest` as the
default SQL/DataFrame API testing trait__
- __Encourage the addition of 'connect variants'__ by e.g. adding a linter
that warns/fails if suite `X` implements `sql.SparkSessionTest` but there is no
corresponding `class Y extends X with connect.SparkSessionTest` or (if
possible), automagically generating the connect variant.
- __Discourage access to internals for newly added suites__ by e.g. adding
them in subpackage of `o.a.s.sql.test` or maybe even in a separate compilation
unit that only has access to the scala API and test helpers.
## A prototypical Example
This example aims to demonstrate how a suite that adheres to this style and
uses `SparkSessionTest` could look like.
This example is inspired by https://github.com/apache/spark/pull/55571.
### The main suite
A prototypical suite extends `SparkSessionTest`, which provides `spark` and
useful helpers like `checkAnswer`.
The class defines both the necessary setup (e.g. creating tables, setting
confs) and corresponding test cases. The test cases generally consist of Spark
SQL/DataFrame operations and assertions (either literal `assert`s or e.g.
`checkAnswer`).
The example suite is _not_ located in the same package as the code under
test to discourage accessing internals. In this example, internals are accessed
using a helper object from the package of the code under test.
```scala
// located in a different package than the code that is tested
package org.apache.spark.sql.test.tablecache
import org.apache.spark.sql.tablecache.TableCacheHelper
// ...
class TableCacheSuite extends SparkSessionTest {
override def beforeAll = {
spark.sql(s"CREATE TABLE $testTable (id INT, salary INT) USING
mockedDSv2")
spark.sql(s"INSERT INTO $testTable VALUES (1, 100), (10, 1000)")
}
override def afterAll = {
spark.sql(s"DROP TABLE $testTable")
}
override def sparkConf: SparkConf =
super.sparkConf.set( /* set necessary confs */ )
// the test mostly consists of DataFrame operations,
// util calls like `checkAnswer`, and asserts
test("SPARK-54022: cached table pinned against external data write") {
spark.table(testTable).cache()
assert(spark.catalog.isCached(testTable))
checkAnswer(spark.table(testTable), Seq(Row(1, 100)))
// object that accesses internals
TableCacheHelper.externalAppend(testTable, Row(2, 200))
checkAnswer(spark.table(testTable), Seq(Row(1, 100)))
spark.sql(s"REFRESH TABLE $testTable")
checkAnswer(spark.table(testTable), Seq(Row(1, 100), Row(2, 200)))
assert(spark.catalog.isCached(testTable))
}
test("connector w/ cache: temp view stale after external column removal") {
withView("v") {
spark.table(testTable).filter("salary <
999").createOrReplaceTempView("v")
checkAnswer(spark.table("v"), Seq(Row(1, 100)))
TableCacheHelper.externalDropCol(testTable, "salary")
checkAnswer(spark.table("v"), Seq(Row(1, 100)))
spark.sql(s"REFRESH TABLE $testTable").collect()
checkError(
exception = intercept[AnalysisException] {
spark.table("v").collect() },
condition =
"INCOMPATIBLE_COLUMN_CHANGES_AFTER_VIEW_WITH_PLAN_CREATION",
)
}
}
}
```
### The 'connect variant'
Besides the main suite, there is a 'connect variant', which runs the same
test via Spark Connect:
```scala
package org.apache.spark.sql.connect.test.tablecache
class TableCacheConnectSuite extends TableCacheSuite with SparkSessionTest
```
Ideally, this variant would be auto-generated, so that
classic/connect-agnostic testing is the default (i.e. opt-out), rather than
something that the developer actively has to strive for.
### The `SparkSessionTest` helper trait
The `SparkSessionTest` used would be the default util trait for
SparkSession/DataFrame-level tests, providing a `sql.SparkSession` and utils
like `checkAnswer` and `withTable`.
The `SparkSessionTest` trait is designed to be 'classic/connect-agnostic':
it only provides utils/APIs that can be used/sensibly overriden in Spark
Connect, so that we can have a `connect.SparkSessionTest` sibling trait to
facilitate the implementation of the 'connect variant'.
```scala
package org.apache.spark.sql
trait SparkSessionTest extends SparkFunSuite {
def spark: SparkSession = ...
def sql: ...
def checkAnswer(df: DataFrame, exp: Seq[Row]) = ...
def checkError(...) = ...
def withTable(...) = ...
def withView(...) = ...
def withTempDir(...) = ...
// TODO can checks on the plan be classic/connect-agnostic?
}
```
This trait aims to supercede `SharedSparkSession` and `QueryTest`. They
shall be deprecated with the suggestion to migrate towards
`sql.SparkSessionTest`.
A `classic.SparkSessionTest` can also be added for classic-only tests.
## What's wrong with `SharedSparkSession` and `QueryTest`?
I want to deprecate `SharedSparkSession` because
- It cannot be overriden with a `connect.SharedSparkSession` so
classic/connect-agnostic tests would require that testcases are declared in
traits/abstract classes.
- It provides a `classic.SparkSession` while currently being the most widely
used test trait and the easiest thing to reach for. Changing it would break
downstream (e.g. tests in Delta Lake), so I think its usage needs to be
actively discouraged.
- It is widely used so the deprecation note will be widely read and can be
used to advertise the 'new way'.
I want to deprecate `QueryTest` because
- it is not fully connect/classic-agnostic.
- it provides too many peculiar/specific/rarely used methods.
- it provides implicits.
## On defaults, simplicity, and friction
I propose `SparkSessionTest` as _the default testing trait_ to simplify
things a bit: smaller than `QueryTest` but still a one-stop-shop.
Adding a new Suite should be as simple as possible: `class X extends
SparkSessionTest`, some setup and _damp_ testcases. No need to declare
testcases in an abstract class or trait.
## Structuring test code with inheritance: _variant_ extends _suite_ extends
_base_
I suggest the following hierarchy for test classes/traits:
1. __Base__: provides generic utils for its suites, contains no testcases
(here: `SparkSessionTest`)
2. __Suite__: contains test cases (here: `TableCacheSuite`)
3. __Variants__: contains overrides (here: `TableCacheConnectSuite`, which
overrides `spark`)
While this proposal focuses on 'connect variants', I think the idea can be
generalized to e.g. cover different configuration values like e.g. 'codegen
on/off'.
---
<details><summary>Appendix: patterns that this proposal discourages</summary>
<p>
The following are exaggerations/caricatures of existing patterns that I
believe
to be 'suboptimal'. This proposal aims to discourage such patterns.
### Too much `classic` API usage
`SharedSparkSession` is by far the most widely used test trait
(extended/mixed
in ~500 times, while `QueryTest` is used only ~150 times).
The problem: it provides a `classic.SparkSession`, which makes
'classic/connect-agnostic' testing difficult in two ways:
First, `def spark` cannot be overriden to use connect.
Second, usage of classic-only (read: connect-incompatible) APIs is not
discouraged.
```
class FooSuite extends SharedSparkSession {
test("...") {
// classic-only, should use `spark.catalog.setCurrentDatabase(db)`
spark.sessionState.catalogManager.setCurrentNamespace(Array(db))
}
}
```
With this proposal, `SparkSessionTest` hides the used `classic.SparkSession`
behind the `sql` interface so accessing `spark.sessionState.catalogManager`
would give a "Cannot resolve symbol catalogmanager" error in the IDE.
### DRY to the bone
Generally, I do not like concisely abstracted tests like the caricature:
```
abstract class InsertSuite extends QueryTest with InsertMetricCheck {
// creates table t in beforeEach, deletes in afterEach
test("insert 3 rows") {
val data = Seq((1L, "a"), (2L, "b"), (3L, "c"))
doInsert(t, data)
checkInsertMetrics(t, numInsertedRows = 3)
verifyTable(t, data.toDf())
}
}
// potentially in another file:
class InsertSQLSuite extends InsertSuite {
def doInsert(t, d) = sql(s"INSERT INTO $t VALUES ${d.mkString}")
}
class InsertDfwSuite extends InsertSuite {
def doInsert(t, d) = d.toDf().insert.write.insertInto(t)
}
```
I find such testcases unnecessarily hard to read: the logic is scattered over
across functions and files.
While I see the appeal in abstracting over the SQL- and DataFrame-APIs, I
think
tests should not hide the concrete SQL- and DataFrame-operations.
(c.f. [Tests and Code Sharing: Damp, Not
Dry](https://abseil.io/resources/swe-book/html/ch12.html#tests_and_code_sharing_dampcomma_not_dr))
So I think the strawman should be refactored to the following:
```
class InsertSuite extends SparkSessionTest {
test("insert 3 rows via SQL") {
withTable("foo") {
spark.sql("CREATE TABLE foo (id INT, name STRING)")
spark.sql("INSERT INTO foo VALUES (1, 'a'), (2, 'b'), (3, 'c')")
TableInternalsHelper.getCommits("foo").last match {
case Insert(numRowsInserted) => assert(numRowsInserted === 3)
case c => fail(s"expected Insert commit after inserting, but got $c")
}
checkAnswer(
spark.table("foo"),
Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")),
)
}
}
test("insert 3 rows via dataframe writer API") {
withTable("foo") {
spark.sql("CREATE TABLE foo (id INT, name STRING)")
spark.createDataFrame(Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")))
.write
.insertInto("foo")
TableInternalsHelper.getCommits("foo").last match {
case Insert(numRowsInserted) => assert(numRowsInserted === 3)
case c => fail(s"expected Insert commit after inserting, but got $c")
}
checkAnswer(
spark.table("foo"),
Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")),
)
}
}
}
```
### Long inheritance chains / convoluted matrix tests
When Inspecting the Type Hierarchy of e.g. SharedSparkSession, there are
quite
a few _deep_ inheritance chains.
The following shows the chain of `DeltaBasedMergeIntoTableSuiteBase`. The
`...Base` classes are abstract.
```
RowLevelOperationSuiteBase 0 cases setup and helpers
.MergeIntoTableSuiteBase 71 cases
..DeltaBasedMergeIntoTableSuiteBase +7 cases overrides expected
metrics
...DeltaBasedMergeIntoTableSuite +8 cases sets 'supports-deltas'
...DeltaBasedMergeIntoTableUpdeAsDelete... +1 case sets
'supports-deltas', 'split-updates'
....DeltaBasedMergeIntoTableNoCodegenSuite +0 cases disables codegen via
sparkConf
```
As an ideal, I think we should strive for 3 levels (Base, Suite, Variants),
like e.g.:
```
RowLevelOperationSuiteBase
.MergeIntoTableSuite
..MergeIntoTableUpdeAsDeleteAndInsertSuite
..MergeIntoTableNoCodegenSuite
..MergeIntoTableDeltaBasedSuite
..MergeIntoTableDeltaBasedUpdeAsDeleteA...
..MergeIntoTableDeltaBasedNoCodegenSuite
```
### Awkward classic/connect-agnostic tests
Some tests introdoce new helper constructs to achieve
classic/connect-agnosticity.
```
test("foo") {
withTestSession { session =>
session.sql(s"CREATE TABLE $testTable (id INT, salary INT) USING
foo").collect()
session.sql(s"INSERT INTO $testTable VALUES (1, 100)").collect()
checkRows(session.sql(s"SELECT * FROM $testTable"), Seq(Row(1, 100)))
checkRows(session.sql(s"SELECT * FROM $testTable"), Seq(Row(1, 100),
Row(2, 200)))
}
}
```
[This
diff](https://github.com/apache/spark/pull/56190/changes/6b5dba77015378c074f5bfbd6c26304a1675abff..6f509d8a2c5dfaeb60df8e8485a3e1e7083efc24)
shows that these new helpers are not necessary.
</p>
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]