GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/7625
[WIP] [skip ci] Fuzz testing in Spark SQL
This is a WIP pull request for expression fuzz-testing code that I'm working on
as part of a hackathon. I'm opening this pull request now in order to share the
code and to have a pull request that I can reference from my other pull
requests that fix bugs found by this tester.
### Features on my TODO list
- Better logging to aid debuggability.
- "Continuous" mode which dumps all results to files and keeps going when
errors occur (designed to run overnight).
- Validator which asserts that random queries return equivalent answers
when run under different configuration modes (safe vs. unsafe vs. safe without
codegen, plus a few other permutations).
- Plan transformer which takes valid logical query plans and transforms them
into equivalent ones, then checks that both the original and transformed plans
produce equivalent answers. This style of test is used in MySQL's testing tools.
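The validator item above could look something like the following Spark-free sketch, assuming each configuration mode is modeled as a function that runs a query and returns rows (the names `Mode`, `validate`, and `DifferentialValidator` are illustrative, not Spark APIs):

```scala
// Illustrative sketch only: in the real tool each "mode" would be a Spark
// context configured differently (safe vs. unsafe, codegen on or off, etc.).
object DifferentialValidator {
  // A mode runs a query string and returns the result rows.
  type Mode = String => Seq[Seq[Any]]

  // Run the query under every mode; report a mismatch if any two modes disagree.
  def validate(query: String, modes: Map[String, Mode]): Either[String, Seq[Seq[Any]]] = {
    val results = modes.map { case (name, run) => name -> run(query) }
    // Compare as sets of rows, since row order may legitimately differ across modes.
    val distinctAnswers = results.values.map(_.map(_.toList).toSet).toSet
    if (distinctAnswers.size <= 1) {
      Right(results.values.headOption.getOrElse(Seq.empty))
    } else {
      Left("Modes disagree on query '" + query + "': " +
        results.map { case (n, r) => s"$n -> $r" }.mkString("; "))
    }
  }
}
```

Comparing row sets rather than ordered sequences avoids false positives from plans that only differ in output order.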
### List of potential bugs found during this testing
Note that most of these bugs are problems in analysis error reporting and
not legitimate bugs in query execution. This tool isn't really capable of
finding "wrong answer" bugs yet because it lacks an oracle for determining what
the proper query answers are.
(:white_check_mark: indicates fixed, :construction: indicates a fix in
progress)
#### Analysis issues:
- The `createDataFrame()` methods should guard against `null` values being
passed in (e.g. the user passes `null` instead of a `Row`).
- :white_check_mark: The analyzer should check that join conditions have
BooleanType: #7630.
- :white_check_mark: The analyzer should ensure that set operations (union,
intersect, and except) are only performed on tables that have the same number
of columns: #7631
- :white_check_mark: Sorting based on array-typed columns should print an
error at analysis time, not runtime. #7633
- :white_check_mark: DataFrame.orderBy gives confusing analysis errors
when ordering based on nested columns:
https://issues.apache.org/jira/browse/SPARK-9323
- The `DATAFRAME_EAGER_ANALYSIS` configuration flag does not work properly
in all cases: even when eager analysis is disabled, there are still many
corner cases where invalid queries will eagerly throw analysis errors.
- Type mismatches in joins can produce confusing errors. Say we have two RDDs
with columns that share a name, but where one column is a struct and the other
is a boolean. If we try to join on a nested field, we can get a confusing
"Can't extract value" message instead of a more informative message that
explains that the types are mismatched:
```scala
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
val df2 = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": false}""" :: Nil))
df.join(df2, "a.b")

org.apache.spark.sql.AnalysisException: Can't extract value from a#26607;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:63)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:264)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:263)
  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
  at scala.collection.immutable.List.foldLeft(List.scala:84)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:127)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:137)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
  at org.apache.spark.sql.DataFrame.join(DataFrame.scala:404)
```
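As an illustration of the kind of analysis-time check involved in the set-operation fix above, here is a standalone sketch. This is not the actual Catalyst rule; `Schema` and `checkSetOperation` are made up for this example:

```scala
// Standalone sketch, not Catalyst code: fail fast at "analysis" time when the
// two sides of a set operation have different numbers of columns, instead of
// letting the mismatch surface as a confusing runtime error.
case class Schema(columns: Seq[String])

object SetOpCheck {
  def checkSetOperation(op: String, left: Schema, right: Schema): Unit = {
    if (left.columns.length != right.columns.length) {
      throw new IllegalArgumentException(
        s"$op can only be performed on tables with the same number of columns, " +
          s"but got ${left.columns.length} and ${right.columns.length}")
    }
  }
}
```

The point is simply that a structural precondition like this is cheap to verify up front and yields a far clearer message than a downstream execution failure.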
#### Execution issues:
- Scala's BigDecimal can lead to OutOfMemoryErrors when computing rows'
hashCodes (this is caused by
[SI-6173](https://issues.scala-lang.org/browse/SI-6173)): #7635
- :white_check_mark: When SortMergeJoin is enabled, we may get runtime
crashes when attempting to join on a struct field: #7645.
- :white_check_mark: CatalystTypeConverters.toScala may throw
UnsupportedOperationException when applied to an UnsafeRow: #7682
- :white_check_mark: [TungstenProject code generation fails for
`array<binary>` columns](https://issues.apache.org/jira/browse/SPARK-10038).
#### Expression issues:
- ~~`UTF8String.repeat` can throw `NegativeArraySizeException` when applied
to random bytes which have been cast to a string.~~ This is caused by extreme
array sizes which overflow `Int.MaxValue`.
- `UTF8String.reverse` can throw `ArrayIndexOutOfBoundsException` when
applied to random bytes which have been cast to a string.
- :white_check_mark: The methods in the `Unevaluable` trait should be
final, and some of the new aggregate functions should inherit from this
trait (#7627).
- For extremely small inputs, the results of the Remainder expression can
differ between the codegen and non-codegen paths:
```
(CAST(-2147483648, FloatType) % -1.8938038E-30) (types: List(FloatType, FloatType)) [-4.0832423E-31] did not equal [-8.263847E-31]
```
This is most likely a numeric stability issue.
- Code generation frequently crashes for expressions containing null
literals, but this isn't a problem that will impact users due to our codegen
fallback path.
- :white_check_mark: `NaNvl` should check that its two arguments are of the
same floating point type: #7882
- :white_check_mark: Code-generated numeric comparison expressions may fail
to compile for Boolean types: #7882
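The Remainder discrepancy above is plausibly a precision artifact. As a Spark-free illustration of how evaluating the same floating-point arithmetic along two code paths at different precisions can yield different answers (this does not reproduce the Remainder case itself):

```scala
// Sum ten copies of 0.1, once entirely in 32-bit floats and once in 64-bit
// doubles narrowed back to a float at the end. The rounding performed at each
// intermediate step differs, so the final answers differ.
val tenTenthsFloat: Float  = (1 to 10).foldLeft(0.0f)((acc, _) => acc + 0.1f)
val tenTenthsDouble: Float = (1 to 10).foldLeft(0.0)((acc, _) => acc + 0.1).toFloat

println(s"float path:  $tenTenthsFloat")   // 1.0000001
println(s"double path: $tenTenthsDouble")  // 1.0
```

An interpreted evaluator and a code-generated one that accumulate at different precisions (or in a different order) can disagree in exactly this way, which is why an oracle or cross-mode comparison is needed to classify such mismatches.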
##### Minor UX issues:
- The ORC writer could log a more informative error message when the user
isn't using a HiveSQLContext:
```
java.lang.ClassNotFoundException: org.apache.spark.sql.hive.orc.DefaultSource
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ddl.scala:206)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ddl.scala:313)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
```
- Confusing unresolved-alias errors are thrown somewhat later in analysis
than I'd expect. Ideally we would never see `UnresolvedException: Invalid call
to dataType on unresolved object`, since we should check for resolution before
inspecting the data types.
- `dropDuplicates` seems especially prone to this problem.
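For the ORC case above, a sketch of the friendlier error could look like the following. This is illustrative only, not the actual `ResolvedDataSource` code; the method name and message wording are assumptions:

```scala
// Illustrative only: wrap the provider class lookup so that a missing data
// source produces an actionable message instead of a bare
// ClassNotFoundException bubbling up from the classloader.
def lookupDataSource(provider: String): Class[_] = {
  try {
    Class.forName(provider)
  } catch {
    case e: ClassNotFoundException =>
      throw new ClassNotFoundException(
        s"Failed to find data source: $provider. If this is a Hive-backed source " +
          "such as ORC, make sure you are using a Hive-enabled SQLContext.", e)
  }
}
```

Keeping the original exception as the cause preserves the full stack trace for debugging while the message tells the user what to actually change.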
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark fuzz-test
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7625.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7625
----
commit 6f2b909e425aa8b00e386ca252f6830dc5a38e41
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:03:07Z
Fix SPARK-9292.
commit 03120d5b48e94e164ea4e8182c6acc0d08eb204e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:10:15Z
Check condition type in resolved()
commit e1f462ef3abec729ad8a533e98a5465c5ccb57b4
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T02:14:43Z
Initial commit for SQL expression fuzzing harness
commit f8daec768cd08a05a8b6564ac74d2fe8ced4b498
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T03:03:46Z
Apply implicit casts (in a hacky way for now)
commit df00e7a2474a1deb0bf7fb5a6a4719d1026ffbd5
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T22:40:21Z
More messy WIP prototyping on expression fuzzing
commit 2dcbc108e4da5671d03d7eefe9c91521618a94f9
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:02:00Z
Add some comments; speed up classpath search
commit c20a67997a26f066b9ef9e627927977215878f5c
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:07:37Z
Move dummy type coercion to a helper method
commit 95860dee6c2784869ae06a7acad1d4dc52eb7aec
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:30:59Z
More code cleanup and comments
commit abaed51744c7183f53629b28be5ed49ecdb28fff
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:39:53Z
Use non-mutable interpreted projection.
commit 129ad6c0d3c3bb6682a78682e76b0133a0e41eff
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:44:26Z
Log expression after coercion
commit e1f91df228423d95f5ad6904819ce3d8bc60f09c
Author: Josh Rosen <[email protected]>
Date: 2015-07-13T00:50:07Z
Run tests in deterministic order
commit adc3c7f34c866f469d1f2d92e44568943539a784
Author: Josh Rosen <[email protected]>
Date: 2015-07-23T20:31:06Z
Test with random inputs of all types
commit ae5e1510e08a40e78c7aa1ca6669f4ce414d2b70
Author: Josh Rosen <[email protected]>
Date: 2015-07-23T22:28:48Z
Ignore BinaryType for now, since it led to some spurious failures.
commit a35420840bd4bbd13a678ea24feca5129fa3796d
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T00:07:23Z
Begin to add a DataFrame API fuzzer.
commit 13f8c560b7103a13d47a673eaaef152830bddeed
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T00:11:33Z
Don't puts nulls into the DataFrame
commit dd16f4dd5f6ffe1247c03b62bfddc5e7130d9ae5
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:01:54Z
Print logical plans.
commit 7f2b771f7fe36bf27c8cc7e680fdbff252bad3d1
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T02:33:14Z
Fuzzer improvements.
commit 326d759c0a407a2ca9ea8a946ff84764031929b3
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T02:33:50Z
Fix SPARK-9293
commit 4a2c684bff8af2f0fb051dd133b36b8755bd6891
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:06:22Z
Merge branch 'SPARK-9293' into fuzz-test
commit 37e4ce82807930f207066cca7efbd3195bd3d17e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:31:06Z
Support methods that take varargs Column parameters.
commit 558f04ad1c2cefb59c91137400e9889fe7e05fed
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:46:58Z
Merge remote-tracking branch 'origin/master' into fuzz-test
commit 2f1b802839b88e3850d3333892fa23127b04486f
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T04:36:04Z
Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
commit c0889c0fef61d7c44020737973ee540bba6c4793
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T04:38:25Z
Merge branch 'SPARK-9295' into fuzz-test
commit d7a35358e2068eca9bdead2b93f3b96dcaf890d8
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:02:13Z
[SPARK-9303] Decimal should use java.math.Decimal directly instead of via Scala wrapper
commit 74bbc8c9739dbd4d52de76e75699fee8e0d0533e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:07:44Z
Merge branch 'SPARK-9303' into fuzz-test
commit bfe1451ec40bcf20f85c5b6fc7bcc1f23bdc6c91
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:18:17Z
Update to allow sorting by null literals
commit 7a7ec4dc3d1161de6b013499d9ea19f362edae32
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T07:09:08Z
Merge branch 'SPARK-9295' into fuzz-test
commit 55221fa51136920a11da22690ae53c59c865a7a7
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-24T15:17:43Z
Shouldn't use SortMergeJoin when joining on unsortable columns.
commit a2407074dc34672f71c33d671d285809969bfc78
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-24T15:58:26Z
Use forall instead of exists for readability.
commit dc94314444faceaa9e620b3b8f047977420132ac
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T16:29:56Z
Merge remote-tracking branch 'origin/pr/7645/head' into fuzz-test
----