GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/7625
[WIP] [skip ci] Fuzz testing in Spark SQL
This is a WIP pull request for expression fuzz-testing code that I'm working on
as part of a hackathon. I'm opening this pull request now in order to share the
code and to have a pull request that I can reference from my other pull
requests that fix bugs found by this tester.
### Features on my TODO list
- Better logging to aid debuggability.
- "Continuous" mode which dumps all results to files and keeps going when
errors occur (designed to run overnight).
- Validator which asserts that random queries return equivalent answers
when run under different configuration modes (safe vs. unsafe vs. safe without
codegen, plus a few other permutations).
- Plan transformer which takes valid logical query plans and transforms them
into equivalent ones, then checks that both the original and transformed plans
produce equivalent answers. This style of test is used in MySQL's testing tools.
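The validator item above could look something like the following Spark-free sketch, assuming each configuration mode is modeled as a function that runs a query and returns rows (the names `Mode`, `validate`, and `DifferentialValidator` are illustrative, not Spark APIs):

```scala
// Illustrative sketch only: in the real tool each "mode" would be a Spark
// context configured differently (safe vs. unsafe, codegen on or off, etc.).
object DifferentialValidator {
  // A mode runs a query string and returns the result rows.
  type Mode = String => Seq[Seq[Any]]

  // Run the query under every mode; report a mismatch if any two modes disagree.
  def validate(query: String, modes: Map[String, Mode]): Either[String, Seq[Seq[Any]]] = {
    val results = modes.map { case (name, run) => name -> run(query) }
    // Compare as sets of rows, since row order may legitimately differ across modes.
    val distinctAnswers = results.values.map(_.map(_.toList).toSet).toSet
    if (distinctAnswers.size <= 1) {
      Right(results.values.headOption.getOrElse(Seq.empty))
    } else {
      Left("Modes disagree on query '" + query + "': " +
        results.map { case (n, r) => s"$n -> $r" }.mkString("; "))
    }
  }
}
```

Comparing row sets rather than ordered sequences avoids false positives from plans that only differ in output order.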
### List of potential bugs found during this testing
Note that most of these bugs are problems in analysis error reporting and
not legitimate bugs in query execution. This tool isn't really capable of
finding "wrong answer" bugs yet because it lacks an oracle for determining what
the proper query answers are.
(:white_check_mark: indicates fixed, :construction: indicates a fix in
progress)
#### Analysis issues:
- The `createDataFrame()` methods should guard against `null` values being
passed in (e.g. the user passes `null` instead of a `Row`).
- :white_check_mark: The analyzer should check that join conditions have
BooleanType: #7630.
- :white_check_mark: The analyzer should ensure that set operations (union,
intersect, and except) are only performed on tables that have the same number
of columns: #7631
- :white_check_mark: Sorting based on array-typed columns should print an
error at analysis time, not runtime. #7633
- :white_check_mark: DataFrame.orderBy gives confusing analysis errors
when ordering based on nested columns:
https://issues.apache.org/jira/browse/SPARK-9323
- The `DATAFRAME_EAGER_ANALYSIS` configuration flag does not work properly
in all cases: even when eager analysis is disabled, there are still many
corner cases where invalid queries will eagerly throw analysis errors.
- Type mismatches in joins can produce confusing errors. Say we have two RDDs
with columns that share a name, but where one column is a struct and the other
is a boolean. If we try to join on a nested field, we can get a confusing
"Can't extract value" message instead of a more informative message that
explains that the types are mismatched:
```scala
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
val df2 = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": false}""" :: Nil))
df.join(df2, "a.b")

org.apache.spark.sql.AnalysisException: Can't extract value from a#26607;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:63)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:264)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:263)
  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
  at scala.collection.immutable.List.foldLeft(List.scala:84)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:127)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:137)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
  at org.apache.spark.sql.DataFrame.join(DataFrame.scala:404)
```
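As an illustration of the kind of analysis-time check involved in the set-operation fix above, here is a standalone sketch. This is not the actual Catalyst rule; `Schema` and `checkSetOperation` are made up for this example:

```scala
// Standalone sketch, not Catalyst code: fail fast at "analysis" time when the
// two sides of a set operation have different numbers of columns, instead of
// letting the mismatch surface as a confusing runtime error.
case class Schema(columns: Seq[String])

object SetOpCheck {
  def checkSetOperation(op: String, left: Schema, right: Schema): Unit = {
    if (left.columns.length != right.columns.length) {
      throw new IllegalArgumentException(
        s"$op can only be performed on tables with the same number of columns, " +
          s"but got ${left.columns.length} and ${right.columns.length}")
    }
  }
}
```

The point is simply that a structural precondition like this is cheap to verify up front and yields a far clearer message than a downstream execution failure.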
#### Execution issues:
- Scala's BigDecimal can lead to OutOfMemoryErrors when computing rows'
hashCodes (this is caused by
[SI-6173](https://issues.scala-lang.org/browse/SI-6173)): #7635
- :white_check_mark: When SortMergeJoin is enabled, we may get runtime
crashes when attempting to join on a struct field: #7645.
- :white_check_mark: CatalystTypeConverters.toScala may throw
UnsupportedOperationException when applied to an UnsafeRow: #7682
- :white_check_mark: [TungstenProject code generation fails for
`array<binary>` columns](https://issues.apache.org/jira/browse/SPARK-10038).
#### Expression issues:
- ~~`UTF8String.repeat` can throw `NegativeArraySizeException` when applied
to random bytes which have been cast to a string.~~ This is caused by extreme
array sizes which overflow `Int.MaxValue`.
- `UTF8String.reverse` can throw `ArrayIndexOutOfBoundsException` when
applied to random bytes which have been cast to a string.
- :white_check_mark: The methods in the `Unevaluable` trait should be
final, and some of the new aggregate functions should inherit from this
trait (#7627).
- For extremely small inputs, the results of the Remainder expression can
differ between the codegen and non-codegen paths:
```
(CAST(-2147483648, FloatType) % -1.8938038E-30) (types: List(FloatType, FloatType)) [-4.0832423E-31] did not equal [-8.263847E-31]
```
This is most likely a numeric stability issue.
- Code generation frequently crashes for expressions containing null
literals, but this isn't a problem that will impact users due to our codegen
fallback path.
- :white_check_mark: `NaNvl` should check that its two arguments are of the
same floating point type: #7882
- :white_check_mark: Code-generated numeric comparison expressions may fail
to compile for Boolean types: #7882
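The Remainder discrepancy above is plausibly a precision artifact. As a Spark-free illustration of how evaluating the same floating-point arithmetic along two code paths at different precisions can yield different answers (this does not reproduce the Remainder case itself):

```scala
// Sum ten copies of 0.1, once entirely in 32-bit floats and once in 64-bit
// doubles narrowed back to a float at the end. The rounding performed at each
// intermediate step differs, so the final answers differ.
val tenTenthsFloat: Float  = (1 to 10).foldLeft(0.0f)((acc, _) => acc + 0.1f)
val tenTenthsDouble: Float = (1 to 10).foldLeft(0.0)((acc, _) => acc + 0.1).toFloat

println(s"float path:  $tenTenthsFloat")   // 1.0000001
println(s"double path: $tenTenthsDouble")  // 1.0
```

An interpreted evaluator and a code-generated one that accumulate at different precisions (or in a different order) can disagree in exactly this way, which is why an oracle or cross-mode comparison is needed to classify such mismatches.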
##### Minor UX issues:
- The ORC writer could log a more informative error message when the user
isn't using a HiveSQLContext:
```
java.lang.ClassNotFoundException: org.apache.spark.sql.hive.orc.DefaultSource
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ddl.scala:206)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ddl.scala:313)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
```
- Confusing unresolved-alias errors are thrown somewhat later in analysis
than I'd expect. Ideally we would never see `UnresolvedException: Invalid call
to dataType on unresolved object`, since we should check for resolution before
inspecting the data types.
- `dropDuplicates` seems especially prone to this problem.
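For the ORC case above, a sketch of the friendlier error could look like the following. This is illustrative only, not the actual `ResolvedDataSource` code; the method name and message wording are assumptions:

```scala
// Illustrative only: wrap the provider class lookup so that a missing data
// source produces an actionable message instead of a bare
// ClassNotFoundException bubbling up from the classloader.
def lookupDataSource(provider: String): Class[_] = {
  try {
    Class.forName(provider)
  } catch {
    case e: ClassNotFoundException =>
      throw new ClassNotFoundException(
        s"Failed to find data source: $provider. If this is a Hive-backed source " +
          "such as ORC, make sure you are using a Hive-enabled SQLContext.", e)
  }
}
```

Keeping the original exception as the cause preserves the full stack trace for debugging while the message tells the user what to actually change.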
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark fuzz-test
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7625.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7625
----
commit 6f2b909e425aa8b00e386ca252f6830dc5a38e41
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:03:07Z
Fix SPARK-9292.
commit 03120d5b48e94e164ea4e8182c6acc0d08eb204e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:10:15Z
Check condition type in resolved()
commit e1f462ef3abec729ad8a533e98a5465c5ccb57b4
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T02:14:43Z
Initial commit for SQL expression fuzzing harness
commit f8daec768cd08a05a8b6564ac74d2fe8ced4b498
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T03:03:46Z
Apply implicit casts (in a hacky way for now)
commit df00e7a2474a1deb0bf7fb5a6a4719d1026ffbd5
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T22:40:21Z
More messy WIP prototyping on expression fuzzing
commit 2dcbc108e4da5671d03d7eefe9c91521618a94f9
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:02:00Z
Add some comments; speed up classpath search
commit c20a67997a26f066b9ef9e627927977215878f5c
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:07:37Z
Move dummy type coercion to a helper method
commit 95860dee6c2784869ae06a7acad1d4dc52eb7aec
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:30:59Z
More code cleanup and comments
commit abaed51744c7183f53629b28be5ed49ecdb28fff
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:39:53Z
Use non-mutable interpreted projection.
commit 129ad6c0d3c3bb6682a78682e76b0133a0e41eff
Author: Josh Rosen <[email protected]>
Date: 2015-07-11T23:44:26Z
Log expression after coercion
commit e1f91df228423d95f5ad6904819ce3d8bc60f09c
Author: Josh Rosen <[email protected]>
Date: 2015-07-13T00:50:07Z
Run tests in deterministic order
commit adc3c7f34c866f469d1f2d92e44568943539a784
Author: Josh Rosen <[email protected]>
Date: 2015-07-23T20:31:06Z
Test with random inputs of all types
commit ae5e1510e08a40e78c7aa1ca6669f4ce414d2b70
Author: Josh Rosen <[email protected]>
Date: 2015-07-23T22:28:48Z
Ignore BinaryType for now, since it led to some spurious failures.
commit a35420840bd4bbd13a678ea24feca5129fa3796d
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T00:07:23Z
Begin to add a DataFrame API fuzzer.
commit 13f8c560b7103a13d47a673eaaef152830bddeed
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T00:11:33Z
Don't puts nulls into the DataFrame
commit dd16f4dd5f6ffe1247c03b62bfddc5e7130d9ae5
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T01:01:54Z
Print logical plans.
commit 7f2b771f7fe36bf27c8cc7e680fdbff252bad3d1
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T02:33:14Z
Fuzzer improvements.
commit 326d759c0a407a2ca9ea8a946ff84764031929b3
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T02:33:50Z
Fix SPARK-9293
commit 4a2c684bff8af2f0fb051dd133b36b8755bd6891
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:06:22Z
Merge branch 'SPARK-9293' into fuzz-test
commit 37e4ce82807930f207066cca7efbd3195bd3d17e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:31:06Z
Support methods that take varargs Column parameters.
commit 558f04ad1c2cefb59c91137400e9889fe7e05fed
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T03:46:58Z
Merge remote-tracking branch 'origin/master' into fuzz-test
commit 2f1b802839b88e3850d3333892fa23127b04486f
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T04:36:04Z
Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
commit c0889c0fef61d7c44020737973ee540bba6c4793
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T04:38:25Z
Merge branch 'SPARK-9295' into fuzz-test
commit d7a35358e2068eca9bdead2b93f3b96dcaf890d8
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:02:13Z
[SPARK-9303] Decimal should use java.math.Decimal directly instead of via Scala wrapper
commit 74bbc8c9739dbd4d52de76e75699fee8e0d0533e
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:07:44Z
Merge branch 'SPARK-9303' into fuzz-test
commit bfe1451ec40bcf20f85c5b6fc7bcc1f23bdc6c91
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T06:18:17Z
Update to allow sorting by null literals
commit 7a7ec4dc3d1161de6b013499d9ea19f362edae32
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T07:09:08Z
Merge branch 'SPARK-9295' into fuzz-test
commit 55221fa51136920a11da22690ae53c59c865a7a7
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-24T15:17:43Z
Shouldn't use SortMergeJoin when joining on unsortable columns.
commit a2407074dc34672f71c33d671d285809969bfc78
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-24T15:58:26Z
Use forall instead of exists for readability.
commit dc94314444faceaa9e620b3b8f047977420132ac
Author: Josh Rosen <[email protected]>
Date: 2015-07-24T16:29:56Z
Merge remote-tracking branch 'origin/pr/7645/head' into fuzz-test
----