GitHub user lnmohankumar opened a pull request:
https://github.com/apache/spark/pull/17456
Branch 2.1
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17456.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17456
----
commit 5693ac8e5bd5df8aca1b0d6df0be072a45abcfbd
Author: Xiangrui Meng <[email protected]>
Date: 2016-12-14T00:59:09Z
[SPARK-18793][SPARK-18794][R] add spark.randomForest/spark.gbt to vignettes
## What changes were proposed in this pull request?
Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content
minimal since users can type `?spark.randomForest` to see the full doc.
cc: jkbradley
Author: Xiangrui Meng <[email protected]>
Closes #16264 from mengxr/SPARK-18793.
(cherry picked from commit 594b14f1ebd0b3db9f630e504be92228f11b4d9f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 019d1fa3d421b5750170429fc07b204692b7b58e
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-14T02:36:36Z
[SPARK-18588][TESTS] Ignore KafkaSourceStressForDontFailOnDataLossSuite
## What changes were proposed in this pull request?
Disable KafkaSourceStressForDontFailOnDataLossSuite for now.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16275 from zsxwing/ignore-flaky-test.
(cherry picked from commit e104e55c16e229e521c517393b8163cbc3bbf85a)
Signed-off-by: Reynold Xin <[email protected]>
commit 8ef005931a242d087f4879805571be0660aefaf9
Author: [email protected] <[email protected]>
Date: 2016-12-14T02:52:05Z
[MINOR][SPARKR] fix kstest example error and add unit test
## What changes were proposed in this pull request?
While adding vignettes for kstest, I found some errors in the example:
1. There is a typo in kstest;
2. print.summary.KStest doesn't work with the example.
This PR fixes the example errors and adds a new unit test for
print.summary.KStest.
## How was this patch tested?
Manual test;
Add new unit test;
Author: [email protected] <[email protected]>
Closes #16259 from wangmiao1981/ks.
(cherry picked from commit f2ddabfa09fda26ff0391d026dd67545dab33e01)
Signed-off-by: Yanbo Liang <[email protected]>
commit f999312e72940b559738048646013eec9e68d657
Author: Nattavut Sutyanyong <[email protected]>
Date: 2016-12-14T10:09:31Z
[SPARK-18814][SQL] CheckAnalysis rejects TPCDS query 32
## What changes were proposed in this pull request?
Move the checking of GROUP BY column in correlated scalar subquery from
CheckAnalysis
to Analysis to fix a regression caused by SPARK-18504.
This problem can be reproduced with a simple script now.
```scala
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
```
The requirements are:
1. We need to reference the same table twice in both the parent and the
subquery. Here, that table is c.
2. We need a correlated predicate to a different table. Here, it goes from c
(as c1) in the subquery to p in the parent.
3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the
`Project` above the `Aggregate` of `avg`. When we then compare `ck#<n1>#<n2>`
with the original group-by column `ck#<n1>` by their canonicalized forms,
#<n2> != #<n1>, which triggers the exception added in SPARK-18504.
## How was this patch tested?
SubquerySuite and a simplified version of TPCDS-Q32
Author: Nattavut Sutyanyong <[email protected]>
Closes #16246 from nsyca/18814.
(cherry picked from commit cccd64393ea633e29d4a505fb0a7c01b51a79af8)
Signed-off-by: Herman van Hovell <[email protected]>
commit 16d4bd4a25e70e9396b3451a53157f7cc41c1359
Author: Cheng Lian <[email protected]>
Date: 2016-12-14T18:57:03Z
[SPARK-18730] Post Jenkins test report page instead of the full console
output page to GitHub
## What changes were proposed in this pull request?
Currently, the full console output page of a Spark Jenkins PR build can be
as large as several megabytes. It takes a relatively long time to load and may
even freeze the browser for quite a while.
This PR makes the build script post the test report page link to GitHub
instead. The test report page is way more concise and is usually the first page
I'd like to check when investigating a Jenkins build failure.
Note that for builds where a test report is not available (ongoing builds
and builds that fail before test execution), the test report link automatically
redirects to the build page.
## How was this patch tested?
N/A.
Author: Cheng Lian <[email protected]>
Closes #16163 from liancheng/jenkins-test-report.
(cherry picked from commit ba4aab9b85688141d3d0c185165ec7a402c9fbba)
Signed-off-by: Reynold Xin <[email protected]>
commit af12a21ca7145751acdec400134b1bd5c8168f74
Author: hyukjinkwon <[email protected]>
Date: 2016-12-14T19:29:11Z
[SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side
post-filter for FileFormat datasources
## What changes were proposed in this pull request?
Currently, `FileSourceStrategy` does not handle the case when the
pushed-down filter is `Literal(null)` and removes it at the post-filter in
Spark-side.
For example, the codes below:
```scala
val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF()
df.filter($"_1" === "true").explain(true)
```
shows it keeps `null` properly.
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- LocalRelation [_1#17]
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#17 as double) = cast(true as double))
+- LocalRelation [_1#17]
== Optimized Logical Plan ==
Filter (isnotnull(_1#17) && null)
+- LocalRelation [_1#17]
== Physical Plan ==
*Filter (isnotnull(_1#17) && null) << Here `null` is there
+- LocalTableScan [_1#17]
```
However, when we read it back from Parquet,
```scala
val path = "/tmp/testfile"
df.write.parquet(path)
spark.read.parquet(path).filter($"_1" === "true").explain(true)
```
`null` is removed at the post-filter.
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#11] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#11 as double) = cast(true as double))
+- Relation[_1#11] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#11) && null)
+- Relation[_1#11] parquet
== Physical Plan ==
*Project [_1#11]
+- *Filter isnotnull(_1#11) << Here `null` is missing
+- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat,
Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null],
PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```
This PR fixes it so that the `null` filter is kept properly. In more detail, in
```scala
val partitionKeyFilters =
ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
```
the `null` is kept in `partitionKeyFilters` because a `Literal` never has
`children`, so its `references` set is empty, which is trivially a subset of
`partitionSet`.
And then in
```scala
val afterScanFilters = filterSet -- partitionKeyFilters
```
the `null` is always removed from the post-filter. So, a filter whose
referenced fields are empty should also be applied to the data columns.
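As a rough, illustrative sketch of that idea (made-up names, not the actual
`FileSourceStrategy` code): only filters that actually reference a partition
column should be subtracted from the Spark-side post-filter, so a
reference-free filter such as a pushed-down null literal stays.
```scala
case class ColumnFilter(name: String, references: Set[String])

// Subtract only the filters that truly reference a partition column; filters
// with an empty reference set (e.g. a pushed-down null literal) remain in the
// Spark-side post-filter.
def afterScanFilters(
    allFilters: Set[ColumnFilter],
    partitionCols: Set[String]): Set[ColumnFilter] = {
  val partitionOnly = allFilters.filter(f =>
    f.references.nonEmpty && f.references.subsetOf(partitionCols))
  allFilters -- partitionOnly
}
```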
After this PR, it becomes as below:
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#276] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#276 as double) = cast(true as double))
+- Relation[_1#276] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#276) && null)
+- Relation[_1#276] parquet
== Physical Plan ==
*Project [_1#276]
+- *Filter (isnotnull(_1#276) && null)
+- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat,
Location:
InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b...,
PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema:
struct<_1:boolean>
```
## How was this patch tested?
Unit test in `FileSourceStrategySuite`
Author: hyukjinkwon <[email protected]>
Closes #16184 from HyukjinKwon/SPARK-18753.
(cherry picked from commit 89ae26dcdb73266fbc3a8b6da9f5dff30dc4ec95)
Signed-off-by: Cheng Lian <[email protected]>
commit e8866f9fc62095b78421d461549f7eaf8e9070b3
Author: Reynold Xin <[email protected]>
Date: 2016-12-14T20:22:49Z
[SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating
statistics
## What changes were proposed in this pull request?
This patch reduces the default number element estimation for arrays and
maps from 100 to 1. The issue with the 100 number is that when nested (e.g. an
array of map), 100 * 100 would be used as the default size. This sounds like
just an overestimation which doesn't seem that bad (since it is usually better
to overestimate than underestimate). However, due to the way we estimate the
output size of a Project (new estimated column size / old estimated column
size), this overestimation can become an underestimation. In this case it is
generally safer to assume a default of 1 element.
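A back-of-the-envelope illustration of the numbers involved (purely
illustrative, not Spark code):
```scala
// All numbers are invented to illustrate the sizing described above.
val oldDefaultElements = 100
val oldNestedEstimate = oldDefaultElements * oldDefaultElements // 10000 for an array of maps
val newDefaultElements = 1
val newNestedEstimate = newDefaultElements * newDefaultElements // 1
// A size ratio computed against the inflated 10000-element estimate is tiny,
// so the Project's output size can end up drastically underestimated.
println(s"old nested default: $oldNestedEstimate, new nested default: $newNestedEstimate")
```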
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #16274 from rxin/SPARK-18853.
(cherry picked from commit 5d799473696a15fddd54ec71a93b6f8cb169810c)
Signed-off-by: Herman van Hovell <[email protected]>
commit c4de90fc76d5aa5d2c8fee4ed692d4ab922cbab0
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-14T21:36:41Z
[SPARK-18852][SS] StreamingQuery.lastProgress should be null when
recentProgress is empty
## What changes were proposed in this pull request?
Right now `StreamingQuery.lastProgress` throws a NoSuchElementException, which
makes it hard to use from Python since the Python user will just see a
Py4jError.
This PR just makes it return null instead.
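A minimal sketch of the new behavior (illustrative only, not the actual
StreamingQuery code):
```scala
// Return null instead of throwing when no progress has been reported yet.
def lastProgress(recentProgress: Array[String]): String =
  if (recentProgress.isEmpty) null else recentProgress.last
```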
## How was this patch tested?
`test("lastProgress should be null when recentProgress is empty")`
Author: Shixiong Zhu <[email protected]>
Closes #16273 from zsxwing/SPARK-18852.
(cherry picked from commit 1ac6567bdb03d7cc5c5f3473827a102280cb1030)
Signed-off-by: Shixiong Zhu <[email protected]>
commit d0d9c5725774897703f2611484838ec7ed09e84f
Author: Joseph K. Bradley <[email protected]>
Date: 2016-12-14T22:10:40Z
[SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes
## What changes were proposed in this pull request?
Added short section for KSTest.
Also added logreg model to list of ML models in vignette. (This will be
reorganized under SPARK-18849)

## How was this patch tested?
Manually tested example locally.
Built vignettes locally.
Author: Joseph K. Bradley <[email protected]>
Closes #16283 from jkbradley/ksTest-vignette.
(cherry picked from commit 78627425708a0afbe113efdf449e8622b43b652d)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 280c35af97a20b15578c14b20aa8c19d8fe75456
Author: Reynold Xin <[email protected]>
Date: 2016-12-15T00:12:14Z
[SPARK-18854][SQL] numberedTreeString and apply(i) inconsistent for
subqueries
## What changes were proposed in this pull request?
This is a bug introduced by subquery handling. numberedTreeString (which
uses generateTreeString under the hood) numbers trees including innerChildren
(used to print subqueries), but apply (which uses getNodeNumbered) ignores
innerChildren. As a result, apply(i) would return the wrong plan node if there
are subqueries.
This patch fixes the bug.
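A generic sketch of the consistency requirement (illustrative, not the actual
TreeNode code; the traversal order shown is only an example): numbering and
lookup must walk the same children, including inner children such as
subqueries, in the same depth-first order.
```scala
case class Node(name: String, children: Seq[Node] = Nil, innerChildren: Seq[Node] = Nil)

// Number nodes depth-first, visiting inner children as well as regular children.
def numbered(n: Node): Seq[Node] =
  n +: (n.innerChildren ++ n.children).flatMap(numbered)

// Lookup reuses the exact same traversal, so it is consistent by construction.
def nodeAt(root: Node, i: Int): Node = numbered(root)(i)
```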
## How was this patch tested?
Added a test case in SubquerySuite.scala to test both the depth-first
traversal of numbering as well as making sure the two methods are consistent.
Author: Reynold Xin <[email protected]>
Closes #16277 from rxin/SPARK-18854.
(cherry picked from commit ffdd1fcd1e8f4f6453d5b0517c0ce82766b8e75f)
Signed-off-by: Reynold Xin <[email protected]>
commit 0d94201e0102fd5890ba07da6dd518cec7334b2b
Author: [email protected] <[email protected]>
Date: 2016-12-15T01:07:27Z
[SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates
## What changes were proposed in this pull request?
While doing the QA work, I found the following issues:
1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). The `spark.lda` documentation misses default values for some parameters.
I also changed the `spark.logit` regParam in the examples, as we discussed
in #16222.
## How was this patch tested?
Manual test
Author: [email protected] <[email protected]>
Closes #16284 from wangmiao1981/ks.
(cherry picked from commit 324388531648de20ee61bd42518a068d4789925c)
Signed-off-by: Felix Cheung <[email protected]>
commit cb2c8428df0607cfbb17a2c874f8228561a2e8ef
Author: Wenchen Fan <[email protected]>
Date: 2016-12-15T05:03:56Z
[SPARK-18856][SQL] non-empty partitioned table should not report zero size
## What changes were proposed in this pull request?
In `DataSource`, if the table is not analyzed, we will use 0 as the default
value for the table size. This is dangerous: we may broadcast a large table and
cause OOM. We should use `defaultSizeInBytes` instead.
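A minimal sketch of that rule (made-up names, not the actual `DataSource`
code):
```scala
// Fall back to the session's default size instead of 0 when no analyzed
// statistics exist, so an unanalyzed partitioned table is not wrongly
// considered small enough to broadcast.
def estimatedSizeInBytes(analyzedSize: Option[Long], defaultSizeInBytes: Long): Long =
  analyzedSize.getOrElse(defaultSizeInBytes)
```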
## How was this patch tested?
new regression test
Author: Wenchen Fan <[email protected]>
Closes #16280 from cloud-fan/bug.
(cherry picked from commit d6f11a12a146a863553c5a5e2023d79d4375ef3f)
Signed-off-by: Reynold Xin <[email protected]>
commit b14fc391893468e25de1e24d982d6f260cac59ad
Author: Reynold Xin <[email protected]>
Date: 2016-12-15T05:08:45Z
[SPARK-18869][SQL] Add TreeNode.p that returns BaseType
## What changes were proposed in this pull request?
After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_]
rather than a more specific type. Introducing a function that returns the
BaseType makes interactive debugging easier.
## How was this patch tested?
N/A - this is a developer only feature used for interactive debugging. As
long as it compiles, it should be good to go. I tested this in spark-shell.
Author: Reynold Xin <[email protected]>
Closes #16288 from rxin/SPARK-18869.
(cherry picked from commit 5d510c693aca8c3fd3364b4453160bc8585ffc8e)
Signed-off-by: Reynold Xin <[email protected]>
commit d399a297d1ec9e0a3c57658cba0320b4d7fe88c5
Author: Dongjoon Hyun <[email protected]>
Date: 2016-12-15T05:29:20Z
[SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding
`DESCRIPTION` file
## What changes were proposed in this pull request?
Since Apache Spark 1.4.0, the R API documentation page has had a broken link to
the `DESCRIPTION` file because the Jekyll plugin script doesn't copy the file.
This PR
aims to fix that.
- Official Latest Website:
http://spark.apache.org/docs/latest/api/R/index.html
- Apache Spark 2.1.0-rc2:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
## How was this patch tested?
Manual.
```bash
cd docs
SKIP_SCALADOC=1 jekyll build
```
Author: Dongjoon Hyun <[email protected]>
Closes #16292 from dongjoon-hyun/SPARK-18875.
(cherry picked from commit ec0eae486331c3977505d261676b77a33c334216)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 2a8de2e11ebab0cb9056444053127619d8a47d8a
Author: Felix Cheung <[email protected]>
Date: 2016-12-15T05:51:52Z
[SPARK-18849][ML][SPARKR][DOC] vignettes final check update
## What changes were proposed in this pull request?
doc cleanup
## How was this patch tested?
~~vignettes is not building for me. I'm going to kick off a full clean
build and try again and attach output here for review.~~
Output html here: https://felixcheung.github.io/sparkr-vignettes.html
Author: Felix Cheung <[email protected]>
Closes #16286 from felixcheung/rvignettespass.
(cherry picked from commit 7d858bc5ce870a28a559f4e81dcfc54cbd128cb7)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit e430915fad7ffb9397a96f0ef16e741c6b4f158b
Author: Tathagata Das <[email protected]>
Date: 2016-12-15T19:54:35Z
[SPARK-18870] Disallowed Distinct Aggregations on Streaming Datasets
## What changes were proposed in this pull request?
Check whether Aggregation operators on a streaming subplan have aggregate
expressions with isDistinct = true.
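An illustrative spark-shell style example of a query this check is meant to
reject (it assumes a running SparkSession named `spark`; host and port are
placeholders):
```scala
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
val agg = lines.selectExpr("count(DISTINCT value) AS distinct_values")
// Starting the streaming query is expected to fail analysis because the
// aggregate expression has isDistinct = true on a streaming Dataset:
// agg.writeStream.outputMode("complete").format("console").start()
```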
## How was this patch tested?
Added unit test
Author: Tathagata Das <[email protected]>
Closes #16289 from tdas/SPARK-18870.
(cherry picked from commit 4f7292c87512a7da3542998d0e5aa21c27a511e9)
Signed-off-by: Tathagata Das <[email protected]>
commit 900ce558a238fb9d8220527d8313646fe6830695
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-15T21:17:51Z
[SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource
## What changes were proposed in this pull request?
When starting a stream with a lot of backfill and maxFilesPerTrigger, the user
often wants to start with the most recent files first. This lets you keep
latency low for recent data while slowly backfilling historical data.
This PR adds a new option `latestFirst` to control this behavior. When it's
true, `FileStreamSource` will sort the files by the modified time from latest
to oldest, and take the first `maxFilesPerTrigger` files as a new batch.
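A rough sketch of the selection logic described above (made-up names, not the
actual `FileStreamSource` internals); the option itself would be set via
`.option("latestFirst", "true")` on the stream reader:
```scala
case class FileEntry(path: String, modTime: Long)

def selectBatch(
    candidates: Seq[FileEntry],
    latestFirst: Boolean,
    maxFilesPerTrigger: Int): Seq[FileEntry] = {
  val ordered =
    if (latestFirst) candidates.sortBy(f => -f.modTime) // newest files first
    else candidates.sortBy(_.modTime)                   // oldest files first (default)
  ordered.take(maxFilesPerTrigger)
}
```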
## How was this patch tested?
The added test.
Author: Shixiong Zhu <[email protected]>
Closes #16251 from zsxwing/newest-first.
(cherry picked from commit 68a6dc974b25e6eddef109f6fd23ae4e9775ceca)
Signed-off-by: Tathagata Das <[email protected]>
commit b6a81f4720752efe459860d28d7f8f738b2944c3
Author: Burak Yavuz <[email protected]>
Date: 2016-12-15T22:26:54Z
[SPARK-18888] partitionBy in DataStreamWriter in Python throws _to_seq not
defined
## What changes were proposed in this pull request?
`_to_seq` wasn't imported.
## How was this patch tested?
Added partitionBy to existing write path unit test
Author: Burak Yavuz <[email protected]>
Closes #16297 from brkyvz/SPARK-18888.
commit ef2ccf94224f00154cab7ab173d65442ecd389d7
Author: Patrick Wendell <[email protected]>
Date: 2016-12-15T22:46:00Z
Preparing Spark release v2.1.0-rc3
commit a7364a82eb0d18f92f1d8e46c1160a55bc250032
Author: Patrick Wendell <[email protected]>
Date: 2016-12-15T22:46:09Z
Preparing development version 2.1.1-SNAPSHOT
commit 08e4272872fc17c43f0dc79d329b946e8e85694d
Author: Burak Yavuz <[email protected]>
Date: 2016-12-15T23:46:03Z
[SPARK-18868][FLAKY-TEST] Deflake StreamingQueryListenerSuite: single
listener, check trigger...
## What changes were proposed in this pull request?
Use `recentProgress` instead of `lastProgress` and filter for the last non-zero
value. Also add `eventually` to the last `assertQuery`, similar to the first
`assertQuery`.
## How was this patch tested?
Ran test 1000 times
Author: Burak Yavuz <[email protected]>
Closes #16287 from brkyvz/SPARK-18868.
(cherry picked from commit 9c7f83b0289ba4550b156e6af31cf7c44580eb12)
Signed-off-by: Shixiong Zhu <[email protected]>
commit ae853e8f3bdbd16427e6f1ffade4f63abaf74abb
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-16T00:15:51Z
[MINOR] Only rename SparkR tar.gz if names mismatch
## What changes were proposed in this pull request?
For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g.,
2.1.0). Thus `cp` throws an error which causes the build to fail.
## How was this patch tested?
Manually by executing the following script
```
set -o pipefail
set -e
set -x
touch a
R_PACKAGE_VERSION=2.1.0
VERSION=2.1.0
if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
cp a a
fi
```
Author: Shivaram Venkataraman <[email protected]>
Closes #16299 from shivaram/sparkr-cp-fix.
(cherry picked from commit 9634018c4d6d5a4f2c909f7227d91e637107b7f4)
Signed-off-by: Reynold Xin <[email protected]>
commit ec31726581a43624fd47ce48f4e33d2a8e96c15c
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T00:18:20Z
Preparing Spark release v2.1.0-rc4
commit 62a6577bfa3a83783c813e74286e62b668e9af83
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T00:18:29Z
Preparing development version 2.1.1-SNAPSHOT
commit b23220fa67dd279d0b8005cb66d0875adbd3c8cb
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-16T01:13:35Z
[MINOR] Handle fact that mv is different on linux, mac
Follow up to
https://github.com/apache/spark/commit/ae853e8f3bdbd16427e6f1ffade4f63abaf74abb
as `mv` throws an error on the Jenkins machines if the source and destination
are the same.
Author: Shivaram Venkataraman <[email protected]>
Closes #16302 from shivaram/sparkr-no-mv-fix.
(cherry picked from commit 5a44f18a2a114bdd37b6714d81f88cb68148f0c9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit cd0a08361e2526519e7c131c42116bf56fa62c76
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T01:57:04Z
Preparing Spark release v2.1.0-rc5
commit 483624c2e13c8f239ee750bc149941b79800d0b0
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T01:57:11Z
Preparing development version 2.1.1-SNAPSHOT
commit d8548c8a7541bfa37761382edbb1892a145b2b71
Author: Reynold Xin <[email protected]>
Date: 2016-12-16T05:58:27Z
[SPARK-18892][SQL] Alias percentile_approx approx_percentile
## What changes were proposed in this pull request?
percentile_approx is the name used in Hive, and approx_percentile is the
name used in Presto. approx_percentile is actually more consistent with our
approx_count_distinct. Given that the cost of aliasing SQL functions is low (a
one-liner), it'd be better to just alias them so they are easier to use.
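An illustrative spark-shell usage (assumes a running SparkSession `spark` with
implicits imported, as in spark-shell; table and column names are made up);
after this change both spellings should resolve to the same aggregate:
```scala
Seq(1, 2, 3, 4, 5).toDF("value").createOrReplaceTempView("nums")
spark.sql("SELECT percentile_approx(value, 0.5) FROM nums").show()
spark.sql("SELECT approx_percentile(value, 0.5) FROM nums").show()
```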
## How was this patch tested?
Technically I could add an end-to-end test to verify this one-line change,
but it seemed too trivial to me.
Author: Reynold Xin <[email protected]>
Closes #16300 from rxin/SPARK-18892.
(cherry picked from commit 172a52f5d31337d90155feb7072381e8d5712288)
Signed-off-by: Reynold Xin <[email protected]>
commit a73201dafcf22756b8074a73e1b5da41cdf8b9a4
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-16T08:42:39Z
[SPARK-18850][SS] Make StreamExecution and progress classes serializable
## What changes were proposed in this pull request?
This PR adds StreamingQueryWrapper to make StreamExecution and the progress
classes serializable, because they are too easily captured in closures during
normal usage. If StreamingQueryWrapper gets captured in a closure but nothing
calls its methods, it should not fail the Spark tasks. However, if its methods
are called, this PR throws a better error message.
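A generic sketch of the wrapper idea (illustrative, not the actual
StreamingQueryWrapper): the handle serializes fine, but after deserialization
on an executor the transient field is null, so any method call produces a
clear error instead of an obscure task failure.
```scala
class SerializableHandle[T <: AnyRef](@transient private val underlying: T)
  extends Serializable {

  def get: T = {
    if (underlying == null) {
      throw new IllegalStateException(
        "This handle was captured in a closure and cannot be used on executors.")
    }
    underlying
  }
}
```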
## How was this patch tested?
`test("StreamingQuery should be Serializable but cannot be used in
executors")`
`test("progress classes should be Serializable")`
Author: Shixiong Zhu <[email protected]>
Closes #16272 from zsxwing/SPARK-18850.
(cherry picked from commit d7f3058e17571d76a8b4c8932de6de81ce8d2e78)
Signed-off-by: Tathagata Das <[email protected]>
commit d8ef0be83d8d032ddab79b465226ed3ff3d1eff7
Author: Takeshi YAMAMURO <[email protected]>
Date: 2016-12-16T14:44:42Z
[SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet
reader fail to read data
## What changes were proposed in this pull request?
A vectorized parquet reader fails to read column data if the data schema and
partition schema overlap with each other and the inferred types in the
partition schema differ from those in the data schema. Example code to
reproduce this bug is as follows:
```
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
scala> val df = spark.read.parquet("/data/")
scala> df.printSchema
root
|-- a: long (nullable = true)
|-- b: integer (nullable = true)
scala> df.collect
java.lang.NullPointerException
at
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
at
org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
The root cause is that the logical layer (`HadoopFsRelation`) and the physical
layer (`VectorizedParquetRecordReader`) make different assumptions about the
partition schema: the logical layer trusts the data schema to infer the types
of the overlapping partition columns, while the physical layer trusts the
partition schema, which is inferred from the path string. To fix this bug,
this PR simply updates `HadoopFsRelation.schema` to respect the partition
columns' position in the data schema and their type in the partition schema.
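A rough sketch of that merging rule (illustrative only, not the actual
`HadoopFsRelation` code): overlapping columns keep their position from the
data schema but take their type from the partition schema, and the remaining
partition columns are appended at the end.
```scala
import org.apache.spark.sql.types.{StructField, StructType}

def mergeSchemas(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val partitionTypes = partitionSchema.fields.map(f => f.name -> f.dataType).toMap
  // Overlapping columns: keep data-schema position, take partition-schema type.
  val overlapped = dataSchema.fields.map { f =>
    f.copy(dataType = partitionTypes.getOrElse(f.name, f.dataType))
  }
  // Partition-only columns are appended after the data columns.
  val appended = partitionSchema.fields.filterNot(f => dataSchema.fieldNames.contains(f.name))
  StructType(overlapped ++ appended)
}
```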
## How was this patch tested?
Add tests in `ParquetPartitionDiscoverySuite`
Author: Takeshi YAMAMURO <[email protected]>
Closes #16030 from maropu/SPARK-18108.
(cherry picked from commit dc2a4d4ad478fdb0486cc0515d4fe8b402d24db4)
Signed-off-by: Wenchen Fan <[email protected]>
----