GitHub user Charele opened a pull request:
https://github.com/apache/spark/pull/23107
small question in Spillable class
Sorry for my English; I just want to describe my question here. I think
there should be an "Issues" button for this.
In org.apache.spark.util.collection.Spillable,
code:
private[this] var _elementsRead = 0
... ...
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
The default value of numElementsForceSpillThreshold is Integer.MAX_VALUE;
however, _elementsRead is an Int. I think _elementsRead should be a Long,
shouldn't it?
private[this] var _elementsRead: Long = 0
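For context, here is a small standalone Scala sketch (illustrative only, not Spark code) of the concern behind the question: an Int counter can never exceed Integer.MAX_VALUE, so with the default threshold the comparison above can never become true; instead the counter silently wraps around.
```scala
object IntCounterOverflow {
  def main(args: Array[String]): Unit = {
    val threshold: Long = Int.MaxValue   // stands in for the default numElementsForceSpillThreshold
    var elementsRead: Int = Int.MaxValue // an Int counter at its maximum
    elementsRead += 1                    // wraps around to Int.MinValue
    println(elementsRead)                // -2147483648
    println(elementsRead > threshold)    // false: the force-spill check can never fire

    var elementsReadLong: Long = Int.MaxValue.toLong // the proposed Long counter
    elementsReadLong += 1
    println(elementsReadLong > threshold)            // true once the threshold is crossed
  }
}
```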
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.4
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23107.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23107
----
commit 872bad161f1dbe6acd89b75f60053bfc8b621687
Author: Dilip Biswal <dbiswal@...>
Date: 2018-09-07T06:35:02Z
[SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation in the test cases
of sql/core and sql/hive
## What changes were proposed in this pull request?
In SharedSparkSession and TestHive, we need to disable the rule
ConvertToLocalRelation for better test case coverage.
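As a hedged illustration (not the actual test-harness change in this PR), one way to disable a single optimizer rule for a session is the `spark.sql.optimizer.excludedRules` configuration available in 2.4; the app name below is made up:
```scala
import org.apache.spark.sql.SparkSession

// Build a session with the ConvertToLocalRelation rule excluded from the optimizer.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("exclude-convert-to-local-relation")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
  .getOrCreate()

// Queries over local data now keep their full plans instead of being collapsed
// into a LocalRelation, which exercises more of the planner in tests.
spark.range(3).selectExpr("id * 2 AS doubled").explain()
```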
## How was this patch tested?
Identify the failures after excluding "ConvertToLocalRelation" rule.
Closes #22270 from dilipbiswal/SPARK-25267-final.
Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 6d7bc5af454341f6d9bfc1e903148ad7ba8de6f9)
Signed-off-by: gatorsmile <[email protected]>
commit 95a48b909d103e59602e883d472cb03c7c434168
Author: fjh100456 <fu.jinhua6@...>
Date: 2018-09-07T16:28:33Z
[SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS
## What changes were proposed in this pull request?
Before Apache Spark 2.3, table properties were ignored when writing data to
a Hive table (created with the STORED AS PARQUET/ORC syntax), because the
compression configurations were not passed to the FileFormatWriter through
hadoopConf. That was fixed in #20087. However, for CTAS with the USING
PARQUET/ORC syntax, table properties were still ignored when convertMetastore
was in effect, so the test case for CTAS was not supported.
Now that it has been fixed in #20522, the test case should be enabled too.
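As a hedged sketch of the kind of case the re-enabled tests cover (the table name, codec and option names below are illustrative, not taken from the PR):
```scala
// A CTAS with USING PARQUET and an explicit compression option; the re-enabled
// tests verify that such compression settings are honored rather than ignored.
spark.sql("SET spark.sql.parquet.compression.codec=snappy")
spark.sql(
  """CREATE TABLE ctas_compressed
    |USING PARQUET
    |OPTIONS (compression 'gzip')
    |AS SELECT 1 AS id""".stripMargin)
spark.table("ctas_compressed").show()
```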
## How was this patch tested?
This only re-enables the test cases of previous PR.
Closes #22302 from fjh100456/compressionCodec.
Authored-by: fjh100456 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 473f2fb3bfd0e51c40a87e475392f2e2c8f912dd)
Signed-off-by: Dongjoon Hyun <[email protected]>
commit 80567fad4e3d8d4573d4095b1e460452e597d81f
Author: Lee Dongjin <dongjin@...>
Date: 2018-09-07T17:36:15Z
[MINOR][SS] Fix kafka-0-10-sql trivials
## What changes were proposed in this pull request?
Fix unused imports & outdated comments on `kafka-0-10-sql` module. (Found
while I was working on
[SPARK-23539](https://github.com/apache/spark/pull/22282))
## How was this patch tested?
Existing unit tests.
Closes #22342 from dongjinleekr/feature/fix-kafka-sql-trivials.
Authored-by: Lee Dongjin <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 458f5011bd52851632c3592ac35f1573bc904d50)
Signed-off-by: Sean Owen <[email protected]>
commit 904192ad18ff09cc5874e09b03447dd5f7754963
Author: WeichenXu <weichen.xu@...>
Date: 2018-09-08T16:09:14Z
[SPARK-25345][ML] Deprecate public APIs from ImageSchema
## What changes were proposed in this pull request?
Deprecate public APIs from ImageSchema.
## How was this patch tested?
N/A
Closes #22349 from WeichenXu123/image_api_deprecate.
Authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
(cherry picked from commit 08c02e637ac601df2fe890b8b5a7a049bdb4541b)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 8f7d8a0977647dc96ab9259d306555bbe1c32873
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-09-08T17:21:55Z
[SPARK-25375][SQL][TEST] Reenable qualified perm. function checks in
UDFSuite
## What changes were proposed in this pull request?
In Spark 2.0.0, SPARK-14335 added some [commented-out test
coverage](https://github.com/apache/spark/pull/12117/files#diff-dd4b39a56fac28b1ced6184453a47358R177).
This PR enables those tests because the feature has been supported since 2.0.0.
## How was this patch tested?
Pass the Jenkins with re-enabled test coverage.
Closes #22363 from dongjoon-hyun/SPARK-25375.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 26f74b7cb16869079aa7b60577ac05707101ee68)
Signed-off-by: gatorsmile <[email protected]>
commit a00a160e1e63ef2aaf3eaeebf2a3e5a5eb05d076
Author: gatorsmile <gatorsmile@...>
Date: 2018-09-09T13:25:19Z
Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-25317]
## What changes were proposed in this pull request?
When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw
a performance regression of more than 10% on the following queries: q67, q24a
and q24b. After applying the PR https://github.com/apache/spark/pull/22338, the
performance regression still exists. When the changes in
https://github.com/apache/spark/pull/19222 were reverted, npoggi and
winglungngai found that the performance regression was resolved. Thus, this PR
reverts the related changes to unblock the 2.4 release.
In a future release, we can continue the investigation and find the root cause
of the regression.
## How was this patch tested?
The existing test cases
Closes #22361 from gatorsmile/revertMemoryBlock.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0b9ccd55c2986957863dcad3b44ce80403eecfa1)
Signed-off-by: Wenchen Fan <[email protected]>
commit 6b7ea78aec73b8f24c2e1161254edd5ebb6c82bf
Author: WeichenXu <weichen.xu@...>
Date: 2018-09-09T14:49:13Z
[MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeasure` method
## What changes were proposed in this pull request?
Remove `BisectingKMeansModel.setDistanceMeasure` method.
Setting this param on `BisectingKMeansModel` is meaningless.
## How was this patch tested?
N/A
Closes #22360 from WeichenXu123/bkmeans_update.
Authored-by: WeichenXu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 88a930dfab56c15df02c7bb944444745c2921fa5)
Signed-off-by: Sean Owen <[email protected]>
commit c1c1bda3cecd82a926526e5e5ee24d9909cb7e49
Author: Yuming Wang <yumwang@...>
Date: 2018-09-09T16:07:31Z
[SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result
## What changes were proposed in this pull request?
How to reproduce:
```scala
val df1 = spark.createDataFrame(Seq(
  (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id()).filter($"c".isNotNull)
df2.show
df2.show
+---+---+----+---+
| a| b| c| d|
+---+---+----+---+
| 1| 1|null| 0|
| 1| 1|null| 1|
+---+---+----+---+
```
`filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before
https://github.com/apache/spark/pull/19201, but it is transformed to `(c#10 =
null)` since https://github.com/apache/spark/pull/20155. This PR reverts it to
`(null <=> c#10)` to fix this issue.
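For reference, a small spark-shell sketch (assuming an existing `spark` session) of why the two rewrites differ: `=` against a null operand yields null, which a filter treats as false, while `<=>` is null-safe and returns a definite true/false:
```scala
import org.apache.spark.sql.functions.lit

val df = spark.range(1).withColumn("c", lit(null).cast("int"))
df.select(
  (df("c") === lit(null)).alias("equals_null"),      // NULL: comparison with null is unknown
  (df("c") <=> lit(null)).alias("null_safe_equals")  // true: both sides are null
).show()
```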
## How was this patch tested?
unit tests
Closes #22368 from wangyum/SPARK-25368.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 77c996403d5c761f0dfea64c5b1cb7480ba1d3ac)
Signed-off-by: gatorsmile <[email protected]>
commit 0782dfa14c524131c04320e26d2b607777fe3b06
Author: seancxmao <seancxmao@...>
Date: 2018-09-10T02:22:47Z
[SPARK-25175][SQL] Field resolution should fail if there is ambiguity for
ORC native data source table persisted in metastore
## What changes were proposed in this pull request?
Apache Spark doesn't create Hive tables with duplicated fields in either
case-sensitive or case-insensitive mode. However, Spark can first create ORC
files in case-sensitive mode (with fields that differ only in case) and then
create a Hive table on that location. In this situation, field resolution
should fail in case-insensitive mode; otherwise, we don't know which columns
will be returned or filtered.
Previously, SPARK-25132 fixed the same issue for Parquet.
Here is a simple example:
```
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")
sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION
'/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
| A|
+---+
| 3|
| 2|
| 4|
| 1|
| 0|
+---+
```
See #22148 for more details about parquet data source reader.
## How was this patch tested?
Unit tests added.
Closes #22262 from seancxmao/SPARK-25175.
Authored-by: seancxmao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a0aed475c54079665a8e5c5cd53a2e990a4f47b4)
Signed-off-by: Dongjoon Hyun <[email protected]>
commit c9ca3594345610148ef5d993262d3090d5b2c658
Author: Yuming Wang <yumwang@...>
Date: 2018-09-10T05:47:19Z
[SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema in
Parquet issue
## What changes were proposed in this pull request?
How to reproduce:
```scala
spark.sql("CREATE TABLE tbl(id long)")
spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet' " +
"STORED AS PARQUET SELECT ID FROM view1")
spark.read.parquet("/tmp/spark/parquet").schema
scala> spark.read.parquet("/tmp/spark/parquet").schema
res10: org.apache.spark.sql.types.StructType =
StructType(StructField(id,LongType,true))
```
The schema should be `StructType(StructField(ID,LongType,true))` as we
`SELECT ID FROM view1`.
This PR fixes this issue.
## How was this patch tested?
unit tests
Closes #22359 from wangyum/SPARK-25313-FOLLOW-UP.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f8b4d5aafd1923d9524415601469f8749b3d0811)
Signed-off-by: Wenchen Fan <[email protected]>
commit 67bc7ef7b70b6b654433bd5e56cff2f5ec6ae9bd
Author: gatorsmile <gatorsmile@...>
Date: 2018-09-10T11:18:00Z
[SPARK-24849][SPARK-24911][SQL][FOLLOW-UP] Converting a value of StructType
to a DDL string
## What changes were proposed in this pull request?
Add the version number for the new APIs.
## How was this patch tested?
N/A
Closes #22377 from gatorsmile/followup24849.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6f6517837ba9934a280b11aba9d9be58bc131f25)
Signed-off-by: Wenchen Fan <[email protected]>
commit 5d98c31941471bdcdc54a68f55ddaaab48f82161
Author: Marco Gaido <marcogaido91@...>
Date: 2018-09-10T11:41:51Z
[SPARK-25278][SQL] Avoid duplicated Exec nodes when the same logical plan
appears in the query
## What changes were proposed in this pull request?
In the Planner, we collect the placeholders which need to be substituted in
the query execution plan and, once we plan them, we substitute each placeholder
with the effective plan.
In this second phase, we rely on the `==` comparison, i.e. the `equals`
method. This means that if two placeholder plans, which are different
instances, have the same attributes (so that they are equal according to the
`equals` method), they are both substituted with the same new physical plan:
the first time we substitute both of them with the first of the two newly
generated plans, and the second time we substitute nothing.
This is usually of no harm for the execution of the query itself, as the two
plans are identical. But since they are now the same instance, the local
variables are shared (which is unexpected). This causes issues for the
collected metrics, as the same node is executed twice, so the metrics are
wrongly accumulated twice.
The PR proposes to use the `eq` method when checking which placeholder needs
to be substituted; thus, in the situation above, both of the physical nodes
which are created (one for each time the logical plan appears in the query
plan) are used, and the metrics are collected properly for each of them.
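A minimal Scala illustration of the distinction the fix relies on (the case class below is a stand-in, not a Spark class): `==`/`equals` is structural equality, while `eq` is reference equality:
```scala
// Two structurally equal placeholders that are nevertheless distinct instances.
case class Placeholder(attributes: Seq[String])

val p1 = Placeholder(Seq("a", "b"))
val p2 = Placeholder(Seq("a", "b"))

println(p1 == p2)  // true: equal attributes, so a lookup by == matches both instances
println(p1 eq p2)  // false: different instances, so a lookup by eq matches exactly one
```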
## How was this patch tested?
added UT
Closes #22284 from mgaido91/SPARK-25278.
Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56)
Signed-off-by: Wenchen Fan <[email protected]>
commit ffd036a6d13814ebcc332990be1e286939cc6abe
Author: Holden Karau <holden@...>
Date: 2018-09-10T18:01:51Z
[SPARK-23672][PYTHON] Document support for nested return types in scalar
with arrow udfs
## What changes were proposed in this pull request?
Clarify docstring for Scalar functions
## How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit
test for an array of floats as well.
Closes #20908 from
holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.
Authored-by: Holden Karau <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit da5685b5bb9ee7daaeb4e8f99c488ebd50c7aac3)
Signed-off-by: Bryan Cutler <[email protected]>
commit fb4965a41941f3a196de77a870a8a1f29c96dac0
Author: Marco Gaido <marcogaido91@...>
Date: 2018-09-11T06:16:56Z
[SPARK-25371][SQL] struct() should allow being called with 0 args
## What changes were proposed in this pull request?
SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be
non-empty. This means that `struct()`, which was previously considered valid,
now throws an Exception. This behavior change was introduced in 2.3.0. The
change may break users' applications on upgrade, and it causes `VectorAssembler`
to fail when an empty `inputCols` is defined.
The PR removes the added check making `struct()` valid again.
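A hedged spark-shell example of the behavior being restored (assuming an existing `spark` session):
```scala
import org.apache.spark.sql.functions.struct

// With the non-empty check removed, struct() with no arguments yields an
// empty struct column instead of throwing an exception at analysis time.
val df = spark.range(1).select(struct().alias("empty_struct"))
df.printSchema()
```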
## How was this patch tested?
added UT
Closes #22373 from mgaido91/SPARK-25371.
Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0736e72a66735664b191fc363f54e3c522697dba)
Signed-off-by: Wenchen Fan <[email protected]>
commit b7efca7ece484ee85091b1b50bbc84ad779f9bfe
Author: Mario Molina <mmolimar@...>
Date: 2018-09-11T12:47:14Z
[SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as
null when nullValue is set.
## What changes were proposed in this pull request?
In the PR, I propose a new CSV option, `emptyValue`, and an update to the SQL
Migration Guide which describes how to revert to the previous behavior, when
empty strings were not written at all. Since Spark 2.4, empty strings are saved
as `""` to distinguish them from saved `null`s.
Closes #22234
Closes #22367
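A hedged spark-shell sketch of the option described above (the output path is illustrative):
```scala
import spark.implicits._

val df = Seq(("x", ""), ("y", null)).toDF("key", "value")
// Since 2.4, the empty string is written as "" so it stays distinguishable from null.
df.write.mode("overwrite").csv("/tmp/csv-empty-value")
// The emptyValue read option controls which string is read back as an empty value.
spark.read.option("emptyValue", "").csv("/tmp/csv-empty-value").show()
```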
## How was this patch tested?
It was tested by `CSVSuite` and new tests added in the PR #22234
Closes #22389 from MaxGekk/csv-empty-value-master.
Lead-authored-by: Mario Molina <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276)
Signed-off-by: hyukjinkwon <[email protected]>
commit 0b8bfbe12b8a368836d7ddc8445de18b7ee42cde
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-09-11T15:57:42Z
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent
duplicate fields
## What changes were proposed in this pull request?
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY
STORED AS` should not generate files with duplicate fields because Spark cannot
read those files back.
**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet
SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when
inserting into file:/tmp/parquet: `id`;
```
**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS
parquet SELECT 'id', 'id2' id")
// It generates corrupted files
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data
schema and the partition schema: `id`;
```
## How was this patch tested?
Pass the Jenkins with newly added test cases.
Closes #22378 from dongjoon-hyun/SPARK-25389.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e)
Signed-off-by: Dongjoon Hyun <[email protected]>
commit 4414e026097c74aadd252b541c9d3009cd7e9d09
Author: Gera Shegalov <gera@...>
Date: 2018-09-11T16:28:32Z
[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf
values
## What changes were proposed in this pull request?
Stop trimming values of properties loaded from a file
## How was this patch tested?
Added unit test demonstrating the issue hit in production.
Closes #22213 from gerashegalov/gera/SPARK-25221.
Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 16127e844f8334e1152b2e3ed3d878ec8de13dfa
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-09-11T17:31:06Z
[SPARK-24889][CORE] Update block info when unpersist rdds
## What changes were proposed in this pull request?
We update block info coming from executors at certain times, such as when
caching an RDD. However, when removing RDDs by unpersisting them, we don't ask
for the block info to be updated, so the block info becomes stale.
We can fix this in a few ways:
1. Ask to update block info when unpersisting.
This is the simplest option, but it changes driver-executor communication a bit.
2. Update block info when processing the event of unpersisting an RDD.
We send a `SparkListenerUnpersistRDD` event when unpersisting an RDD. When
processing this event, we can update the block info of the RDD. This only
changes event-processing code, so the risk seems lower.
This patch currently takes option 2 for lower risk. If we agree that the first
option carries no risk, we can change to it.
## How was this patch tested?
Unit tests.
Closes #22341 from viirya/SPARK-24889.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 14f3ad20932535fe952428bf255e7eddd8fa1b58)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 99b37a91871f8bf070d43080f1c58475548c99fd
Author: Sean Owen <sean.owen@...>
Date: 2018-09-11T19:46:03Z
[SPARK-25398] Minor bugs from comparing unrelated types
## What changes were proposed in this pull request?
Correct some comparisons between unrelated types to what they seem to have
been trying to do.
## How was this patch tested?
Existing tests.
Closes #22384 from srowen/SPARK-25398.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit cfbdd6a1f5906b848c520d3365cc4034992215d9)
Signed-off-by: Sean Owen <[email protected]>
commit 3a6ef8b7e2d17fe22458bfd249f45b5a5ce269ec
Author: Sean Owen <sean.owen@...>
Date: 2018-09-11T19:52:58Z
Revert "[SPARK-23820][CORE] Enable use of long form of callsite in logs"
This reverts commit e58dadb77ed6cac3e1b2a037a6449e5a6e7f2cec.
commit 0dbf1450f7965c27ce9329c7dad351ff8b8072dc
Author: Mukul Murthy <mukul.murthy@...>
Date: 2018-09-11T22:53:15Z
[SPARK-25399][SS] Continuous processing state should not affect microbatch
execution jobs
## What changes were proposed in this pull request?
The leftover state from running a continuous processing streaming job
should not affect later microbatch execution jobs. If a continuous processing
job runs and the same thread gets reused for a microbatch execution job in the
same environment, the microbatch job could get wrong answers because it can
attempt to load the wrong version of the state.
## How was this patch tested?
New and existing unit tests
Closes #22386 from mukulmurthy/25399-streamthread.
Authored-by: Mukul Murthy <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>
(cherry picked from commit 9f5c5b4cca7d4eaa30a3f8adb4cb1eebe3f77c7a)
Signed-off-by: Tathagata Das <[email protected]>
commit 40e4db0eb72be7640bd8b5b319ad4ba99c9dc846
Author: gatorsmile <gatorsmile@...>
Date: 2018-09-12T13:11:22Z
[SPARK-25402][SQL] Null handling in BooleanSimplification
## What changes were proposed in this pull request?
This PR fixes the null handling in BooleanSimplification. In that rule, there
are two cases that do not properly handle null values: the optimization is not
correct if either side is null.
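A small SQL illustration (run through spark.sql, not the optimizer rule itself) of why nullability matters for such rewrites: for a nullable boolean `b`, `b AND NOT b` evaluates to NULL rather than false when `b` is NULL:
```scala
spark.sql(
  """SELECT b,
    |       b AND NOT b AS and_not_b,
    |       b OR NOT b  AS or_not_b
    |FROM VALUES (true), (false), (CAST(NULL AS BOOLEAN)) AS t(b)""".stripMargin
).show()
// The NULL row yields NULL for both derived columns, so simplifying them to
// constant false/true is only correct for non-nullable input.
```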
## How was this patch tested?
Added test cases
Closes #22390 from gatorsmile/fixBooleanSimplification.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 79cc59718fdf7785bdc37a26bb8df4c6151114a6)
Signed-off-by: Wenchen Fan <[email protected]>
commit 071babbab5a49b7106d61b0c9a18672bd67e1786
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-09-12T14:54:05Z
[SPARK-25352][SQL] Perform ordered global limit when limit number is bigger
than topKSortFallbackThreshold
## What changes were proposed in this pull request?
We have an optimization on global limit that evenly distributes the limit rows
across all partitions. This optimization doesn't work for ordered results.
A query ending with sort + limit is, in most cases, performed by
`TakeOrderedAndProjectExec`.
But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`,
a global limit will be used, and in that case we need to perform an ordered
global limit.
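A hedged sketch of the scenario (the config key is assumed from `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`):
```scala
// Lower the threshold so that a modest limit already exceeds it; the sort + limit
// is then planned as Sort + GlobalLimit rather than TakeOrderedAndProjectExec,
// and the global limit must still respect the ordering.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "100")

val df = spark.range(0, 1000).toDF("id")
df.orderBy(df("id").desc).limit(200).explain()
```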
## How was this patch tested?
Unit tests.
Closes #22344 from viirya/SPARK-25352.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2f422398b524eacc89ab58e423bb134ae3ca3941)
Signed-off-by: Wenchen Fan <[email protected]>
commit 4c1428fa2b29c371458977427561d2b4bb9daa5b
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-09-12T17:43:40Z
[SPARK-25363][SQL] Fix schema pruning in where clause by ignoring
unnecessary root fields
## What changes were proposed in this pull request?
Schema pruning doesn't work if a nested column is used in the where clause.
For example,
```
sql("select name.first from contacts where name.first = 'David'")
== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
+- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet,
PartitionFilters: [],
PushedFilters: [IsNotNull(name)], ReadSchema:
struct<name:struct<first:string,middle:string,last:string>>
```
In the above query plan, the scan node reads the entire schema of the `name` column.
This issue was reported at:
https://github.com/apache/spark/pull/21320#issuecomment-419290197
The cause is that we infer a root field from the expression `IsNotNull(name)`.
However, for such an expression we don't really use the nested fields of this
root field, so we can ignore the unnecessary nested fields.
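As a hedged follow-up sketch (assuming the nested schema pruning flag available in 2.4 and the `contacts` table from the example above):
```scala
// Enable nested schema pruning and re-check the plan: with the fix, the
// IsNotNull(name) filter no longer forces the whole struct to be read.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
sql("select name.first from contacts where name.first = 'David'").explain()
// Expected ReadSchema after the fix: struct<name:struct<first:string>>
```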
## How was this patch tested?
Unit tests.
Closes #22357 from viirya/SPARK-25363.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 3030b82c89d3e45a2e361c469fbc667a1e43b854)
Signed-off-by: DB Tsai <[email protected]>
commit 15d2e9d7d2f0d5ecefd69bdc3f8a149670b05e79
Author: Wenchen Fan <wenchen@...>
Date: 2018-09-12T18:25:24Z
[SPARK-24882][SQL] Revert [] improve data source v2 API from branch 2.4
## What changes were proposed in this pull request?
As discussed on the dev list, we don't want to include
https://github.com/apache/spark/pull/22009 in Spark 2.4, as it requires data
source v2 users to change their implementations intensively, and they would
need to change them again in the next release.
## How was this patch tested?
existing tests
Author: Wenchen Fan <[email protected]>
Closes #22388 from cloud-fan/revert.
commit 71f70130f1b2b4ec70595627f0a02a88e2c0e27d
Author: Michael Mior <mmior@...>
Date: 2018-09-13T01:45:25Z
[SPARK-23820][CORE] Enable use of long form of callsite in logs
This is a rework of #21433 to address some concerns there.
Closes #22398 from michaelmior/long-callsite2.
Authored-by: Michael Mior <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ab25c967905ca0973fc2f30b8523246bb9244206)
Signed-off-by: Wenchen Fan <[email protected]>
commit 776dc42c1326764233a4466172330b74b98df7aa
Author: Maxim Gekk <max.gekk@...>
Date: 2018-09-13T01:51:49Z
[SPARK-25387][SQL] Fix for NPE caused by bad CSV input
## What changes were proposed in this pull request?
The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In
some cases, the `uniVocity` parser can return `null` for bad input. In the PR,
I propose to check the result of parsing and not propagate the NPE to upper layers.
## How was this patch tested?
I added a test which reproduce the issue and tested by `CSVSuite`.
Closes #22374 from MaxGekk/npe-on-bad-csv.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 083c9447671719e0bd67312e3d572f6160c06a4a)
Signed-off-by: Wenchen Fan <[email protected]>
commit 6f4d647e07ef527ef93c4fc849a478008a52bc80
Author: LantaoJin <jinlantao@...>
Date: 2018-09-13T01:57:34Z
[SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more information
like file path to event log
## What changes were proposed in this pull request?
The metadata field was removed from SparkPlanInfo in #18600. Correspondingly,
a lot of metadata was also removed from the SparkListenerSQLExecutionStart
event in the Spark event log. If we want to analyze the event log to get all
input paths, we cannot get them; the simpleString of the SparkPlanInfo JSON
only displays 100 characters, which doesn't help.
Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log
looked like the sample below (it contains the metadata field, which holds the
complete information):
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
"metadata": {"Location":
"InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"}
After #18600, the metadata field was removed.
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
So I add this field back to the SparkPlanInfo class so that the metadata is
logged to the event log again. Complete information in the event log is very
useful for offline job analysis.
## How was this patch tested?
Unit test
Closes #22353 from LantaoJin/SPARK-25357.
Authored-by: LantaoJin <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6dc5921e66d56885b95c07e56e687f9f6c1eaca7)
Signed-off-by: Wenchen Fan <[email protected]>
commit ae5c7bb204c52dd18cfb63e5c621537023e36539
Author: Sean Owen <sean.owen@...>
Date: 2018-09-13T03:19:43Z
[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
(This change is a subset of the changes needed for the JIRA; see
https://github.com/apache/spark/pull/22231)
## What changes were proposed in this pull request?
Use raw strings and simpler regex syntax consistently in Python, which also
avoids warnings from pycodestyle about accidentally relying on Python's
non-escaping of non-reserved chars in normal strings. Also, fix a few long
lines.
## How was this patch tested?
Existing tests, and some manual double-checking of the behavior of regexes
in Python 2/3 to be sure.
Closes #22400 from srowen/SPARK-25238.2.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit 08c76b5d39127ae207d9d1fff99c2551e6ce2581)
Signed-off-by: hyukjinkwon <[email protected]>
commit abb5196c7ef685e1027eb1b0b09f4559d3eba015
Author: Stavros Kontopoulos <stavros.kontopoulos@...>
Date: 2018-09-13T05:02:59Z
[SPARK-25295][K8S] Fix executor names collision
## What changes were proposed in this pull request?
Fixes the collision issue with Spark executor names in client mode; see
SPARK-25295 for the details.
It follows the cluster naming convention: the app name is used as the prefix,
and if it is not defined we use "spark" as the default prefix. E.g.
`spark-pi-1536781360723-exec-1`, where spark-pi is the name of the app passed
on the config side, or transformed if it contains illegal characters.
It also fixes the issue with the Spark app name containing spaces in cluster mode.
If you run the Spark Pi test in client mode it passes.
The tricky part is that the user may set the app name:
https://github.com/apache/spark/blob/3030b82c89d3e45a2e361c469fbc667a1e43b854/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L30
If I do:
```
./bin/spark-submit
...
--deploy-mode cluster --name "spark pi"
...
```
it will fail, as the app name is used as the prefix of the driver's pod name,
which cannot contain spaces (according to k8s conventions).
## How was this patch tested?
Manually, by running a Spark job in client mode.
To reproduce, do:
```
kubectl create -f service.yaml
kubectl create -f pod.yaml
```
service.yaml :
```
kind: Service
apiVersion: v1
metadata:
  name: spark-test-app-1-svc
spec:
  clusterIP: None
  selector:
    spark-app-selector: spark-test-app-1
  ports:
  - protocol: TCP
    name: driver-port
    port: 7077
    targetPort: 7077
  - protocol: TCP
    name: block-manager
    port: 10000
    targetPort: 10000
```
pod.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  name: spark-test-app-1
  labels:
    spark-app-selector: spark-test-app-1
spec:
  containers:
  - name: spark-test
    image: skonto/spark:k8s-client-fix
    imagePullPolicy: Always
    command:
    - 'sh'
    - '-c'
    - "/opt/spark/bin/spark-submit
      --verbose
      --master k8s://https://kubernetes.default.svc
      --deploy-mode client
      --class org.apache.spark.examples.SparkPi
      --conf spark.app.name=spark
      --conf spark.executor.instances=1
      --conf spark.kubernetes.container.image=skonto/spark:k8s-client-fix
      --conf spark.kubernetes.container.image.pullPolicy=Always
      --conf spark.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
      --conf spark.kubernetes.authenticate.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      --conf spark.executor.memory=500m
      --conf spark.executor.cores=1
      --conf spark.executor.instances=1
      --conf spark.driver.host=spark-test-app-1-svc.default.svc
      --conf spark.driver.port=7077
      --conf spark.driver.blockManager.port=10000
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 1000000"
```
Closes #22405 from skonto/fix-k8s-client-mode-executor-names.
Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Yinan Li <[email protected]>
(cherry picked from commit 3e75a9fa24f8629d068b5fbbc7356ce2603fa58d)
Signed-off-by: Yinan Li <[email protected]>
----