GitHub user connCN opened a pull request:
https://github.com/apache/spark/pull/19400
How to use REST api to end sparksql job Or use other way ?
In the REST API doc, I can't find the API for kill a job. Sometimes i
want to kill a dead or slow sparksql job.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19400.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19400
----
commit 3f82d65bf6a628b0d46bb2eded9ed12f1d5aa9d2
Author: liuxian <[email protected]>
Date: 2017-05-25T00:32:02Z
[SPARK-20403][SQL] Modify the instructions of some functions
## What changes were proposed in this pull request?
1. add instructions of 'cast' function When using 'show functions'
and 'desc function cast'
command in spark-sql
2. Modify the instructions of functionsï¼such as
booleanï¼tinyintï¼smallintï¼intï¼bigintï¼floatï¼doubleï¼decimalï¼dateï¼timestampï¼binaryï¼string
## How was this patch tested?
Before modificationï¼
spark-sql>desc function boolean;
Function: boolean
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: boolean(expr AS type) - Casts the value `expr` to the target data
type `type`.
After modificationï¼
spark-sql> desc function boolean;
Function: boolean
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: boolean(expr) - Casts the value `expr` to the target data type
`boolean`.
spark-sql> desc function cast
Function: cast
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: cast(expr AS type) - Casts the value `expr` to the target data type
`type`.
Author: liuxian <[email protected]>
Closes #17698 from 10110346/wip_lx_0418.
(cherry picked from commit 197f9018a4641c8fc0725905ebfb535b61bed791)
Signed-off-by: Xiao Li <[email protected]>
commit e0aa23939a4cbf95f2cc83a7f5adee841b491358
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-05-25T01:55:45Z
[SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files
## What changes were proposed in this pull request?
This is a follow-up to #18073. Taking a safer approach to shutdown the pool
to prevent possible issue. Also using `ThreadUtils.newForkJoinPool` instead to
set a better thread name.
## How was this patch tested?
Manually test.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18100 from viirya/SPARK-20848-followup.
(cherry picked from commit 6b68d61cf31748a088778dfdd66491b2f89a3c7b)
Signed-off-by: Wenchen Fan <[email protected]>
commit b52a06d7034b3d392f7f0ee69a2fba098783e70d
Author: Xianyang Liu <[email protected]>
Date: 2017-05-25T07:47:59Z
[SPARK-20250][CORE] Improper OOM error when a task been killed while
spilling data
## What changes were proposed in this pull request?
Currently, when a task is calling spill() but it receives a killing request
from driver (e.g., speculative task), the `TaskMemoryManager` will throw an
`OOM` exception. And we don't catch `Fatal` exception when a error caused by
`Thread.interrupt`. So for `ClosedByInterruptException`, we should throw
`RuntimeException` instead of `OutOfMemoryError`.
https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK
## How was this patch tested?
Existing unit tests.
Author: Xianyang Liu <[email protected]>
Closes #18090 from ConeyLiu/SPARK-20250.
(cherry picked from commit 731462a04f8e33ac507ad19b4270c783a012a33e)
Signed-off-by: Wenchen Fan <[email protected]>
commit 8896c4ee9ea315a7dcd1a05b7201e7ad0539a5ed
Author: jinxing <[email protected]>
Date: 2017-05-25T08:11:30Z
[SPARK-19659] Fetch big blocks to disk when shuffle-read.
## What changes were proposed in this pull request?
Currently the whole block is fetched into memory(off heap by default) when
shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can
be large when skew situations. If OOM happens during shuffle read, job will be
killed and users will be notified to "Consider boosting
spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more
memory can resolve the OOM. However the approach is not perfectly suitable for
production environment, especially for data warehouse.
Using Spark SQL as data engine in warehouse, users hope to have a unified
parameter(e.g. memory) but less resource wasted(resource is allocated but not
used). The hope is strong especially when migrating data engine to Spark from
another one(e.g. Hive). Tuning the parameter for thousands of SQLs one by one
is very time consuming.
It's not always easy to predict skew situations, when happen, it make sense
to fetch remote blocks to disk for shuffle-read, rather than kill the job
because of OOM.
In this pr, I propose to fetch big blocks to disk(which is also mentioned
in SPARK-3019):
1. Track average size and also the outliers(which are larger than
2*avgSize) in MapStatus;
2. Request memory from `MemoryManager` before fetch blocks and release the
memory to `MemoryManager` when `ManagedBuffer` is released.
3. Fetch remote blocks to disk when failing acquiring memory from
`MemoryManager`, otherwise fetch to memory.
This is an improvement for memory control when shuffle blocks and help to
avoid OOM in scenarios like below:
1. Single huge block;
2. Sizes of many blocks are underestimated in `MapStatus` and the actual
footprint of blocks is much larger than the estimated.
## How was this patch tested?
Added unit test in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.
Author: jinxing <[email protected]>
Closes #16989 from jinxing64/SPARK-19659.
(cherry picked from commit 3f94e64aa8fd806ae1fa0156d846ce96afacddd3)
Signed-off-by: Wenchen Fan <[email protected]>
commit 9cbf39f1c74f16483865cd93d6ffc3c521e878a7
Author: Yanbo Liang <[email protected]>
Date: 2017-05-25T12:15:15Z
[SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
## What changes were proposed in this pull request?
Follow-up for #17218, some minor fix for PySpark ```FPGrowth```.
## How was this patch tested?
Existing UT.
Author: Yanbo Liang <[email protected]>
Closes #18089 from yanboliang/spark-19281.
(cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de)
Signed-off-by: Yanbo Liang <[email protected]>
commit e01f1f222bcb7c469b1e1595e9338ed478d99894
Author: Yan Facai (é¢åæ) <[email protected]>
Date: 2017-05-25T13:40:39Z
[SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark
FPGrowth.
## What changes were proposed in this pull request?
Expose numPartitions (expert) param of PySpark FPGrowth.
## How was this patch tested?
+ [x] Pass all unit tests.
Author: Yan Facai (é¢åæ) <[email protected]>
Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.
(cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac)
Signed-off-by: Yanbo Liang <[email protected]>
commit 022a4957d8dc8d6049e0a8c9191fcfd1bd95a4a4
Author: Lior Regev <[email protected]>
Date: 2017-05-25T16:08:19Z
[SPARK-20741][SPARK SUBMIT] Added cleanup of JARs archive generated by
SparkSubmit
## What changes were proposed in this pull request?
Deleted generated JARs archive after distribution to HDFS
## How was this patch tested?
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Lior Regev <[email protected]>
Closes #17986 from liorregev/master.
(cherry picked from commit 7306d556903c832984c7f34f1e8fe738a4b2343c)
Signed-off-by: Sean Owen <[email protected]>
commit 5ae1c652147aba9c5087335b0c6916a1035090b2
Author: hyukjinkwon <[email protected]>
Date: 2017-05-25T16:10:30Z
[SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid
path check for sc.addJar on Windows
## What changes were proposed in this pull request?
This PR proposes two things:
- A follow up for SPARK-19707 (Improving the invalid path check for
sc.addJar on Windows as well).
```
org.apache.spark.SparkContextSuite:
- add jar with invalid path *** FAILED *** (32 milliseconds)
2 was not equal to 1 (SparkContextSuite.scala:309)
...
```
- Fix path vs URI related test failures on Windows.
```
org.apache.spark.storage.LocalDirsSuite:
- SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0
milliseconds)
new java.io.File("/NONEXISTENT_PATH").exists() was true
(LocalDirsSuite.scala:50)
...
- Utils.getLocalDir() throws an exception if any temporary directory
cannot be retrieved *** FAILED *** (15 milliseconds)
Expected exception java.io.IOException to be thrown, but no exception
was thrown. (LocalDirsSuite.scala:64)
...
```
```
org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
- orc: schema should be inferred and saved when INFER_AND_SAVE is
specified *** FAILED *** (203 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
...
- parquet: schema should be inferred and saved when INFER_AND_SAVE is
specified *** FAILED *** (203 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
...
- orc: schema should be inferred but not stored when INFER_ONLY is
specified *** FAILED *** (141 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c
...
- parquet: schema should be inferred but not stored when INFER_ONLY is
specified *** FAILED *** (125 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc
...
- orc: schema should not be inferred when NEVER_INFER is specified ***
FAILED *** (156 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a
...
- parquet: schema should not be inferred when NEVER_INFER is specified ***
FAILED *** (547 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index
2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee
...
```
```
org.apache.spark.sql.execution.command.DDLSuite:
- create temporary view using *** FAILED *** (15 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist:
file:/C:projectsspark arget
mpspark-3881d9ca-561b-488d-90b9-97587472b853 mp;
...
- insert data to a data source table which has a non-existing location
should succeed *** FAILED *** (109 milliseconds)
file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54
did not equal
file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54
(DDLSuite.scala:1869)
...
- insert into a data source table with a non-existing partition location
should succeed *** FAILED *** (94 milliseconds)
file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d
did not equal
file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d
(DDLSuite.scala:1910)
...
- read data from a data source table which has a non-existing location
should succeed *** FAILED *** (93 milliseconds)
file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34
did not equal
file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34
(DDLSuite.scala:1937)
...
- read data from a data source table with non-existing partition location
should succeed *** FAILED *** (110 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- create datasource table with a non-existing location *** FAILED *** (94
milliseconds)
file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8
did not equal
file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8
(DDLSuite.scala:1982)
...
- CTAS for external data source table with a non-existing location ***
FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- CTAS for external data source table with a existed location *** FAILED
*** (15 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- data source table:partition column name containing a b *** FAILED ***
(125 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- data source table:partition column name containing a:b *** FAILED ***
(143 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- data source table:partition column name containing a%b *** FAILED ***
(109 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- data source table:partition column name containing a,b *** FAILED ***
(109 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- location uri contains a b for datasource table *** FAILED *** (94
milliseconds)
file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b
did not equal
file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b
(DDLSuite.scala:2084)
...
- location uri contains a:b for datasource table *** FAILED *** (78
milliseconds)
file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b
did not equal
file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b
(DDLSuite.scala:2084)
...
- location uri contains a%b for datasource table *** FAILED *** (78
milliseconds)
file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b
did not equal
file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b
(DDLSuite.scala:2084)
...
- location uri contains a b for database *** FAILED *** (16 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- location uri contains a:b for database *** FAILED *** (15 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- location uri contains a%b for database *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
```
```
org.apache.spark.sql.hive.execution.HiveDDLSuite:
- create hive table with a non-existing location *** FAILED *** (16
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- CTAS for external hive table with a non-existing location *** FAILED ***
(16 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- CTAS for external hive table with a existed location *** FAILED *** (16
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- partition column name of parquet table containing a b *** FAILED ***
(156 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- partition column name of parquet table containing a:b *** FAILED *** (94
milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- partition column name of parquet table containing a%b *** FAILED ***
(125 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- partition column name of parquet table containing a,b *** FAILED ***
(110 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
- partition column name of hive table containing a b *** FAILED *** (15
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- partition column name of hive table containing a:b *** FAILED *** (16
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- partition column name of hive table containing a%b *** FAILED *** (16
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- partition column name of hive table containing a,b *** FAILED *** (0
milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- hive table: location uri contains a b *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- hive table: location uri contains a:b *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
- hive table: location uri contains a%b *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path
from an empty string);
...
```
```
org.apache.spark.sql.sources.PathOptionSuite:
- path option also exist for write path *** FAILED *** (94 milliseconds)
file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc
did not equal
file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc
(PathOptionSuite.scala:98)
...
```
```
org.apache.spark.sql.CachedTableSuite:
- SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer
to this table *** FAILED *** (110 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty
string
...
```
```
org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite:
- treeString is redacted *** FAILED *** (250 milliseconds)
"file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28"
did not contain
"C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28"
(DataSourceScanExecRedactionSuite.scala:46)
...
```
## How was this patch tested?
Tested via AppVeyor for each and checked it passed once each. These should
be retested via AppVeyor in this PR.
Author: hyukjinkwon <[email protected]>
Closes #17987 from HyukjinKwon/windows-20170515.
(cherry picked from commit e9f983df275c138626af35fd263a7abedf69297f)
Signed-off-by: Sean Owen <[email protected]>
commit 7a21de9e2bb0d9344a371a8570b2fffa68c3236e
Author: Shixiong Zhu <[email protected]>
Date: 2017-05-25T17:49:14Z
[SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples
project
## What changes were proposed in this pull request?
Add Structured Streaming Kafka Source to the `examples` project so that
people can run `bin/run-example StructuredKafkaWordCount ...`.
## How was this patch tested?
manually tested it.
Author: Shixiong Zhu <[email protected]>
Closes #18101 from zsxwing/add-missing-example-dep.
(cherry picked from commit 98c3852986a2cb5f2d249d6c8ef602be283bd90e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 289dd170cb3e0b9eca9af5841a0155ceaffee447
Author: Michael Allman <[email protected]>
Date: 2017-05-26T01:25:43Z
[SPARK-20888][SQL][DOCS] Document change of default setting of
spark.sql.hive.caseSensitiveInferenceMode
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888)
## What changes were proposed in this pull request?
Document change of default setting of
spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFO to
INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes.
Author: Michael Allman <[email protected]>
Closes #18112 from mallman/spark-20888-document_infer_and_save.
(cherry picked from commit c1e7989c4ffd83c51f5c97998b4ff6fe8dd83cf4)
Signed-off-by: Wenchen Fan <[email protected]>
commit fafe283277b50974c26684b06449086acd0cf05a
Author: Wenchen Fan <[email protected]>
Date: 2017-05-26T07:01:28Z
[SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after
FileChannel.transferTo
## What changes were proposed in this pull request?
Long time ago we fixed a
[bug](https://issues.apache.org/jira/browse/SPARK-3948) in shuffle writer about
`FileChannel.transferTo`. We were not very confident about that fix, so we
added a position check after the writing, try to discover the bug earlier.
However this checking is missing in the new `UnsafeShuffleWriter`, this PR
adds it.
https://issues.apache.org/jira/browse/SPARK-18105 maybe related to that
`FileChannel.transferTo` bug, hopefully we can find out the root cause after
adding this position check.
## How was this patch tested?
N/A
Author: Wenchen Fan <[email protected]>
Closes #18091 from cloud-fan/shuffle.
(cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af)
Signed-off-by: Wenchen Fan <[email protected]>
commit f99456b5f6225a534ce52cf2b817285eb8853926
Author: NICHOLAS T. MARION <[email protected]>
Date: 2017-05-10T09:59:57Z
[SPARK-20393][WEBU UI] Strengthen Spark to prevent XSS vulnerabilities
## What changes were proposed in this pull request?
Add stripXSS and stripXSSMap to Spark Core's UIUtils. Calling these
functions at any point that getParameter is called against a HttpServletRequest.
## How was this patch tested?
Unit tests, IBM Security AppScan Standard no longer showing
vulnerabilities, manual verification of WebUI pages.
Author: NICHOLAS T. MARION <[email protected]>
Closes #17686 from n-marion/xss-fix.
(cherry picked from commit b512233a457092b0e2a39d0b42cb021abc69d375)
Signed-off-by: Sean Owen <[email protected]>
commit 92837aeb47fc3427166e4b6e62f6130f7480d7fa
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-05-16T21:47:21Z
[SPARK-19372][SQL] Fix throwing a Java exception at df.fliter() due to 64KB
bytecode size limit
## What changes were proposed in this pull request?
When an expression for `df.filter()` has many nodes (e.g. 400), the size of
Java bytecode for the generated Java code is more than 64KB. It produces an
Java exception. As a result, the execution fails.
This PR continues to execute by calling `Expression.eval()` disabling code
generation if an exception has been caught.
## How was this patch tested?
Add a test suite into `DataFrameSuite`
Author: Kazuaki Ishizaki <[email protected]>
Closes #17087 from kiszk/SPARK-19372.
commit 2b59ed4f1d4e859d5987b6eaaee074260b2a12f8
Author: Michael Armbrust <[email protected]>
Date: 2017-05-26T20:33:23Z
[SPARK-20844] Remove experimental from Structured Streaming APIs
Now that Structured Streaming has been out for several Spark release and
has large production use cases, the `Experimental` label is no longer
appropriate. I've left `InterfaceStability.Evolving` however, as I think we
may make a few changes to the pluggable Source & Sink API in Spark 2.3.
Author: Michael Armbrust <[email protected]>
Closes #18065 from marmbrus/streamingGA.
commit 30922dec8a8cc598b6715f85281591208a91df00
Author: zero323 <[email protected]>
Date: 2017-05-26T22:01:01Z
[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and
sortBy in SQL guide
## What changes were proposed in this pull request?
- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and
`bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
- Remove bucketing from Unsupported Hive Functionalities.
## How was this patch tested?
Manual tests, docs build.
Author: zero323 <[email protected]>
Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
(cherry picked from commit ae33abf71b353c638487948b775e966c7127cd46)
Signed-off-by: Xiao Li <[email protected]>
commit fc799d730304c6a176636b414fc15184e89367d7
Author: Yu Peng <[email protected]>
Date: 2017-05-26T23:28:36Z
[SPARK-10643][CORE] Make spark-submit download remote files to local in
client mode
## What changes were proposed in this pull request?
This PR makes spark-submit script download remote files to local file
system for local/standalone client mode.
## How was this patch tested?
- Unit tests
- Manual tests by adding s3a jar and testing against file on s3.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Yu Peng <[email protected]>
Closes #18078 from loneknightpy/download-jar-in-spark-submit.
(cherry picked from commit 4af37812915763ac3bfd91a600a7f00a4b84d29a)
Signed-off-by: Xiao Li <[email protected]>
commit 39f76657ef2967f4c87230e06cbbb1611c276375
Author: Wenchen Fan <[email protected]>
Date: 2017-05-27T02:57:43Z
[SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read
## What changes were proposed in this pull request?
This PR includes some minor improvement for the comments and tests in
https://github.com/apache/spark/pull/16989
## How was this patch tested?
N/A
Author: Wenchen Fan <[email protected]>
Closes #18117 from cloud-fan/follow.
(cherry picked from commit 1d62f8aca82601506c44b6fd852f4faf3602d7e2)
Signed-off-by: Wenchen Fan <[email protected]>
commit f2408bdd7a0950385ee1364e006d55bfa6e5a200
Author: Shixiong Zhu <[email protected]>
Date: 2017-05-27T05:25:38Z
[SPARK-20843][CORE] Add a config to set driver terminate timeout
## What changes were proposed in this pull request?
Add a `worker` configuration to set how long to wait before forcibly
killing driver.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #18126 from zsxwing/SPARK-20843.
(cherry picked from commit 6c1dbd6fc8d49acf7c1c902d2ebf89ed5e788a4e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 25e87d80c483785dc4a79fb283bc80f68197bf4a
Author: Wenchen Fan <[email protected]>
Date: 2017-05-27T23:16:51Z
[SPARK-20897][SQL] cached self-join should not fail
## What changes were proposed in this pull request?
The failed test case is, we have a `SortMergeJoinExec` for a self-join,
which means we have a `ReusedExchange` node in the query plan. It works fine
without caching, but throws an exception in
`SortMergeJoinExec.outputPartitioning` if we cache it.
The root cause is, `ReusedExchange` doesn't propagate the output
partitioning from its child, so in `SortMergeJoinExec.outputPartitioning` we
create `PartitioningCollection` with a hash partitioning and an unknown
partitioning, and fail.
This bug is mostly fine, because inserting the `ReusedExchange` is the last
step to prepare the physical plan, we won't call
`SortMergeJoinExec.outputPartitioning` anymore after this.
However, if the dataframe is cached, the physical plan of it becomes
`InMemoryTableScanExec`, which contains another physical plan representing the
cached query, and it has gone through the entire planning phase and may have
`ReusedExchange`. Then the planner call
`InMemoryTableScanExec.outputPartitioning`, which then calls
`SortMergeJoinExec.outputPartitioning` and trigger this bug.
## How was this patch tested?
a new regression test
Author: Wenchen Fan <[email protected]>
Closes #18121 from cloud-fan/bug.
(cherry picked from commit 08ede46b897b7e52cfe8231ffc21d9515122cf49)
Signed-off-by: Xiao Li <[email protected]>
commit dc51be1e79b89d143da5df16b893df86a306f059
Author: Xiao Li <[email protected]>
Date: 2017-05-28T04:32:18Z
[SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching
### What changes were proposed in this pull request?
In Cache manager, the plan matching should ignore Hint.
```Scala
val df1 = spark.range(10).join(broadcast(spark.range(10)))
df1.cache()
spark.range(10).join(spark.range(10)).explain()
```
The output plan of the above query shows that the second query is not
using the cached data of the first query.
```
BroadcastNestedLoopJoin BuildRight, Inner
:- *Range (0, 10, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode
+- *Range (0, 10, step=1, splits=2)
```
After the fix, the plan becomes
```
InMemoryTableScan [id#20L, id#23L]
+- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk,
memory, deserialized, 1 replicas)
+- BroadcastNestedLoopJoin BuildRight, Inner
:- *Range (0, 10, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode
+- *Range (0, 10, step=1, splits=2)
```
### How was this patch tested?
Added a test.
Author: Xiao Li <[email protected]>
Closes #18131 from gatorsmile/HintCache.
(cherry picked from commit 06c155c90dc784b07002f33d98dcfe9be1e38002)
Signed-off-by: Xiao Li <[email protected]>
commit 26640a26984bac4fc1037714e60bd3607929b377
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-05-29T19:17:14Z
[SPARK-20907][TEST] Use testQuietly for test suites that generate long log
output
## What changes were proposed in this pull request?
Supress console output by using `testQuietly` in test suites
## How was this patch tested?
Tested by `"SPARK-19372: Filter can be executed w/o generated code due to
JVM code size limit"` in `DataFrameSuite`
Author: Kazuaki Ishizaki <[email protected]>
Closes #18135 from kiszk/SPARK-20907.
(cherry picked from commit c9749068ecf8e0acabdfeeceeedff0f1f73293b7)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 3b79e4cda74e0bf82ec55e673beb8f84e7cfaca4
Author: Yuming Wang <[email protected]>
Date: 2017-05-29T23:10:22Z
[SPARK-8184][SQL] Add additional function description for weekofyear
## What changes were proposed in this pull request?
Add additional function description for weekofyear.
## How was this patch tested?
manual tests

Author: Yuming Wang <[email protected]>
Closes #18132 from wangyum/SPARK-8184.
(cherry picked from commit 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a)
Signed-off-by: Reynold Xin <[email protected]>
commit f6730a70cb47ebb3df7f42209df7b076aece1093
Author: Prashant Sharma <[email protected]>
Date: 2017-05-30T01:12:01Z
[SPARK-19968][SS] Use a cached instance of `KafkaProducer` instead of
creating one every batch.
## What changes were proposed in this pull request?
In summary, cost of recreating a KafkaProducer for writing every batch is
high as it starts a lot threads and make connections and then closes them. A
KafkaProducer instance is promised to be thread safe in Kafka docs. Reuse of
KafkaProducer instance while writing via multiple threads is encouraged.
Furthermore, I have performance improvement of 10x in latency, with this
patch.
### These are times that addBatch took in ms. Without applying this patch

### These are times that addBatch took in ms. After applying this patch

## How was this patch tested?
Running distributed benchmarks comparing runs with this patch and without
it.
Added relevant unit tests.
Author: Prashant Sharma <[email protected]>
Closes #17308 from ScrapCodes/cached-kafka-producer.
(cherry picked from commit 96a4d1d0827fc3fba83f174510b061684f0d00f7)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 5fdc7d80f46d51d4a8e49d9390b191fff42ec222
Author: Xiao Li <[email protected]>
Date: 2017-05-30T21:06:19Z
[SPARK-20924][SQL] Unable to call the function registered in the
not-current database
### What changes were proposed in this pull request?
We are unable to call the function registered in the not-current database.
```Scala
sql("CREATE DATABASE dAtABaSe1")
sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS
'${classOf[GenericUDAFAverage].getName}'")
sql("SELECT dAtABaSe1.test_avg(1)")
```
The above code returns an error:
```
Undefined function: 'dAtABaSe1.test_avg'. This function is neither a
registered temporary function nor a permanent function registered in the
database 'default'.; line 1 pos 7
```
This PR is to fix the above issue.
### How was this patch tested?
Added test cases.
Author: Xiao Li <[email protected]>
Closes #18146 from gatorsmile/qualifiedFunction.
(cherry picked from commit 4bb6a53ebd06de3de97139a2dbc7c85fc3aa3e66)
Signed-off-by: Wenchen Fan <[email protected]>
commit 287440df6816b5c9f2be2aee949a4c20ab165180
Author: jerryshao <[email protected]>
Date: 2017-05-31T03:24:43Z
[SPARK-20275][UI] Do not display "Completed" column for in-progress
applications
## What changes were proposed in this pull request?
Current HistoryServer will display completed date of in-progress
application as `1969-12-31 23:59:59`, which is not so meaningful. Instead of
unnecessarily showing this incorrect completed date, here propose to make this
column invisible for in-progress applications.
The purpose of only making this column invisible rather than deleting this
field is that: this data is fetched through REST API, and in the REST API the
format is like below shows, in which `endTime` matches `endTimeEpoch`. So
instead of changing REST API to break backward compatibility, here choosing a
simple solution to only make this column invisible.
```
[ {
"id" : "local-1491805439678",
"name" : "Spark shell",
"attempts" : [ {
"startTime" : "2017-04-10T06:23:57.574GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-10T06:23:57.574GMT",
"duration" : 0,
"sparkUser" : "",
"completed" : false,
"startTimeEpoch" : 1491805437574,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1491805437574
} ]
} ]%
```
Here is UI before changed:
<img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm"
src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
And after:
<img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm"
src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
## How was this patch tested?
Manual verification.
Author: jerryshao <[email protected]>
Closes #17588 from jerryshao/SPARK-20275.
(cherry picked from commit 52ed9b289d169219f7257795cbedc56565a39c71)
Signed-off-by: Wenchen Fan <[email protected]>
commit 3cad66e5e06a4020a16fa757fbf67f666b319bab
Author: Felix Cheung <[email protected]>
Date: 2017-05-31T05:33:29Z
[SPARK-20877][SPARKR][WIP] add timestamps to test runs
to investigate how long they run
Jenkins, AppVeyor
Author: Felix Cheung <[email protected]>
Closes #18104 from felixcheung/rtimetest.
(cherry picked from commit 382fefd1879e4670f3e9e8841ec243e3eb11c578)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 3686c2e965758f471f9784b3e06223ce143b6aca
Author: David Eis <[email protected]>
Date: 2017-05-31T12:52:55Z
[SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback
in ALS
## What changes were proposed in this pull request?
Revert the handling of negative values in ALS with implicit feedback, so
that the confidence is the absolute value of the rating and the preference is 0
for negative ratings. This was the original behavior.
## How was this patch tested?
This patch was tested with the existing unit tests and an added unit test
to ensure that negative ratings are not ignored.
mengxr
Author: David Eis <[email protected]>
Closes #18022 from davideis/bugfix/negative-rating.
(cherry picked from commit d52f636228e833db89045bc7a0c17b72da13f138)
Signed-off-by: Sean Owen <[email protected]>
commit f59f9a380351726de20453ab101f46e199a7079c
Author: liuxian <[email protected]>
Date: 2017-05-31T18:43:36Z
[SPARK-20876][SQL][BACKPORT-2.2] If the input parameter is float type for
ceil or floor,the result is not we expected
## What changes were proposed in this pull request?
This PR is to backport #18103 to Spark 2.2
## How was this patch tested?
unit test
Author: liuxian <[email protected]>
Closes #18155 from 10110346/wip-lx-0531.
commit a607a26b344470bbff1247908d49b848bb7918a0
Author: Shixiong Zhu <[email protected]>
Date: 2017-06-01T00:26:18Z
[SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException
## What changes were proposed in this pull request?
`IllegalAccessError` is a fatal error (a subclass of LinkageError) and its
meaning is `Thrown if an application attempts to access or modify a field, or
to call a method that it does not have access to`. Throwing a fatal error for
AccumulatorV2 is not necessary and is pretty bad because it usually will just
kill executors or SparkContext
([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example
of killing SparkContext due to `IllegalAccessError`). I think the correct type
of exception in AccumulatorV2 should be `IllegalStateException`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #18168 from zsxwing/SPARK-20940.
(cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 14fda6f313c63c9d5a86595c12acfb1e36df43ad
Author: jerryshao <[email protected]>
Date: 2017-06-01T05:34:53Z
[SPARK-20244][CORE] Handle incorrect bytesRead metrics when using PySpark
## What changes were proposed in this pull request?
Hadoop FileSystem's statistics in based on thread local variables, this is
ok if the RDD computation chain is running in the same thread. But if child RDD
creates another thread to consume the iterator got from Hadoop RDDs, the
bytesRead computation will be error, because now the iterator's `next()` and
`close()` may run in different threads. This could be happened when using
PySpark with PythonRDD.
So here building a map to track the `bytesRead` for different thread and
add them together. This method will be used in three RDDs, `HadoopRDD`,
`NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called
directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`.
## How was this patch tested?
Unit test and local cluster verification.
Author: jerryshao <[email protected]>
Closes #17617 from jerryshao/SPARK-20244.
(cherry picked from commit 5854f77ce1d3b9491e2a6bd1f352459da294e369)
Signed-off-by: Wenchen Fan <[email protected]>
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]