[spark] branch branch-2.3 updated: [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

2019-04-19 Thread shaneknapp
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.3 by this push:
 new a85ab12  [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x
a85ab12 is described below

commit a85ab120e3d29a323e6d28aa307d4c20ee5f2c6c
Author: shane knapp 
AuthorDate: Fri Apr 19 09:45:40 2019 -0700

[SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

## What changes were proposed in this pull request?

have jenkins test against python3.6 (instead of 3.4).

## How was this patch tested?

extensive testing on both the centos and ubuntu jenkins workers revealed that 2.3 probably doesn't like python 3.6... :(

NOTE: this is just for branch-2.3

PLEASE DO NOT MERGE

Author: shane knapp 

Closes #24380 from shaneknapp/update-python-executable-2.3.
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 3539c76..d571855 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -114,7 +114,7 @@ def run_individual_python_test(test_name, pyspark_python):
 
 
 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
     if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")





[spark] branch branch-2.4 updated: [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

2019-04-19 Thread shaneknapp
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new eaa88ae  [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x
eaa88ae is described below

commit eaa88ae5237b23fb7497838f3897a64641efe383
Author: shane knapp 
AuthorDate: Fri Apr 19 09:44:06 2019 -0700

[SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

## What changes were proposed in this pull request?

have jenkins test against python3.6 (instead of 3.4).

## How was this patch tested?

extensive testing on both the centos and ubuntu jenkins workers revealed that 2.4 doesn't like python 3.6... :(

NOTE: this is just for branch-2.4

PLEASE DO NOT MERGE

Closes #24379 from shaneknapp/update-python-executable.

Authored-by: shane knapp 
Signed-off-by: shane knapp 
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index ccbdfac..921fdc9 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -162,7 +162,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
 
 
 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
     if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")





[spark] branch master updated: [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

2019-04-19 Thread lixiao
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 777b450  [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
777b450 is described below

commit 777b4502b206b7240c6655d3c0b0a9ce08f6a09c
Author: Yuming Wang 
AuthorDate: Fri Apr 19 08:59:08 2019 -0700

[SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

## What changes were proposed in this pull request?

When we compile and test against Hadoop 3.2, we hit the following two issues:

1. `JobSummaryLevel` is not a member of object `org.apache.parquet.hadoop.ParquetOutputFormat`. Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381) (Parquet 1.9.0)
2. `java.lang.NoSuchFieldError: BROTLI` at `org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)`. Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143) (Parquet 1.10.0)

The reason is that the `parquet-hadoop-bundle-1.8.1.jar` conflicts with Parquet 1.10.1. I think it would be safe to upgrade Hive's parquet to 1.10.1 to work around this issue.
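
To illustrate the second failure mode, here is a hedged sketch (not code from this PR) of why the shadowed enum breaks. `BROTLI` was added to `CompressionCodecName` by PARQUET-1143, so code compiled against Parquet 1.10.x references a constant that simply does not exist in the 1.8.1 bundle:

```scala
// Hypothetical probe, not part of this PR. Compiled and run purely against
// parquet 1.10.1 this prints "BROTLI"; if the older
// parquet-hadoop-bundle-1.8.1.jar shadows CompressionCodecName on the
// runtime classpath, the constant is missing and the JVM throws
// java.lang.NoSuchFieldError: BROTLI.
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object BrotliCodecProbe {
  def main(args: Array[String]): Unit = {
    println(CompressionCodecName.BROTLI)
  }
}
```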

This is what Hive did when upgrading Parquet 1.8.1 to 1.10.0: [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all changes are related to vectors, and vectors are disabled by default: see [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723).

This PR removes [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle), so the Hive serde will use [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189).

## How was this patch tested?

1. manual tests
2. [upgrade Hive Parquet to 1.10.1 and run the Hadoop 3.2 test on jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962)

Closes #24346 from wangyum/SPARK-27176.

Authored-by: Yuming Wang 
Signed-off-by: gatorsmile 
---
 dev/deps/spark-deps-hadoop-3.2 | 1 -
 pom.xml| 8 +---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2
index a45f02d..8b3bd79 100644
--- a/dev/deps/spark-deps-hadoop-3.2
+++ b/dev/deps/spark-deps-hadoop-3.2
@@ -187,7 +187,6 @@ parquet-common-1.10.1.jar
 parquet-encoding-1.10.1.jar
 parquet-format-2.4.0.jar
 parquet-hadoop-1.10.1.jar
-parquet-hadoop-bundle-1.8.1.jar
 parquet-jackson-1.10.1.jar
 protobuf-java-2.5.0.jar
 py4j-0.10.8.1.jar
diff --git a/pom.xml b/pom.xml
index fce4cbd..5879a76 100644
--- a/pom.xml
+++ b/pom.xml
@@ -221,6 +221,7 @@
     -->
     <hadoop.deps.scope>compile</hadoop.deps.scope>
     <hive.deps.scope>compile</hive.deps.scope>
+    <hive.parquet.scope>${hive.deps.scope}</hive.parquet.scope>
     <orc.deps.scope>compile</orc.deps.scope>
     <parquet.deps.scope>compile</parquet.deps.scope>
     <parquet.test.deps.scope>test</parquet.test.deps.scope>
@@ -2004,7 +2005,7 @@
         <groupId>${hive.parquet.group}</groupId>
         <artifactId>parquet-hadoop-bundle</artifactId>
         <version>${hive.parquet.version}</version>
-        <scope>compile</scope>
+        <scope>${hive.parquet.scope}</scope>
       </dependency>
       <dependency>
         <groupId>org.codehaus.janino</groupId>
@@ -2818,8 +2819,9 @@
 core
 ${hive23.version}
 2.3.4
-org.apache.parquet
-1.8.1
+
+provided
 
 4.1.17
   





[spark] branch master updated: [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

2019-04-19 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 16bbe0f  [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite
16bbe0f is described below

commit 16bbe0f798d2a75ba0a9f2daff709292f6971755
Author: Shahid 
AuthorDate: Fri Apr 19 08:12:20 2019 -0700

[SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

## What changes were proposed in this pull request?

We have disabled a test related to storage in the History server suite after SPARK-13845. But after SPARK-22050, we can store information about block-update events in the event log by enabling "spark.eventLog.logBlockUpdates.enabled=true".

So, we can enable the test by adding an event log for an application that has the configuration "spark.eventLog.logBlockUpdates.enabled=true" enabled.

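For context, a minimal spark-shell sketch that would produce such an event log (the app name and log directory are illustrative, not taken from this patch):

```scala
// With spark.eventLog.logBlockUpdates.enabled=true, block-update events are
// written to the event log, so the history server can later serve the
// /storage/rdd endpoints for this application.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-storage-event-log-demo")            // illustrative name
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.logBlockUpdates.enabled", "true")
  .config("spark.eventLog.dir", "/tmp/spark-events") // illustrative path
  .getOrCreate()

// Persist an RDD so block-update events are emitted and recorded.
spark.sparkContext.parallelize(1 to 100, 8).cache().count()
```
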
## How was this patch tested?
Existing UTs

Closes #24390 from shahidki31/enableRddStorageTest.

Authored-by: Shahid 
Signed-off-by: Sean Owen 
---
 .../one_rdd_storage_json_expectation.json  | 10 ++
 .../org/apache/spark/deploy/history/HistoryServerSuite.scala   |  8 +---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
new file mode 100644
index 0000000..09afdf5
--- /dev/null
+++ b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
@@ -0,0 +1,10 @@
+{
+  "id" : 0,
+  "name" : "0",
+  "numPartitions" : 8,
+  "numCachedPartitions" : 0,
+  "storageLevel" : "Memory Deserialized 1x Replicated",
+  "memoryUsed" : 0,
+  "diskUsed" : 0,
+  "partitions" : [ ]
+}
diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
index c99bcf2..5ba27a0 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
@@ -174,9 +174,11 @@ class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers
     "executor node blacklisting unblacklisting" -> "applications/app-20161115172038-/executors",
     "executor memory usage" -> "applications/app-20161116163331-/executors",

-    "app environment" -> "applications/app-20161116163331-/environment"
-    // Todo: enable this test when logging the even of onBlockUpdated. See: SPARK-13845
-    // "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
+    "app environment" -> "applications/app-20161116163331-/environment",
+
+    // Enable "spark.eventLog.logBlockUpdates.enabled", to get the storage information
+    // in the history server.
+    "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
   )

   // run a bunch of characterization tests -- just verify the behavior is the same as what is saved





[spark] branch master updated: [SPARK-27504][SQL] File source V2: support refreshing metadata cache

2019-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 31488e1  [SPARK-27504][SQL] File source V2: support refreshing metadata cache
31488e1 is described below

commit 31488e1ca506efd34459e6bc9a08b6d0956c8d44
Author: Gengliang Wang 
AuthorDate: Fri Apr 19 18:26:03 2019 +0800

[SPARK-27504][SQL] File source V2: support refreshing metadata cache

## What changes were proposed in this pull request?

In file source V1, if some file is deleted manually, reading the DataFrame/Table throws an exception with the suggestion message:
```
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
```
After refreshing the table/DataFrame, the reads should return correct results.

We should follow it in file source V2 as well.
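
A hedged sketch of the recovery flow this enables (the table name `demo_table` is made up for illustration):

```scala
// Illustrative only, not code from this PR. After a backing file is deleted,
// the first read fails with the suggestion message above. REFRESH TABLE
// reaches the new DataSourceV2Relation.refresh(), which calls
// fileIndex.refresh() for a FileTable, so the next read re-lists the files.
spark.sql("REFRESH TABLE demo_table")  // or: spark.catalog.refreshTable("demo_table")
spark.table("demo_table").count()      // returns correct results again
```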
## How was this patch tested?

Unit test

Closes #24401 from gengliangwang/refreshFileTable.

Authored-by: Gengliang Wang 
Signed-off-by: Wenchen Fan 
---
 .../datasources/v2/DataSourceV2Relation.scala  |  5 +++
 .../datasources/v2/FilePartitionReader.scala   | 38 --
 .../org/apache/spark/sql/MetadataCacheSuite.scala  | 35 ++--
 3 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
index 4119957..e7e0be0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
@@ -66,6 +66,11 @@ case class DataSourceV2Relation(
   override def newInstance(): DataSourceV2Relation = {
     copy(output = output.map(_.newInstance()))
   }
+
+  override def refresh(): Unit = table match {
+    case table: FileTable => table.fileIndex.refresh()
+    case _ => // Do nothing.
+  }
 }

 /**
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
index 7c7b468..d4bad29 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
@@ -34,25 +34,27 @@ class FilePartitionReader[T](readers: Iterator[PartitionedFileReader[T]])
   override def next(): Boolean = {
     if (currentReader == null) {
       if (readers.hasNext) {
-        if (ignoreMissingFiles || ignoreCorruptFiles) {
-          try {
-            currentReader = getNextReader()
-          } catch {
-            case e: FileNotFoundException if ignoreMissingFiles =>
-              logWarning(s"Skipped missing file: $currentReader", e)
-              currentReader = null
-              return false
-            // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
-            case e: FileNotFoundException if !ignoreMissingFiles => throw e
-            case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
-              logWarning(
-                s"Skipped the rest of the content in the corrupted file: $currentReader", e)
-              currentReader = null
-              InputFileBlockHolder.unset()
-              return false
-          }
-        } else {
+        try {
           currentReader = getNextReader()
+        } catch {
+          case e: FileNotFoundException if ignoreMissingFiles =>
+            logWarning(s"Skipped missing file: $currentReader", e)
+            currentReader = null
+            return false
+          // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
+          case e: FileNotFoundException if !ignoreMissingFiles =>
+            throw new FileNotFoundException(
+              e.getMessage + "\n" +
+                "It is possible the underlying files have been updated. " +
+                "You can explicitly invalidate the cache in Spark by " +
+                "running 'REFRESH TABLE tableName' command in SQL or " +
+                "by recreating the Dataset/DataFrame involved.")
+          case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
+            logWarning(
+              s"Skipped the rest of the content in the corrupted file: $currentReader", e)
+            currentReader = null
+            InputFileBlockHolder.unset()
+            return false
         }
       } else {
         return false
diff --git a/sql/core/src/test/sc