[spark] branch branch-2.3 updated: [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x
shaneknapp pushed a commit to branch branch-2.3 in repository
https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new a85ab12  [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

a85ab12 is described below

commit a85ab120e3d29a323e6d28aa307d4c20ee5f2c6c
Author:     shane knapp
AuthorDate: Fri Apr 19 09:45:40 2019 -0700

    [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

    ## What changes were proposed in this pull request?

    have jenkins test against python3.6 (instead of 3.4).

    ## How was this patch tested?

    extensive testing on both the centos and ubuntu jenkins workers revealed
    that 2.3 probably doesn't like python 3.6... :(

    NOTE: this is just for branch-2.3

    PLEASE DO NOT MERGE

    Author: shane knapp

    Closes #24380 from shaneknapp/update-python-executable-2.3.
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 3539c76..d571855 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -114,7 +114,7 @@ def run_individual_python_test(test_name, pyspark_python):


 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
     if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")
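The touched helper is short; for readers skimming the hunk, here is a self-contained sketch of the selection logic using Python's standard `shutil.which`. The real run-tests.py resolves `which` from its own test-support helpers and logs through `LOGGER`, and the final `python` fallback line is assumed from the warning text rather than shown in the hunk:

```python
from shutil import which  # stand-in for run-tests.py's own which() helper


def get_default_python_executables():
    # Keep only the candidate interpreters that exist on this worker's PATH;
    # the patch swaps python3.4 out for python3.6 in this candidate list.
    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
    if "python2.7" not in python_execs:
        # The real script emits this through LOGGER.warning(...).
        print("Not testing against `python2.7` because it could not be found; "
              "falling back to `python` instead")
        python_execs.insert(0, "python")  # assumed fallback, per the warning
    return python_execs


if __name__ == "__main__":
    print(get_default_python_executables())
```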
[spark] branch branch-2.4 updated: [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x
shaneknapp pushed a commit to branch branch-2.4 in repository
https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new eaa88ae  [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

eaa88ae is described below

commit eaa88ae5237b23fb7497838f3897a64641efe383
Author:     shane knapp
AuthorDate: Fri Apr 19 09:44:06 2019 -0700

    [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

    ## What changes were proposed in this pull request?

    have jenkins test against python3.6 (instead of 3.4).

    ## How was this patch tested?

    extensive testing on both the centos and ubuntu jenkins workers revealed
    that 2.4 doesn't like python 3.6... :(

    NOTE: this is just for branch-2.4

    PLEASE DO NOT MERGE

    Closes #24379 from shaneknapp/update-python-executable.

    Authored-by: shane knapp
    Signed-off-by: shane knapp
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index ccbdfac..921fdc9 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -162,7 +162,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):


 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
     if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")
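The branch-2.4 patch is the same one-liner as the branch-2.3 change above, so the practical question is which interpreters a given Jenkins worker actually exposes. A quick hypothetical probe, where the interpreter list simply mirrors the old and new candidates:

```python
from shutil import which

# Report which of the runner's old and new candidate interpreters exist here;
# on a worker with 3.6 installed but not 3.4, the old list silently skipped
# testing against Python 3 entirely.
for exe in ["python2.7", "python3.4", "python3.6", "pypy"]:
    print("{:>10}: {}".format(exe, which(exe) or "not found"))
```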
[spark] branch master updated: [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
lixiao pushed a commit to branch master in repository
https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 777b450  [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

777b450 is described below

commit 777b4502b206b7240c6655d3c0b0a9ce08f6a09c
Author:     Yuming Wang
AuthorDate: Fri Apr 19 08:59:08 2019 -0700

    [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

    ## What changes were proposed in this pull request?

    When we compile and test with Hadoop 3.2, we hit the following two issues:

    1. `JobSummaryLevel is not a member of object org.apache.parquet.hadoop.ParquetOutputFormat`.
       Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381) (Parquet 1.9.0)
    2. `java.lang.NoSuchFieldError: BROTLI at
       org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)`.
       Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143) (Parquet 1.10.0)

    The reason is that `parquet-hadoop-bundle-1.8.1.jar` conflicts with Parquet 1.10.1. I think
    it would be safe to upgrade Hive's parquet to 1.10.1 to work around this issue.

    This is what Hive did when upgrading Parquet 1.8.1 to 1.10.0:
    [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and
    [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all changes
    are related to vectors, and vectors are disabled by default: see
    [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and
    [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723).

    This pr removes
    [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle),
    so Hive serde will use
    [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189).

    ## How was this patch tested?

    1. manual tests
    2. [upgrade Hive Parquet to 1.10.1 and run Hadoop 3.2 test on jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962)

    Closes #24346 from wangyum/SPARK-27176.

    Authored-by: Yuming Wang
    Signed-off-by: gatorsmile
---
 dev/deps/spark-deps-hadoop-3.2 | 1 -
 pom.xml                        | 8 +++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2
index a45f02d..8b3bd79 100644
--- a/dev/deps/spark-deps-hadoop-3.2
+++ b/dev/deps/spark-deps-hadoop-3.2
@@ -187,7 +187,6 @@ parquet-common-1.10.1.jar
 parquet-encoding-1.10.1.jar
 parquet-format-2.4.0.jar
 parquet-hadoop-1.10.1.jar
-parquet-hadoop-bundle-1.6.0.jar
 parquet-jackson-1.10.1.jar
 protobuf-java-2.5.0.jar
 py4j-0.10.8.1.jar

diff --git a/pom.xml b/pom.xml
index fce4cbd..5879a76 100644
--- a/pom.xml
+++ b/pom.xml
@@ -221,6 +221,7 @@
     -->
    <hadoop.deps.scope>compile</hadoop.deps.scope>
    <hive.deps.scope>compile</hive.deps.scope>
+    <hive.parquet.scope>${hive.deps.scope}</hive.parquet.scope>
    <orc.deps.scope>compile</orc.deps.scope>
    <parquet.deps.scope>compile</parquet.deps.scope>
    <parquet.test.deps.scope>test</parquet.test.deps.scope>
@@ -2004,7 +2005,7 @@
        <groupId>${hive.parquet.group}</groupId>
        <artifactId>parquet-hadoop-bundle</artifactId>
        <version>${hive.parquet.version}</version>
-        <scope>compile</scope>
+        <scope>${hive.parquet.scope}</scope>
      </dependency>
      <dependency>
        <groupId>org.codehaus.janino</groupId>
@@ -2818,8 +2819,9 @@
        <hive.classifier>core</hive.classifier>
        <hive.version>${hive23.version}</hive.version>
        <hive.version.short>2.3.4</hive.version.short>
-        <hive.parquet.group>org.apache.parquet</hive.parquet.group>
-        <hive.parquet.version>1.8.1</hive.parquet.version>
+
+        <hive.parquet.scope>provided</hive.parquet.scope>
        <datanucleus-core.version>4.1.17</datanucleus-core.version>
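As a rough illustration of the `NoSuchFieldError: BROTLI` symptom described above, one can probe the JVM classpath from PySpark through the py4j gateway. This is a hypothetical diagnostic, not part of the patch, and `spark._jvm` is an internal handle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Resolve the Parquet codec enum that parquet-hadoop-bundle-1.8.1 can shadow.
codec = spark._jvm.org.apache.parquet.hadoop.metadata.CompressionCodecName

# On a consistent Parquet 1.10.x classpath this prints "BROTLI"; if the old
# bundle wins, class initialization fails with NoSuchFieldError: BROTLI.
print(codec.valueOf("BROTLI").name())
```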
[spark] branch master updated: [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite
srowen pushed a commit to branch master in repository
https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 16bbe0f  [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

16bbe0f is described below

commit 16bbe0f798d2a75ba0a9f2daff709292f6971755
Author:     Shahid
AuthorDate: Fri Apr 19 08:12:20 2019 -0700

    [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

    ## What changes were proposed in this pull request?

    We disabled a storage-related test in the History server suite after SPARK-13845. But after
    SPARK-22050, block-updated events can be written to the event log by enabling
    "spark.eventLog.logBlockUpdates.enabled=true". So we can re-enable the test by adding an
    event log for an application that ran with that configuration enabled.

    ## How was this patch tested?

    Existing UTs

    Closes #24390 from shahidki31/enableRddStorageTest.

    Authored-by: Shahid
    Signed-off-by: Sean Owen
---
 .../one_rdd_storage_json_expectation.json                    | 10 ++++++++++
 .../org/apache/spark/deploy/history/HistoryServerSuite.scala |  8 +++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
new file mode 100644
index 0000000..09afdf5
--- /dev/null
+++ b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
@@ -0,0 +1,10 @@
+{
+  "id" : 0,
+  "name" : "0",
+  "numPartitions" : 8,
+  "numCachedPartitions" : 0,
+  "storageLevel" : "Memory Deserialized 1x Replicated",
+  "memoryUsed" : 0,
+  "diskUsed" : 0,
+  "partitions" : [ ]
+}

diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
index c99bcf2..5ba27a0 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
@@ -174,9 +174,11 @@ class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers
     "executor node blacklisting unblacklisting" ->
       "applications/app-20161115172038-0000/executors",
     "executor memory usage" -> "applications/app-20161116163331-0000/executors",
-    "app environment" -> "applications/app-20161116163331-0000/environment"
-    // Todo: enable this test when logging the even of onBlockUpdated. See: SPARK-13845
-    // "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
+    "app environment" -> "applications/app-20161116163331-0000/environment",
+
+    // Enable "spark.eventLog.logBlockUpdates.enabled", to get the storage information
+    // in the history server.
+    "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
   )

   // run a bunch of characterization tests -- just verify the behavior is the same as what is saved
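To produce an event log like the new fixture, an application has to opt in to block-update logging; those events are verbose, so the flag defaults to false. A minimal PySpark sketch, where the event-log directory is a placeholder and the 8-partition cached RDD mirrors the expectation file's `"numPartitions" : 8`:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rdd-storage-eventlog")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:/tmp/spark-events")    # placeholder path
    .config("spark.eventLog.logBlockUpdates.enabled", "true")  # log block updates
    .getOrCreate()
)

# Cache an 8-partition RDD so the log (and later the history server's
# /storage/rdd/0 endpoint) has RDD storage information to report.
rdd = spark.sparkContext.parallelize(range(1000), 8).cache()
rdd.count()

spark.stop()
```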
[spark] branch master updated: [SPARK-27504][SQL] File source V2: support refreshing metadata cache
wenchen pushed a commit to branch master in repository
https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 31488e1  [SPARK-27504][SQL] File source V2: support refreshing metadata cache

31488e1 is described below

commit 31488e1ca506efd34459e6bc9a08b6d0956c8d44
Author:     Gengliang Wang
AuthorDate: Fri Apr 19 18:26:03 2019 +0800

    [SPARK-27504][SQL] File source V2: support refreshing metadata cache

    ## What changes were proposed in this pull request?

    In file source V1, if some file is deleted manually, reading the DataFrame/Table throws an
    exception with this suggestion message:
    ```
    It is possible the underlying files have been updated. You can explicitly invalidate the
    cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the
    Dataset/DataFrame involved.
    ```
    After refreshing the table/DataFrame, the reads should return correct results. We should
    follow this in file source V2 as well.

    ## How was this patch tested?

    Unit test

    Closes #24401 from gengliangwang/refreshFileTable.

    Authored-by: Gengliang Wang
    Signed-off-by: Wenchen Fan
---
 .../datasources/v2/DataSourceV2Relation.scala      |  5 +++++
 .../datasources/v2/FilePartitionReader.scala       | 38 ++++++++++++------------
 .../org/apache/spark/sql/MetadataCacheSuite.scala  | 35 +++++++++++++++++----
 3 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
index 4119957..e7e0be0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
@@ -66,6 +66,11 @@ case class DataSourceV2Relation(
   override def newInstance(): DataSourceV2Relation = {
     copy(output = output.map(_.newInstance()))
   }
+
+  override def refresh(): Unit = table match {
+    case table: FileTable => table.fileIndex.refresh()
+    case _ => // Do nothing.
+  }
 }

 /**

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
index 7c7b468..d4bad29 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
@@ -34,25 +34,27 @@ class FilePartitionReader[T](readers: Iterator[PartitionedFileReader[T]])
   override def next(): Boolean = {
     if (currentReader == null) {
       if (readers.hasNext) {
-        if (ignoreMissingFiles || ignoreCorruptFiles) {
-          try {
-            currentReader = getNextReader()
-          } catch {
-            case e: FileNotFoundException if ignoreMissingFiles =>
-              logWarning(s"Skipped missing file: $currentReader", e)
-              currentReader = null
-              return false
-            // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
-            case e: FileNotFoundException if !ignoreMissingFiles => throw e
-            case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
-              logWarning(
-                s"Skipped the rest of the content in the corrupted file: $currentReader", e)
-              currentReader = null
-              InputFileBlockHolder.unset()
-              return false
-          }
-        } else {
+        try {
           currentReader = getNextReader()
+        } catch {
+          case e: FileNotFoundException if ignoreMissingFiles =>
+            logWarning(s"Skipped missing file: $currentReader", e)
+            currentReader = null
+            return false
+          // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
+          case e: FileNotFoundException if !ignoreMissingFiles =>
+            throw new FileNotFoundException(
+              e.getMessage + "\n" +
+                "It is possible the underlying files have been updated. " +
+                "You can explicitly invalidate the cache in Spark by " +
+                "running 'REFRESH TABLE tableName' command in SQL or " +
+                "by recreating the Dataset/DataFrame involved.")
+          case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
+            logWarning(
+              s"Skipped the rest of the content in the corrupted file: $currentReader", e)
+            currentReader = null
+            InputFileBlockHolder.unset()
+            return false
         }
       } else {
         return false

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/MetadataCacheSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/MetadataCacheSuite.scala
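From the user's side, file source V2 now behaves like V1: after files behind a table are deleted out-of-band, the scan fails with the suggestion above until the cached file listing is refreshed. A sketch in PySpark, where the path and view name are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.parquet("/data/events").createOrReplaceTempView("events")
spark.table("events").count()   # caches the file listing

# ... an external process deletes some files under /data/events ...

# Without a refresh, the next scan can fail with the FileNotFoundException
# carrying the 'REFRESH TABLE' hint; either call below invalidates the listing.
spark.sql("REFRESH TABLE events")
# or: spark.catalog.refreshTable("events")
spark.table("events").count()   # re-lists the remaining files
```

Alternatively, setting `spark.sql.files.ignoreMissingFiles=true` makes the scan skip deleted files with a warning instead of failing, which is the `ignoreMissingFiles` branch kept in the reader above.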