svn commit: r30417 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_22_02-eff1c50-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 05:19:04 2018 New Revision: 30417 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_22_02-eff1c50 docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50, so it was shortened to a summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608
Repository: spark Updated Branches: refs/heads/branch-2.4 eff1c5016 -> f37bceadf [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts https://github.com/apache/spark/commit/8acb51f08b448628b65e90af3b268994f9550e45 which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin Signed-off-by: Wenchen Fan (cherry picked from commit 89d748b33c8636a1b1411c505921b0a585e1e6cb) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f37bcead Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f37bcead Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f37bcead Branch: refs/heads/branch-2.4 Commit: f37bceadf2135348c006c3d37ab7d6101cfe2267 Parents: eff1c50 Author: Reynold Xin Authored: Fri Oct 26 13:17:24 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 13:17:50 2018 +0800 -- python/pyspark/sql/functions.py | 30 - python/pyspark/sql/window.py| 70 +--- .../apache/spark/sql/expressions/Window.scala | 46 + .../spark/sql/expressions/WindowSpec.scala | 45 + .../scala/org/apache/spark/sql/functions.scala | 12 ++-- 5 files changed, 28 insertions(+), 175 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f37bcead/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 785e55e..9485c28 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -855,36 +855,6 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) -@since(2.4) -def unboundedPreceding(): -""" -Window function: returns the special frame boundary that represents the first row -in the window 
partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedPreceding()) - - -@since(2.4) -def unboundedFollowing(): -""" -Window function: returns the special frame boundary that represents the last row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedFollowing()) - - -@since(2.4) -def currentRow(): -""" -Window function: returns the special frame boundary that represents the current row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.currentRow()) - - # -- Date/Timestamp functions -- @since(1.5) http://git-wip-us.apache.org/repos/asf/spark/blob/f37bcead/python/pyspark/sql/window.py -- diff --git a/python/pyspark/sql/window.py b/python/pyspark/sql/window.py index d19ced9..e76563d 100644 --- a/python/pyspark/sql/window.py +++ b/python/pyspark/sql/window.py @@ -16,11 +16,9 @@ # import sys -if sys.version >= '3': -long = int from pyspark import since, SparkContext -from pyspark.sql.column import Column, _to_seq, _to_java_column +from pyspark.sql.column import _to_seq, _to_java_column __all__ = ["Window", "WindowSpec"] @@ -126,45 +124,20 @@ class Window(object): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral +values directly. :param start: boundary start, inclusive. 
- The frame is unbounded if this is ``Window.unboundedPreceding``, - a column returned by ``pyspark.sql.functions.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, -a column returned by ``pyspark.sql.functions.unboundedFollowing``, or +
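For readers following the deprecation above: the recommended `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` constants are simply extreme integer sentinels, as the docstring's `max(-sys.maxsize, -9223372036854775808)` wording suggests. Below is a minimal pure-Python sketch of the rows-between frame semantics those sentinels express — all names are local stand-ins for illustration, not the PySpark API:

```python
import sys

# Hypothetical local stand-ins for the special boundary values (not the real
# Window attributes): anything <= -sys.maxsize means "unbounded preceding",
# anything >= sys.maxsize means "unbounded following", 0 is the current row.
UNBOUNDED_PRECEDING = -sys.maxsize - 1
UNBOUNDED_FOLLOWING = sys.maxsize
CURRENT_ROW = 0

def rows_between(values, start, end):
    """Toy rowsBetween: for each row i, sum the rows in [i + start, i + end]."""
    n = len(values)
    out = []
    for i in range(n):
        # Clamp the frame to the partition when a boundary is "unbounded".
        lo = 0 if start <= -sys.maxsize else max(0, i + start)
        hi = n - 1 if end >= sys.maxsize else min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]))
    return out

# Running sum from the start of the partition up to the current row:
print(rows_between([1, 2, 3, 4], UNBOUNDED_PRECEDING, CURRENT_ROW))  # [1, 3, 6, 10]
```

Using the class-level constants keeps the intent explicit instead of scattering magic numbers like `-9223372036854775808` through user code, which is the point of the recommendation in the patched docstring.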
svn commit: r30416 - in /dev/spark/2.3.3-SNAPSHOT-2018_10_25_22_02-0a05cf9-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 05:17:42 2018 New Revision: 30416 Log: Apache Spark 2.3.3-SNAPSHOT-2018_10_25_22_02-0a05cf9 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50, so it was shortened to a summary.]
spark git commit: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608
Repository: spark Updated Branches: refs/heads/master 86d469aea -> 89d748b33 [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts https://github.com/apache/spark/commit/8acb51f08b448628b65e90af3b268994f9550e45 which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/89d748b3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/89d748b3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/89d748b3 Branch: refs/heads/master Commit: 89d748b33c8636a1b1411c505921b0a585e1e6cb Parents: 86d469a Author: Reynold Xin Authored: Fri Oct 26 13:17:24 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 13:17:24 2018 +0800 -- python/pyspark/sql/functions.py | 30 - python/pyspark/sql/window.py| 70 +--- .../apache/spark/sql/expressions/Window.scala | 46 + .../spark/sql/expressions/WindowSpec.scala | 45 + .../scala/org/apache/spark/sql/functions.scala | 12 ++-- 5 files changed, 28 insertions(+), 175 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/89d748b3/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 8b2e423..739496b 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -858,36 +858,6 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) -@since(2.4) -def unboundedPreceding(): -""" -Window function: returns the special frame boundary that represents the first row -in the window partition. 
-""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedPreceding()) - - -@since(2.4) -def unboundedFollowing(): -""" -Window function: returns the special frame boundary that represents the last row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedFollowing()) - - -@since(2.4) -def currentRow(): -""" -Window function: returns the special frame boundary that represents the current row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.currentRow()) - - # -- Date/Timestamp functions -- @since(1.5) http://git-wip-us.apache.org/repos/asf/spark/blob/89d748b3/python/pyspark/sql/window.py -- diff --git a/python/pyspark/sql/window.py b/python/pyspark/sql/window.py index d19ced9..e76563d 100644 --- a/python/pyspark/sql/window.py +++ b/python/pyspark/sql/window.py @@ -16,11 +16,9 @@ # import sys -if sys.version >= '3': -long = int from pyspark import since, SparkContext -from pyspark.sql.column import Column, _to_seq, _to_java_column +from pyspark.sql.column import _to_seq, _to_java_column __all__ = ["Window", "WindowSpec"] @@ -126,45 +124,20 @@ class Window(object): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral +values directly. :param start: boundary start, inclusive. 
- The frame is unbounded if this is ``Window.unboundedPreceding``, - a column returned by ``pyspark.sql.functions.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, -a column returned by ``pyspark.sql.functions.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, or any value
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/branch-2.3 8fbf3ee91 -> 0a05cf917 [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN (cherry picked from commit 86d469aeaa492c0642db09b27bb0879ead5d7166) Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a05cf91 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a05cf91 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a05cf91 Branch: refs/heads/branch-2.3 Commit: 0a05cf917a2f701def38968e59217a98b6da8d8b Parents: 8fbf3ee Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:54:55 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0a05cf91/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 754a654..f7d1461 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -84,15 +84,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("SPARK_REUSE_WORKER", "1") } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -109,7 +111,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -126,7 +128,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -272,7 +274,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -343,9 +345,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker &&
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/branch-2.4 adfd1057d -> eff1c5016 [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN (cherry picked from commit 86d469aeaa492c0642db09b27bb0879ead5d7166) Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eff1c501 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eff1c501 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eff1c501 Branch: refs/heads/branch-2.4 Commit: eff1c5016cc78ad3d26e5e83cd0d72c56c35cf0a Parents: adfd105 Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:54:14 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eff1c501/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 6e53a04..f73e95e 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -106,15 +106,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("PYSPARK_EXECUTOR_MEMORY_MB", memoryMb.get.toString) } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener[Unit] { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -131,7 +133,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -148,7 +150,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -392,7 +394,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -463,9 +465,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker &&
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/master 24e8c27df -> 86d469aea [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86d469ae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86d469ae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86d469ae Branch: refs/heads/master Commit: 86d469aeaa492c0642db09b27bb0879ead5d7166 Parents: 24e8c27 Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:53:51 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/86d469ae/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 6e53a04..f73e95e 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -106,15 +106,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("PYSPARK_EXECUTOR_MEMORY_MB", memoryMb.get.toString) } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener[Unit] { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -131,7 +133,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -148,7 +150,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -392,7 +394,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -463,9 +465,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker && releasedOrClosed.compareAndSet(false, true)) { env.releasePythonWorker(pythonExec,
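The fix above hinges on `AtomicBoolean.compareAndSet`: of the two code paths racing to dispose of the worker (the task-completion listener closing it, and `handleEndOfDataSection` releasing it back to the pool), only one can win the `false -> true` transition, so a closed worker can never also be returned for reuse. Here is a minimal pure-Python sketch of that pattern — a lock-based stand-in for the JVM class, for illustration only:

```python
import threading

class AtomicBoolean:
    """Minimal stand-in for java.util.concurrent.atomic.AtomicBoolean."""
    def __init__(self, value=False):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expect, update):
        # Atomically flip the flag only if it still holds `expect`;
        # exactly one caller can win the False -> True transition.
        with self._lock:
            if self._value == expect:
                self._value = update
                return True
            return False

released_or_closed = AtomicBoolean(False)
winners = []

def release_or_close(actor):
    # Both the task-completion listener and handleEndOfDataSection race here;
    # only the winner actually closes or releases the worker.
    if released_or_closed.compare_and_set(False, True):
        winners.append(actor)

threads = [threading.Thread(target=release_or_close, args=(name,))
           for name in ("completion_listener", "end_of_data_section")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(winners)  # exactly one element, whichever path won the race
```

The earlier `!released.get` check was a read followed by a separate close, which left a window between the read and the action; `compareAndSet` collapses the check and the state change into one atomic step.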
spark git commit: [SPARK-25819][SQL] Support parse mode option for the function `from_avro`
Repository: spark Updated Branches: refs/heads/master 79f3babcc -> 24e8c27df [SPARK-25819][SQL] Support parse mode option for the function `from_avro` ## What changes were proposed in this pull request? Currently, the function `from_avro` throws an exception when reading corrupt records. In practice, there could be various causes of data corruption. It would be good to support `PERMISSIVE` mode and allow the function from_avro to process all the input files/streams, which is consistent with from_json and from_csv. There is no obvious downside to supporting `PERMISSIVE` mode. Unlike `from_csv` and `from_json`, the default parse mode is `FAILFAST`, for the following reasons: 1. Since Avro is a structured data format, input data can usually be parsed with a certain schema. In that case, exposing the problems in the input data to users is better than hiding them. 2. For `PERMISSIVE` mode, we have to force the data schema to be fully nullable. This seems quite unnecessary for Avro. Preserving a non-null schema might enable more performance optimizations in Spark. 3. To be consistent with the behavior in Spark 2.4. ## How was this patch tested? Unit tests. Manual preview of the generated HTML for the Avro data source doc: ![image](https://user-images.githubusercontent.com/1097932/47510100-02558880-d8aa-11e8-9d57-a43daee4c6b9.png) Closes #22814 from gengliangwang/improve_from_avro. 
Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/24e8c27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/24e8c27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/24e8c27d Branch: refs/heads/master Commit: 24e8c27dfe31e6e0a53c89e6ddc36327e537931b Parents: 79f3bab Author: Gengliang Wang Authored: Fri Oct 26 11:39:38 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 26 11:39:38 2018 +0800 -- docs/sql-data-sources-avro.md | 18 +++- .../spark/sql/avro/AvroDataToCatalyst.scala | 90 +--- .../org/apache/spark/sql/avro/AvroOptions.scala | 16 +++- .../org/apache/spark/sql/avro/package.scala | 28 +- .../avro/AvroCatalystDataConversionSuite.scala | 58 +++-- .../spark/sql/avro/AvroFunctionsSuite.scala | 36 +++- 6 files changed, 219 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/docs/sql-data-sources-avro.md -- diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index d3b81f0..bfe641d 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -142,7 +142,10 @@ StreamingQuery query = output ## Data Source Option -Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`. +Data source options of Avro can be set via: + * the `.option` method on `DataFrameReader` or `DataFrameWriter`. + * the `options` parameter in function `from_avro`. + Property NameDefaultMeaningScope @@ -177,6 +180,19 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. write + +mode +FAILFAST +The mode option allows to specify parse mode for function from_avro. 
+ Currently supported modes are: + +FAILFAST: Throws an exception on processing corrupted record. +PERMISSIVE: Corrupt records are processed as null result. Therefore, the +data schema is forced to be fully nullable, which might be different from the one user provided. + + +function from_avro + ## Configuration http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..43d3f6e 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -17,20 +17,37 @@ package org.apache.spark.sql.avro +import scala.util.control.NonFatal + import org.apache.avro.Schema import org.apache.avro.generic.GenericDatumReader import org.apache.avro.io.{BinaryDecoder, DecoderFactory} -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import
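To make the two parse modes in the doc change above concrete, here is a toy pure-Python sketch of the FAILFAST/PERMISSIVE contract — `parse_record` and its integer "decoding" are hypothetical stand-ins for illustration, not the real `from_avro` implementation:

```python
def parse_record(raw, mode="FAILFAST"):
    """Toy parser illustrating the two parse modes: FAILFAST raises on a
    corrupt record, PERMISSIVE turns it into a null (None) result."""
    try:
        return int(raw)  # stand-in for Avro binary decoding
    except ValueError:
        if mode == "FAILFAST":
            raise ValueError(f"Malformed record: {raw!r}")
        if mode == "PERMISSIVE":
            return None  # corrupt records become null results
        raise ValueError(f"Unknown parse mode: {mode}")

# PERMISSIVE keeps the pipeline running and nulls out the bad record:
print([parse_record(r, mode="PERMISSIVE") for r in ["1", "oops", "3"]])  # [1, None, 3]
```

Note how PERMISSIVE forces every result to be nullable — exactly the trade-off cited in the commit message for defaulting to FAILFAST.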
spark git commit: [MINOR][TEST][BRANCH-2.4] Regenerate golden file `datetime.sql.out`
Repository: spark Updated Branches: refs/heads/branch-2.4 b739fb0d7 -> adfd1057d [MINOR][TEST][BRANCH-2.4] Regenerate golden file `datetime.sql.out` ## What changes were proposed in this pull request? `datetime.sql.out` is a generated golden file, but it was slightly broken during a manual [revert](https://github.com/dongjoon-hyun/spark/commit/5d744499667fcd08825bca0ac6d5d90d6e110ebc#diff-79dd276be45ede6f34e24ad7005b0a7cR87). This doesn't cause a test failure because the difference is only in comments and blank lines. We had better fix this minor issue before RC5. ## How was this patch tested? Passes Jenkins. Closes #22837 from dongjoon-hyun/fix_datetime_sql_out. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/adfd1057 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/adfd1057 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/adfd1057 Branch: refs/heads/branch-2.4 Commit: adfd1057dae3b48c05d7443e3aee23157965e6d1 Parents: b739fb0 Author: Dongjoon Hyun Authored: Thu Oct 25 20:37:07 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:37:07 2018 -0700 -- sql/core/src/test/resources/sql-tests/results/datetime.sql.out | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/adfd1057/sql/core/src/test/resources/sql-tests/results/datetime.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out index 4e1cfa6..63aa004 100644 --- a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out @@ -82,9 +82,10 @@ struct 1 2 2 3 + -- !query 9 select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15') --- !query 3 schema +-- !query 9 schema struct --- 
!query 3 output +-- !query 9 output 5 3 5 NULL4
spark git commit: [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary
Repository: spark Updated Branches: refs/heads/master dc9b32080 -> 79f3babcc [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary ## What changes were proposed in this pull request? We vote for the artifacts. All releases are in the form of the source materials needed to make changes to the software being released. (http://www.apache.org/legal/release-policy.html#artifacts) From Spark 2.4.0, the source artifact and binary artifact start to contain their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. However, unfortunately, `dev/make-distribution.sh` inside the source artifact starts to fail because it expects `LICENSE-binary`, while the source artifact has only the LICENSE file. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz `dev/make-distribution.sh` is used during the voting phase because we vote on that source artifact instead of the GitHub repository. Individual contributors usually don't have the downstream repository and try to build the voted-on source artifact to help verify it during the voting phase. (Personally, I have done so before.) This PR aims to make that script work in either case. It doesn't aim to make source artifacts reproduce the compiled artifacts. ## How was this patch tested? Manual. ``` $ rm LICENSE-binary $ dev/make-distribution.sh ``` Closes #22840 from dongjoon-hyun/SPARK-25840. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79f3babc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79f3babc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79f3babc Branch: refs/heads/master Commit: 79f3babcc6e189d7405464b9ac1eb1c017e51f5d Parents: dc9b320 Author: Dongjoon Hyun Authored: Thu Oct 25 20:26:13 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:26:13 2018 -0700 -- dev/make-distribution.sh | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79f3babc/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index 668682f..84f4ae9 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -212,9 +212,13 @@ mkdir -p "$DISTDIR/examples/src/main" cp -r "$SPARK_HOME/examples/src/main" "$DISTDIR/examples/src/" # Copy license and ASF files -cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" -cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" -cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +if [ -e "$SPARK_HOME/LICENSE-binary" ]; then + cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" + cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" + cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +else + echo "Skipping copying LICENSE files" +fi if [ -e "$SPARK_HOME/CHANGES.txt" ]; then cp "$SPARK_HOME/CHANGES.txt" "$DISTDIR" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary
Repository: spark Updated Branches: refs/heads/branch-2.4 39e108f16 -> b739fb0d7 [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary ## What changes were proposed in this pull request? We vote for the artifacts. All releases are in the form of the source materials needed to make changes to the software being released. (http://www.apache.org/legal/release-policy.html#artifacts) From Spark 2.4.0, the source and binary artifacts contain their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. However, unfortunately, `dev/make-distribution.sh` inside the source artifact starts to fail because it expects `LICENSE-binary`, while the source artifact has only the LICENSE file. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz `dev/make-distribution.sh` is used during the voting phase because we vote on the source artifact instead of the GitHub repository. Individual contributors usually don't have the downstream repository and try to build the voted source artifact to help verify it during the voting phase. (Personally, I did before.) This PR aims to make that script work in either case. It does not aim to make source artifacts reproduce the compiled artifacts. ## How was this patch tested? Manual. ``` $ rm LICENSE-binary $ dev/make-distribution.sh ``` Closes #22840 from dongjoon-hyun/SPARK-25840. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 79f3babcc6e189d7405464b9ac1eb1c017e51f5d) Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b739fb0d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b739fb0d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b739fb0d Branch: refs/heads/branch-2.4 Commit: b739fb0d783adad68e7197caaa931a83eb1725bd Parents: 39e108f Author: Dongjoon Hyun Authored: Thu Oct 25 20:26:13 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:26:26 2018 -0700 -- dev/make-distribution.sh | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b739fb0d/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index 668682f..84f4ae9 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -212,9 +212,13 @@ mkdir -p "$DISTDIR/examples/src/main" cp -r "$SPARK_HOME/examples/src/main" "$DISTDIR/examples/src/" # Copy license and ASF files -cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" -cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" -cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +if [ -e "$SPARK_HOME/LICENSE-binary" ]; then + cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" + cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" + cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +else + echo "Skipping copying LICENSE files" +fi if [ -e "$SPARK_HOME/CHANGES.txt" ]; then cp "$SPARK_HOME/CHANGES.txt" "$DISTDIR" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30415 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_20_03-72a23a6-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 03:17:26 2018 New Revision: 30415 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_20_03-72a23a6 docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0
Repository: spark Updated Branches: refs/heads/branch-2.4 a9f200e11 -> 39e108f16 [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0 ## What changes were proposed in this pull request? The following code in BisectingKMeansModel.load calls the wrong version of load. ``` case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV1_0.load(sc, path) ``` Closes #22790 from huaxingao/spark-25793. Authored-by: Huaxin Gao Signed-off-by: Wenchen Fan (cherry picked from commit dc9b320807881403ca9f1e2e6d01de4b52db3975) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39e108f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39e108f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39e108f1 Branch: refs/heads/branch-2.4 Commit: 39e108f168128abc5e0f369645b699a87ce6c91b Parents: a9f200e Author: Huaxin Gao Authored: Fri Oct 26 11:07:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 11:08:51 2018 +0800 -- .../apache/spark/mllib/clustering/BisectingKMeansModel.scala | 6 +++--- .../apache/spark/mllib/clustering/BisectingKMeansSuite.scala | 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/39e108f1/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala index 9d115af..4c5794f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala @@ -109,10 +109,10 @@ class BisectingKMeansModel private[clustering] ( @Since("2.0.0") override def save(sc: SparkContext, path: String): Unit = { -BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path) 
+BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path) } - override protected def formatVersion: String = "1.0" + override protected def formatVersion: String = "2.0" } @Since("2.0.0") @@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { val model = SaveLoadV1_0.load(sc, path) model case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => -val model = SaveLoadV1_0.load(sc, path) +val model = SaveLoadV2_0.load(sc, path) model case _ => throw new Exception( s"BisectingKMeansModel.load did not recognize model with (className, format version):" + http://git-wip-us.apache.org/repos/asf/spark/blob/39e108f1/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala index 35f7932..4a4d8b5 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala @@ -187,11 +187,12 @@ class BisectingKMeansSuite extends SparkFunSuite with MLlibTestSparkContext { val points = (1 until 8).map(i => Vectors.dense(i)) val data = sc.parallelize(points, 2) -val model = new BisectingKMeans().run(data) +val model = new BisectingKMeans().setDistanceMeasure(DistanceMeasure.COSINE).run(data) try { model.save(sc, path) val sameModel = BisectingKMeansModel.load(sc, path) assert(model.k === sameModel.k) + assert(model.distanceMeasure === sameModel.distanceMeasure) model.clusterCenters.zip(sameModel.clusterCenters).foreach(c => c._1 === c._2) } finally { Utils.deleteRecursively(tempDir) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
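The bug fixed here is a classic copy-paste hazard in versioned persistence: the case matching `(SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion)` invoked the V1 loader, silently dropping V2-only state such as `distanceMeasure`. A minimal Python sketch of the dispatch pattern (class names echo the Scala code; the loader bodies and registry are illustrative toys, not MLlib's implementation):

```python
class SaveLoadV1_0:
    this_format_version = "1.0"
    @staticmethod
    def load(path):
        return {"version": "1.0", "path": path}

class SaveLoadV2_0:
    this_format_version = "2.0"
    @staticmethod
    def load(path):
        # V2 restores a field that V1 never wrote -- analogous to
        # distanceMeasure in BisectingKMeansModel.
        return {"version": "2.0", "path": path, "distanceMeasure": "cosine"}

def load_model(path, format_version):
    # Dispatch on the stored format version; each branch must call the
    # MATCHING loader. The fixed bug was the "2.0" branch calling V1's load.
    loaders = {
        SaveLoadV1_0.this_format_version: SaveLoadV1_0.load,
        SaveLoadV2_0.this_format_version: SaveLoadV2_0.load,
    }
    try:
        return loaders[format_version](path)
    except KeyError:
        raise ValueError(f"unrecognized format version: {format_version}")

assert load_model("/tmp/model", "2.0")["distanceMeasure"] == "cosine"
```

The added round-trip assertion in `BisectingKMeansSuite` (`model.distanceMeasure === sameModel.distanceMeasure`) is what catches this class of bug: load through the wrong version and a V2-only field comes back missing or defaulted.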
spark git commit: [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0
Repository: spark Updated Branches: refs/heads/master 72a23a6c4 -> dc9b32080 [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0 ## What changes were proposed in this pull request? The following code in BisectingKMeansModel.load calls the wrong version of load. ``` case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV1_0.load(sc, path) ``` Closes #22790 from huaxingao/spark-25793. Authored-by: Huaxin Gao Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc9b3208 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc9b3208 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc9b3208 Branch: refs/heads/master Commit: dc9b320807881403ca9f1e2e6d01de4b52db3975 Parents: 72a23a6 Author: Huaxin Gao Authored: Fri Oct 26 11:07:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 11:07:55 2018 +0800 -- .../apache/spark/mllib/clustering/BisectingKMeansModel.scala | 6 +++--- .../apache/spark/mllib/clustering/BisectingKMeansSuite.scala | 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc9b3208/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala index 9d115af..4c5794f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala @@ -109,10 +109,10 @@ class BisectingKMeansModel private[clustering] ( @Since("2.0.0") override def save(sc: SparkContext, path: String): Unit = { -BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path) +BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path) } - override protected def formatVersion: String = "1.0" + 
override protected def formatVersion: String = "2.0" } @Since("2.0.0") @@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { val model = SaveLoadV1_0.load(sc, path) model case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => -val model = SaveLoadV1_0.load(sc, path) +val model = SaveLoadV2_0.load(sc, path) model case _ => throw new Exception( s"BisectingKMeansModel.load did not recognize model with (className, format version):" + http://git-wip-us.apache.org/repos/asf/spark/blob/dc9b3208/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala index 35f7932..4a4d8b5 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala @@ -187,11 +187,12 @@ class BisectingKMeansSuite extends SparkFunSuite with MLlibTestSparkContext { val points = (1 until 8).map(i => Vectors.dense(i)) val data = sc.parallelize(points, 2) -val model = new BisectingKMeans().run(data) +val model = new BisectingKMeans().setDistanceMeasure(DistanceMeasure.COSINE).run(data) try { model.save(sc, path) val sameModel = BisectingKMeansModel.load(sc, path) assert(model.k === sameModel.k) + assert(model.distanceMeasure === sameModel.distanceMeasure) model.clusterCenters.zip(sameModel.clusterCenters).foreach(c => c._1 === c._2) } finally { Utils.deleteRecursively(tempDir) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25772][SQL][FOLLOWUP] remove GetArrayFromMap
Repository: spark Updated Branches: refs/heads/master 46d2d2c74 -> 72a23a6c4 [SPARK-25772][SQL][FOLLOWUP] remove GetArrayFromMap ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/22745 we introduced the `GetArrayFromMap` expression. Later on I realized this is duplicated as we already have `MapKeys` and `MapValues`. This PR removes `GetArrayFromMap` ## How was this patch tested? existing tests Closes #22825 from cloud-fan/minor. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72a23a6c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72a23a6c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72a23a6c Branch: refs/heads/master Commit: 72a23a6c43fe1b5a6583ea6b35b4fbb08474abbe Parents: 46d2d2c Author: Wenchen Fan Authored: Fri Oct 26 10:19:35 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 10:19:35 2018 +0800 -- .../spark/sql/catalyst/JavaTypeInference.scala | 6 +- .../catalyst/expressions/objects/objects.scala | 75 2 files changed, 3 insertions(+), 78 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72a23a6c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala index f32e080..8ef8b2b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala @@ -26,7 +26,7 @@ import scala.language.existentials import com.google.common.reflect.TypeToken -import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedAttribute, UnresolvedExtractValue} +import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedExtractValue} 
import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.objects._ import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData} @@ -280,7 +280,7 @@ object JavaTypeInference { Invoke( UnresolvedMapObjects( p => deserializerFor(keyType, p), - GetKeyArrayFromMap(path)), + MapKeys(path)), "array", ObjectType(classOf[Array[Any]])) @@ -288,7 +288,7 @@ object JavaTypeInference { Invoke( UnresolvedMapObjects( p => deserializerFor(valueType, p), - GetValueArrayFromMap(path)), + MapValues(path)), "array", ObjectType(classOf[Array[Any]])) http://git-wip-us.apache.org/repos/asf/spark/blob/72a23a6c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index 5bfa485..b6f9b47 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -1788,78 +1788,3 @@ case class ValidateExternalType(child: Expression, expected: DataType) ev.copy(code = code, isNull = input.isNull) } } - -object GetKeyArrayFromMap { - - /** - * Construct an instance of GetArrayFromMap case class - * extracting a key array from a Map expression. - * - * @param child a Map expression to extract a key array from - */ - def apply(child: Expression): Expression = { -GetArrayFromMap( - child, - "keyArray", - _.keyArray(), - { case MapType(kt, _, _) => kt }) - } -} - -object GetValueArrayFromMap { - - /** - * Construct an instance of GetArrayFromMap case class - * extracting a value array from a Map expression. 
- * - * @param child a Map expression to extract a value array from - */ - def apply(child: Expression): Expression = { -GetArrayFromMap( - child, - "valueArray", - _.valueArray(), - { case MapType(_, vt, _) => vt }) - } -} - -/** - * Extracts a key/value array from a Map expression. - * - * @param child a Map expression to extract an array from - * @param functionName name of the function that is invoked to extract an array - * @param arrayGetter function extracting `ArrayData` from `MapData` - * @param elementTypeGetter function
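The cleanup above replaces two special-purpose wrappers (`GetKeyArrayFromMap`, `GetValueArrayFromMap`) with the pre-existing generic `MapKeys`/`MapValues` expressions. The same deduplication idea in a minimal Python sketch (a toy expression tree; the class names echo Catalyst's but the evaluation model is purely illustrative):

```python
class MapLiteral:
    """Leaf expression holding map data, standing in for a resolved map column."""
    def __init__(self, data):
        self.data = data
    def eval(self):
        return self.data

class MapKeys:
    """Generic expression: the key array of ANY child map expression."""
    def __init__(self, child):
        self.child = child
    def eval(self):
        return list(self.child.eval().keys())

class MapValues:
    """Generic expression: the value array of ANY child map expression."""
    def __init__(self, child):
        self.child = child
    def eval(self):
        return list(self.child.eval().values())

# Instead of bespoke GetKeyArrayFromMap/GetValueArrayFromMap wrappers, the
# deserializer composes the generic expressions directly, as in the diff
# (GetKeyArrayFromMap(path) -> MapKeys(path)).
m = MapLiteral({"a": 1, "b": 2})
assert MapKeys(m).eval() == ["a", "b"]
assert MapValues(m).eval() == [1, 2]
```

Removing the wrappers also removes their duplicated type-extraction logic, since `MapKeys`/`MapValues` already know their element types.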
svn commit: r30414 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_18_02-a9f200e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 01:16:58 2018 New Revision: 30414 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_18_02-a9f200e docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions
Repository: spark Updated Branches: refs/heads/branch-2.4 db121a2a1 -> a9f200e11 [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions ## What changes were proposed in this pull request? - Revert [SPARK-23935][SQL] Adding map_entries function: https://github.com/apache/spark/pull/21236 - Revert [SPARK-23937][SQL] Add map_filter SQL function: https://github.com/apache/spark/pull/21986 - Revert [SPARK-23940][SQL] Add transform_values SQL function: https://github.com/apache/spark/pull/22045 - Revert [SPARK-23939][SQL] Add transform_keys function: https://github.com/apache/spark/pull/22013 - Revert [SPARK-23938][SQL] Add map_zip_with function: https://github.com/apache/spark/pull/22017 - Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: https://github.com/apache/spark/pull/21434/ ## How was this patch tested? The existing tests. Closes #22827 from gatorsmile/revertMap2.4. Authored-by: gatorsmile Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9f200e1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9f200e1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9f200e1 Branch: refs/heads/branch-2.4 Commit: a9f200e11da3d26158f9f75e48756d47d61bfacb Parents: db121a2 Author: gatorsmile Authored: Fri Oct 26 07:38:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 07:38:55 2018 +0800 -- R/pkg/NAMESPACE | 1 - R/pkg/R/functions.R | 15 +- R/pkg/R/generics.R | 4 - R/pkg/tests/fulltests/test_sparkSQL.R | 7 +- python/pyspark/sql/functions.py | 20 - .../catalyst/analysis/FunctionRegistry.scala| 5 - .../sql/catalyst/analysis/TypeCoercion.scala| 25 -- .../expressions/collectionOperations.scala | 168 .../expressions/higherOrderFunctions.scala | 330 -- .../CollectionExpressionsSuite.scala| 24 -- .../expressions/HigherOrderFunctionsSuite.scala | 315 -- 
.../scala/org/apache/spark/sql/functions.scala | 7 - .../sql-tests/inputs/higher-order-functions.sql | 23 - .../inputs/typeCoercion/native/mapZipWith.sql | 78 .../results/higher-order-functions.sql.out | 66 +-- .../typeCoercion/native/mapZipWith.sql.out | 179 .../spark/sql/DataFrameFunctionsSuite.scala | 425 --- 17 files changed, 3 insertions(+), 1689 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 96ff389..d77c62a 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -313,7 +313,6 @@ exportMethods("%<=>%", "lower", "lpad", "ltrim", - "map_entries", "map_from_arrays", "map_keys", "map_values", http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 63bd427..1e70244 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -219,7 +219,7 @@ NULL #' head(select(tmp, sort_array(tmp$v1))) #' head(select(tmp, sort_array(tmp$v1, asc = FALSE))) #' tmp3 <- mutate(df, v3 = create_map(df$model, df$cyl)) -#' head(select(tmp3, map_entries(tmp3$v3), map_keys(tmp3$v3), map_values(tmp3$v3))) +#' head(select(tmp3, map_keys(tmp3$v3), map_values(tmp3$v3))) #' head(select(tmp3, element_at(tmp3$v3, "Valiant"))) #' tmp4 <- mutate(df, v4 = create_array(df$mpg, df$cyl), v5 = create_array(df$cyl, df$hp)) #' head(select(tmp4, concat(tmp4$v4, tmp4$v5), arrays_overlap(tmp4$v4, tmp4$v5))) @@ -3253,19 +3253,6 @@ setMethod("flatten", }) #' @details -#' \code{map_entries}: Returns an unordered array of all entries in the given map. -#' -#' @rdname column_collection_functions -#' @aliases map_entries map_entries,Column-method -#' @note map_entries since 2.4.0 -setMethod("map_entries", - signature(x = "Column"), - function(x) { -jc <- callJStatic("org.apache.spark.sql.functions", "map_entries", x@jc) -column(jc) - }) - -#' @details #' \code{map_from_arrays}: Creates a new map column. 
The array in the first column is used for #' keys. The array in the second column is used for values. All elements in the array for key #' should not be null. http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/R/generics.R -- diff
svn commit: r30413 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_16_03-46d2d2c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 23:17:26 2018 New Revision: 30413 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_16_03-46d2d2c docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30411 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_14_02-45ed76d-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 21:17:10 2018 New Revision: 30411 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_14_02-45ed76d docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options
Repository: spark Updated Branches: refs/heads/branch-2.4 1b075f26f -> db121a2a1 [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options ## What changes were proposed in this pull request? Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](https://github.com/apache/spark/pull/22622#discussion_r222911529), this PR aims to add more detailed information and examples. This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2. ## How was this patch tested? Manual. Closes #22839 from dongjoon-hyun/SPARK-25656-2.4. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db121a2a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db121a2a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db121a2a Branch: refs/heads/branch-2.4 Commit: db121a2a1fde96fe77eedff18706df5c8e2e731d Parents: 1b075f2 Author: Dongjoon Hyun Authored: Thu Oct 25 14:15:03 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 14:15:03 2018 -0700 -- docs/sql-data-sources-load-save-functions.md| 43 +++ .../examples/sql/JavaSQLDataSourceExample.java | 6 +++ examples/src/main/python/sql/datasource.py | 8 examples/src/main/r/RSparkSQLExample.R | 6 ++- examples/src/main/resources/users.orc | Bin 0 -> 547 bytes .../examples/sql/SQLDataSourceExample.scala | 6 +++ 6 files changed, 68 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/docs/sql-data-sources-load-save-functions.md -- diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md index e1dd0a3..a3191b2 100644 --- a/docs/sql-data-sources-load-save-functions.md +++ b/docs/sql-data-sources-load-save-functions.md @@ -82,6 +82,49 @@ To load a CSV file 
you can use: +The extra options are also used during write operation. +For example, you can control bloom filters and dictionary encodings for ORC data sources. +The following ORC example will create bloom filter on `favorite_color` and use dictionary encoding for `name` and `favorite_color`. +For Parquet, there exists `parquet.enable.dictionary`, too. +To find more detailed information about the extra ORC/Parquet options, +visit the official Apache ORC/Parquet websites. + + + + +{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_save_options_orc python/sql/datasource.py %} + + + +{% include_example manual_save_options_orc r/RSparkSQLExample.R %} + + + + +{% highlight sql %} +CREATE TABLE users_with_options ( + name STRING, + favorite_color STRING, + favorite_numbers array +) USING ORC +OPTIONS ( + orc.bloom.filter.columns 'favorite_color', + orc.dictionary.key.threshold '1.0' +) +{% endhighlight %} + + + + + ### Run SQL on files directly Instead of using read API to load a file into DataFrame and query it, you can also query that http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java index ef3c904..97e9ca3 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java @@ -123,6 +123,12 @@ public class JavaSQLDataSourceExample { .option("header", "true") .load("examples/src/main/resources/people.csv"); // $example off:manual_load_options_csv$ +// $example 
on:manual_save_options_orc$ +usersDF.write().format("orc") + .option("orc.bloom.filter.columns", "favorite_color") + .option("orc.dictionary.key.threshold", "1.0") + .save("users_with_options.orc"); +// $example off:manual_save_options_orc$ // $example on:direct_sql$ Dataset sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"); http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/examples/src/main/python/sql/datasource.py -- diff --git
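The doc change above describes pass-through options: anything Spark itself does not recognize (e.g. `orc.bloom.filter.columns`) is forwarded untouched to the underlying data source. A minimal Python sketch of that builder pattern (the `FrameWriter` class is a toy model of the API shape, not Spark's `DataFrameWriter`):

```python
class FrameWriter:
    """Toy DataFrameWriter-style builder: format()/option() accumulate
    state, and unrecognized options are passed through to the source."""
    def __init__(self):
        self._format = None
        self._options = {}

    def format(self, fmt):
        self._format = fmt
        return self  # chainable, like Spark's builder API

    def option(self, key, value):
        self._options[key] = str(value)  # option values travel as strings
        return self

    def save(self, path):
        # In Spark, this options dict reaches the data source implementation
        # as-is; here we just return the assembled write "plan".
        return {"format": self._format, "path": path,
                "options": dict(self._options)}

plan = (FrameWriter()
        .format("orc")
        .option("orc.bloom.filter.columns", "favorite_color")
        .option("orc.dictionary.key.threshold", "1.0")
        .save("users_with_options.orc"))
assert plan["options"]["orc.bloom.filter.columns"] == "favorite_color"
```

This is why the SQL form in the doc can list `orc.bloom.filter.columns` inside `OPTIONS (...)`: both surfaces feed the same key-value map to the ORC writer.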
spark git commit: [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs
Repository: spark Updated Branches: refs/heads/branch-2.4 45ed76d6f -> 1b075f26f [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs ## What changes were proposed in this pull request? `hsync` was added as part of SPARK-19531 to get the latest data in the history server UI, but it causes performance overhead and leads to dropping many history log events. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens along the data pipeline; it is a costly operation that adds overhead to the application and drops events. Getting the latest data in the history server can be done in a different way, with no impact on the application while it writes events: there is an API, `DFSInputStream.getFileLength()`, which gives the file length including the `lastBlockBeingWrittenLength` (different from `FileStatus.getLen()`). When the file status length and the previously cached length are equal, this API can be used to verify whether any new data has been written; if the data length has grown, the history server can update the in-progress history log. I also made this change configurable, with a default value of false; it can be enabled for the history server if users want to see updated data in the UI. ## How was this patch tested? Added a new test and verified manually. With the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last-block data being written and updates the Web UI with the latest data. Closes #22752 from devaraj-kavali/SPARK-24787. 
Authored-by: Devaraj K Signed-off-by: Marcelo Vanzin (cherry picked from commit 46d2d2c74d9aaf30e158aeda58a189f6c8e48b9c) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b075f26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b075f26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b075f26 Branch: refs/heads/branch-2.4 Commit: 1b075f26f2aa3db93e168c8b8bb5d67e96ffb490 Parents: 45ed76d Author: Devaraj K Authored: Thu Oct 25 13:16:08 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:16:18 2018 -0700 -- .../deploy/history/FsHistoryProvider.scala | 22 ++-- .../spark/scheduler/EventLoggingListener.scala | 8 + .../deploy/history/FsHistoryProviderSuite.scala | 37 ++-- 3 files changed, 56 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1b075f26/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index c23a659..c4517d3 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -34,7 +34,7 @@ import com.fasterxml.jackson.annotation.JsonIgnore import com.google.common.io.ByteStreams import com.google.common.util.concurrent.MoreExecutors import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} -import org.apache.hadoop.hdfs.DistributedFileSystem +import org.apache.hadoop.hdfs.{DFSInputStream, DistributedFileSystem} import org.apache.hadoop.hdfs.protocol.HdfsConstants import org.apache.hadoop.security.AccessControlException import org.fusesource.leveldbjni.internal.NativeDB @@ -449,7 +449,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) listing.write(info.copy(lastProcessed = 
newLastScanTime, fileSize = entry.getLen())) } -if (info.fileSize < entry.getLen()) { +if (shouldReloadLog(info, entry)) { if (info.appId.isDefined && fastInProgressParsing) { // When fast in-progress parsing is on, we don't need to re-parse when the // size changes, but we do need to invalidate any existing UIs. @@ -541,6 +541,24 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } } + private[history] def shouldReloadLog(info: LogInfo, entry: FileStatus): Boolean = { +var result = info.fileSize < entry.getLen +if (!result && info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) { + try { +result = Utils.tryWithResource(fs.open(entry.getPath)) { in => + in.getWrappedStream match { +case dfsIn: DFSInputStream => info.fileSize < dfsIn.getFileLength +case _ => false + } +} + } catch { +case e: Exception =>
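The core of the new `shouldReloadLog` is: reload when the reported file size grew, and for in-progress logs additionally consult the absolute (last-block-inclusive) length when the cheap `FileStatus` check says nothing changed. A minimal Python sketch of that decision (the `absolute_length` callback stands in for `DFSInputStream.getFileLength()`; names are illustrative, not Spark's API):

```python
def should_reload_log(cached_size, status_len, in_progress,
                      absolute_length=None):
    """Return True if the history server should re-read this event log.

    cached_size     -- size recorded at the last scan (info.fileSize)
    status_len      -- FileStatus.getLen(), which may lag for open files
    in_progress     -- whether the log path ends with .inprogress
    absolute_length -- callable standing in for DFSInputStream.getFileLength(),
                       which includes the last block being written
    """
    result = cached_size < status_len
    if not result and in_progress and absolute_length is not None:
        try:
            result = cached_size < absolute_length()
        except Exception:
            # Mirror the Scala code: a failure here is only logged,
            # and the log is treated as unchanged.
            result = False
    return result

# FileStatus says nothing changed, but the open last block has new bytes.
assert should_reload_log(100, 100, True, lambda: 150) is True
```

Note the ordering: the expensive open-and-check against the wrapped `DFSInputStream` only happens when the cheap length comparison is inconclusive and the log is still in progress, which keeps the common completed-log path fast.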
spark git commit: [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs
Repository: spark Updated Branches: refs/heads/master 9b98d9166 -> 46d2d2c74 [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs ## What changes were proposed in this pull request? `hsync` was added as part of SPARK-19531 to get the latest data in the history server UI, but it causes performance overhead and leads to dropping many history log events. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens along the data pipeline; it is a costly operation that adds overhead to the application and drops events. Getting the latest data in the history server can be done in a different way, with no impact on the application while it writes events: there is an API, `DFSInputStream.getFileLength()`, which gives the file length including the `lastBlockBeingWrittenLength` (different from `FileStatus.getLen()`). When the file status length and the previously cached length are equal, this API can be used to verify whether any new data has been written; if the data length has grown, the history server can update the in-progress history log. I also made this change configurable, with a default value of false; it can be enabled for the history server if users want to see updated data in the UI. ## How was this patch tested? Added a new test and verified manually. With the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last-block data being written and updates the Web UI with the latest data. Closes #22752 from devaraj-kavali/SPARK-24787. 
Authored-by: Devaraj K Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46d2d2c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46d2d2c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46d2d2c7 Branch: refs/heads/master Commit: 46d2d2c74d9aaf30e158aeda58a189f6c8e48b9c Parents: 9b98d91 Author: Devaraj K Authored: Thu Oct 25 13:16:08 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:16:08 2018 -0700 -- .../deploy/history/FsHistoryProvider.scala | 22 ++-- .../spark/scheduler/EventLoggingListener.scala | 8 + .../deploy/history/FsHistoryProviderSuite.scala | 37 ++-- 3 files changed, 56 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46d2d2c7/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index c23a659..c4517d3 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -34,7 +34,7 @@ import com.fasterxml.jackson.annotation.JsonIgnore import com.google.common.io.ByteStreams import com.google.common.util.concurrent.MoreExecutors import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} -import org.apache.hadoop.hdfs.DistributedFileSystem +import org.apache.hadoop.hdfs.{DFSInputStream, DistributedFileSystem} import org.apache.hadoop.hdfs.protocol.HdfsConstants import org.apache.hadoop.security.AccessControlException import org.fusesource.leveldbjni.internal.NativeDB @@ -449,7 +449,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen())) } -if (info.fileSize < entry.getLen()) { +if 
(info.fileSize < entry.getLen()) { +if (shouldReloadLog(info, entry)) { if (info.appId.isDefined && fastInProgressParsing) { // When fast in-progress parsing is on, we don't need to re-parse when the // size changes, but we do need to invalidate any existing UIs. @@ -541,6 +541,24 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } } + private[history] def shouldReloadLog(info: LogInfo, entry: FileStatus): Boolean = { +var result = info.fileSize < entry.getLen +if (!result && info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) { + try { +result = Utils.tryWithResource(fs.open(entry.getPath)) { in => + in.getWrappedStream match { +case dfsIn: DFSInputStream => info.fileSize < dfsIn.getFileLength +case _ => false + } +} + } catch { +case e: Exception => + logDebug(s"Failed to check the length for the file : ${info.logPath}", e) + } +} +
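The key observation in the diff above — `FileStatus.getLen()` can lag behind the true length of a file that is still being written, while `DFSInputStream.getFileLength` includes the last block being written — can be sketched as a standalone helper. This is a hedged illustration, not the provider code itself: it assumes a Hadoop/HDFS client on the classpath, and the helper name `hasNewData` is invented here.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DFSInputStream

// Returns true if the file at `path` has grown past `cachedLen`. For a
// file still being written, FileStatus.getLen() may not reflect the last
// (in-progress) block, so we open the file and ask the wrapped
// DFSInputStream for its length, which does include that block.
def hasNewData(fs: FileSystem, path: Path, cachedLen: Long): Boolean = {
  val in = fs.open(path)
  try {
    in.getWrappedStream match {
      case dfsIn: DFSInputStream => cachedLen < dfsIn.getFileLength
      case _ => false // not HDFS: no in-progress block length to consult
    }
  } finally {
    in.close()
  }
}
```

Because the check only opens the stream when the cached and reported lengths are equal, the application writing the log is never touched — which is what lets this patch revert the costly `hsync` call.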
spark git commit: [SPARK-25803][K8S] Fix docker-image-tool.sh -n option
Repository: spark Updated Branches: refs/heads/branch-2.4 a20660b6f -> 45ed76d6f [SPARK-25803][K8S] Fix docker-image-tool.sh -n option ## What changes were proposed in this pull request? docker-image-tool.sh uses getopts in which a colon signifies that an option takes an argument. Since -n does not take an argument it should not have a colon. ## How was this patch tested? Following the reproduction in [JIRA](https://issues.apache.org/jira/browse/SPARK-25803):- 0. Created a custom Dockerfile to use for the spark-r container image. In each of the steps below the path to this Dockerfile is passed with the '-R' option. (spark-r is used here simply as an example, the bug applies to all options) 1. Built container images without '-n'. The [result](https://gist.github.com/sel/59f0911bb1a6a485c2487cf7ca770f9d) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. 2. Built container images with '-n' to reproduce the issue The [result](https://gist.github.com/sel/e5cabb9f3bdad5d087349e7fbed75141) is that the '-R' option is ignored and the default container image for spark-r is built 3. Applied the patch and re-built container images with '-n' and did not reproduce the issue The [result](https://gist.github.com/sel/6af14b95012ba8ff267a4fce6e3bd3bf) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. Closes #22798 from sel/fix-docker-image-tool-nocache. 
Authored-by: Steve Signed-off-by: Marcelo Vanzin (cherry picked from commit 9b98d9166ee2c130ba38a09e8c0aa12e29676b76) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45ed76d6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45ed76d6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45ed76d6 Branch: refs/heads/branch-2.4 Commit: 45ed76d6f66d58c99c20d9e757aa2177d9707968 Parents: a20660b Author: Steve Authored: Thu Oct 25 13:00:59 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:01:10 2018 -0700 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45ed76d6/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 228494d..5e8eaff 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -145,7 +145,7 @@ PYDOCKERFILE= RDOCKERFILE= NOCACHEARG= BUILD_PARAMS= -while getopts f:p:R:mr:t:n:b: option +while getopts f:p:R:mr:t:nb: option do case "${option}" in - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
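The one-character fix above is easy to misread, so the difference between `n:` and `n` in a getopts option string can be demonstrated with a few lines of shell. This is an illustrative sketch — the `parse` helper and file name are invented for the demo; only the option-string behavior mirrors `docker-image-tool.sh`:

```shell
# In a getopts option string, a trailing colon means "this option takes an
# argument". With "n:", getopts consumes the NEXT word as -n's argument,
# silently swallowing whatever option followed -n; with plain "n", -n is a
# boolean flag.
parse() {
  optstring=$1; shift
  NOCACHEARG=; RDOCKERFILE=
  OPTIND=1
  while getopts "$optstring" option; do
    case "$option" in
      n) NOCACHEARG="--no-cache" ;;
      R) RDOCKERFILE=$OPTARG ;;
    esac
  done
}

# Buggy pattern ("n:R:"), as before the fix: -n eats "-R" as its argument,
# so the custom Dockerfile path is silently dropped.
parse "n:R:" -n -R Dockerfile.r
echo "buggy: NOCACHEARG=$NOCACHEARG RDOCKERFILE=$RDOCKERFILE"

# Fixed pattern ("nR:"): -n is a plain flag and -R keeps its argument.
parse "nR:" -n -R Dockerfile.r
echo "fixed: NOCACHEARG=$NOCACHEARG RDOCKERFILE=$RDOCKERFILE"
```

With `n:`, the word after `-n` (here `-R`) is consumed as an option argument — exactly why `-R` appeared to be ignored in the reproduction steps above.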
spark git commit: [SPARK-25803][K8S] Fix docker-image-tool.sh -n option
Repository: spark Updated Branches: refs/heads/master ccd07b736 -> 9b98d9166 [SPARK-25803][K8S] Fix docker-image-tool.sh -n option ## What changes were proposed in this pull request? docker-image-tool.sh uses getopts in which a colon signifies that an option takes an argument. Since -n does not take an argument it should not have a colon. ## How was this patch tested? Following the reproduction in [JIRA](https://issues.apache.org/jira/browse/SPARK-25803):- 0. Created a custom Dockerfile to use for the spark-r container image. In each of the steps below the path to this Dockerfile is passed with the '-R' option. (spark-r is used here simply as an example, the bug applies to all options) 1. Built container images without '-n'. The [result](https://gist.github.com/sel/59f0911bb1a6a485c2487cf7ca770f9d) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. 2. Built container images with '-n' to reproduce the issue The [result](https://gist.github.com/sel/e5cabb9f3bdad5d087349e7fbed75141) is that the '-R' option is ignored and the default container image for spark-r is built 3. Applied the patch and re-built container images with '-n' and did not reproduce the issue The [result](https://gist.github.com/sel/6af14b95012ba8ff267a4fce6e3bd3bf) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. Closes #22798 from sel/fix-docker-image-tool-nocache. 
Authored-by: Steve Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b98d916 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b98d916 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b98d916 Branch: refs/heads/master Commit: 9b98d9166ee2c130ba38a09e8c0aa12e29676b76 Parents: ccd07b7 Author: Steve Authored: Thu Oct 25 13:00:59 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:00:59 2018 -0700 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b98d916/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 7256355..61959ca 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -182,7 +182,7 @@ PYDOCKERFILE= RDOCKERFILE= NOCACHEARG= BUILD_PARAMS= -while getopts f:p:R:mr:t:n:b: option +while getopts f:p:R:mr:t:nb: option do case "${option}" in - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to…
Repository: spark Updated Branches: refs/heads/master 6540c2f8f -> ccd07b736 [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to… ## What changes were proposed in this pull request? Refactor ObjectHashAggregateExecBenchmark to use main method ## How was this patch tested? Manually tested: ``` bin/spark-submit --class org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark --jars sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar,core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT.jar --packages org.spark-project.hive:hive-exec:1.2.1.spark2 sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT-tests.jar ``` Generated results with: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark" ``` Closes #22804 from peter-toth/SPARK-25665. Lead-authored-by: Peter Toth Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccd07b73 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccd07b73 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccd07b73 Branch: refs/heads/master Commit: ccd07b736640c87ac6980a1c7c2d706ef3bab1bf Parents: 6540c2f Author: Peter Toth Authored: Thu Oct 25 12:42:31 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 12:42:31 2018 -0700 -- ...ObjectHashAggregateExecBenchmark-results.txt | 45 .../ObjectHashAggregateExecBenchmark.scala | 218 +-- 2 files changed, 152 insertions(+), 111 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ccd07b73/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt -- diff --git a/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt b/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt new file mode 100644 index 000..f3044da --- /dev/null +++ 
b/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt @@ -0,0 +1,45 @@ + +Hive UDAF vs Spark AF + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +hive udaf vs spark af: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +hive udaf w/o group by6370 / 6400 0.0 97193.6 1.0X +spark af w/o group by 54 / 63 1.2 820.8 118.4X +hive udaf w/ group by 4492 / 4507 0.0 68539.5 1.4X +spark af w/ group by w/o fallback 58 / 64 1.1 881.7 110.2X +spark af w/ group by w/ fallback 136 / 142 0.5 2075.0 46.8X + + + +ObjectHashAggregateExec vs SortAggregateExec - typed_count + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +object agg v.s. sort agg:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +sort agg w/ group by41500 / 41630 2.5 395.8 1.0X +object agg w/ group by w/o fallback 10075 / 10122 10.4 96.1 4.1X +object agg w/ group by w/ fallback 28131 / 28205 3.7 268.3 1.5X +sort agg w/o group by 6182 / 6221 17.0 59.0 6.7X +object agg w/o group by w/o fallback 5435 / 5468 19.3 51.8 7.6X + + + +ObjectHashAggregateExec vs SortAggregateExec - percentile_approx + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +object agg v.s. sort agg:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative
svn commit: r30410 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_12_02-6540c2f-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 19:17:13 2018 New Revision: 30410 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_12_02-6540c2f docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark-website git commit: Use Heilmeier Catechism for SPIP template.
Repository: spark-website Updated Branches: refs/heads/asf-site e4b87718d -> 005a2a0d1 Use Heilmeier Catechism for SPIP template. Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/005a2a0d Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/005a2a0d Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/005a2a0d Branch: refs/heads/asf-site Commit: 005a2a0d1d88c893518d98cddcb7d373a562b339 Parents: e4b8771 Author: Reynold Xin Authored: Wed Oct 24 11:51:43 2018 -0700 Committer: Reynold Xin Committed: Thu Oct 25 11:25:30 2018 -0700 -- improvement-proposals.md| 34 ++ site/improvement-proposals.html | 32 2 files changed, 42 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/005a2a0d/improvement-proposals.md -- diff --git a/improvement-proposals.md b/improvement-proposals.md index 8fab696..55d57d9 100644 --- a/improvement-proposals.md +++ b/improvement-proposals.md @@ -11,7 +11,7 @@ navigation: The purpose of an SPIP is to inform and involve the user community in major improvements to the Spark codebase throughout the development process, to increase the likelihood that user needs are met. -SPIPs should be used for significant user-facing or cross-cutting changes, not small incremental improvements. When in doubt, if a committer thinks a change needs an SPIP, it does. +SPIPs should be used for significant user-facing or cross-cutting changes, not small incremental improvements. When in doubt, if a committer thinks a change needs an SPIP, it does. What is a SPIP? @@ -48,30 +48,40 @@ Any community member can help by discussing whether an SPIP is SPIP Process Proposing an SPIP -Anyone may propose an SPIP, using the template below. Please only submit an SPIP if you are willing to help, at least with discussion. +Anyone may propose an SPIP, using the document template below. 
Please only submit an SPIP if you are willing to help, at least with discussion. After a SPIP is created, the author should email d...@spark.apache.org to notify the community of the SPIP, and discussions should ensue on the JIRA ticket. If an SPIP is too small or incremental and should have been done through the normal JIRA process, a committer should remove the SPIP label. -Template for an SPIP +SPIP Document Template - -Background and Motivation: What problem is this solving? +A SPIP document is a short document with a few questions, inspired by the Heilmeier Catechism: -Target Personas: Examples include data scientists, data engineers, library developers, devops. A single SPIP can have multiple target personas. +Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. -Goals: What must this allow users to do, that they can't currently? +Q2. What problem is this proposal NOT designed to solve? -Non-Goals: What problem is this proposal not designed to solve? +Q3. How is it done today, and what are the limits of current practice? -Proposed API Changes: Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. +Q4. What is new in your approach and why do you think it will be successful? + +Q5. Who cares? If you are successful, what difference will it make? + +Q6. What are the risks? + +Q7. How long will it take? + +Q8. What are the mid-term and final "exams" to check for success? + +Appendix A. Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. + +Appendix B. Optional Design Sketch: How are the goals going to be accomplished? Give sufficient technical detail to allow a contributor to judge whether it's likely to be feasible. Note that this is not a full design document. + +Appendix C. Optional Rejected Designs: What alternatives were considered? Why were they rejected? 
If no alternatives have been considered, the problem needs more thought. -Optional Design Sketch: How are the goals going to be accomplished? Give sufficient technical detail to allow a contributor to judge whether it's likely to be feasible. This is not a full design document. -Optional Rejected Designs: What alternatives were considered? Why were they rejected? If no alternatives have been considered, the problem needs more thought. - Discussing an SPIP http://git-wip-us.apache.org/repos/asf/spark-website/blob/005a2a0d/site/improvement-proposals.html -- diff --git a/site/improvement-proposals.html
svn commit: r30409 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_10_02-a20660b-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 17:17:26 2018 New Revision: 30409 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_10_02-a20660b docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30407 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_08_03-002f9c1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 15:17:40 2018 New Revision: 30407 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_08_03-002f9c1 docs [This commit notification would consist of 1473 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide
Repository: spark Updated Branches: refs/heads/branch-2.4 d5e694805 -> a20660b6f [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png Java: https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png Python: https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png R: https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png Closes #22675 from WeichenXu123/add_image_source_doc. Authored-by: WeichenXu Signed-off-by: Wenchen Fan (cherry picked from commit 6540c2f8f31bbde4df57e48698f46bb1815740ff) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a20660b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a20660b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a20660b6 Branch: refs/heads/branch-2.4 Commit: a20660b6fcee2d436ef182337255cea5e2eb7216 Parents: d5e6948 Author: WeichenXu Authored: Thu Oct 25 23:03:16 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 23:04:06 2018 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-datasource.md | 108 +++ .../spark/ml/source/image/ImageDataSource.scala | 17 +-- 3 files changed, 120 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index b5a6641..8e366f7 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,5 +1,7 @@ - text: Basic statistics url: ml-statistics.html +- text: Data sources + url: ml-datasource - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting 
features http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/ml-datasource.md -- diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md new file mode 100644 index 000..1508332 --- /dev/null +++ b/docs/ml-datasource.md @@ -0,0 +1,108 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via `ImageIO` in Java library. +The loaded DataFrame has one `StructType` column: "image", containing image data stored as image schema. +The schema of the `image` column is: + - origin: `StringType` (represents the file path of the image) + - height: `IntegerType` (height of the image) + - width: `IntegerType` (width of the image) + - nChannels: `IntegerType` (number of image channels) + - mode: `IntegerType` (OpenCV-compatible type) + - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. 
+ +{% highlight scala %} +scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens") +df: org.apache.spark.sql.DataFrame = [image: struct] + +scala> df.select("image.origin", "image.width", "image.height").show(truncate=false) ++---+-+--+ +|origin |width|height| ++---+-+--+ +|file:///spark/data/mllib/images/origin/kittens/54893.jpg |300 |311 | +|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg|199 |313 | +|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 | +|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg|300 |296 | ++---+-+--+ +{% endhighlight %} + + +
spark git commit: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide
Repository: spark Updated Branches: refs/heads/master 002f9c169 -> 6540c2f8f [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png Java: https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png Python: https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png R: https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png Closes #22675 from WeichenXu123/add_image_source_doc. Authored-by: WeichenXu Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6540c2f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6540c2f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6540c2f8 Branch: refs/heads/master Commit: 6540c2f8f31bbde4df57e48698f46bb1815740ff Parents: 002f9c1 Author: WeichenXu Authored: Thu Oct 25 23:03:16 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 23:03:16 2018 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-datasource.md | 108 +++ .../spark/ml/source/image/ImageDataSource.scala | 17 +-- 3 files changed, 120 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index b5a6641..8e366f7 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,5 +1,7 @@ - text: Basic statistics url: ml-statistics.html +- text: Data sources + url: ml-datasource - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/ml-datasource.md -- diff --git 
a/docs/ml-datasource.md b/docs/ml-datasource.md new file mode 100644 index 000..1508332 --- /dev/null +++ b/docs/ml-datasource.md @@ -0,0 +1,108 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via `ImageIO` in Java library. +The loaded DataFrame has one `StructType` column: "image", containing image data stored as image schema. +The schema of the `image` column is: + - origin: `StringType` (represents the file path of the image) + - height: `IntegerType` (height of the image) + - width: `IntegerType` (width of the image) + - nChannels: `IntegerType` (number of image channels) + - mode: `IntegerType` (OpenCV-compatible type) + - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. 
+ +{% highlight scala %} +scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens") +df: org.apache.spark.sql.DataFrame = [image: struct] + +scala> df.select("image.origin", "image.width", "image.height").show(truncate=false) ++---+-+--+ +|origin |width|height| ++---+-+--+ +|file:///spark/data/mllib/images/origin/kittens/54893.jpg |300 |311 | +|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg|199 |313 | +|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 | +|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg|300 |296 | ++---+-+--+ +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. + +{% highlight java %}
spark git commit: [SPARK-24794][CORE] Driver launched through rest should use all masters
Repository: spark Updated Branches: refs/heads/master 65c653fb4 -> 002f9c169 [SPARK-24794][CORE] Driver launched through rest should use all masters ## What changes were proposed in this pull request? In standalone cluster mode, one can launch a driver with supervise mode enabled. The StandaloneRestServer class uses the host and port of the current master as the spark.master property when launching the driver (even when running in HA mode), and it ignores the spark.master property passed as part of the request. Because of this, if the Spark masters switch for some reason and your driver is killed unexpectedly and relaunched, it will try to connect to the master specified in the driver command as -Dspark.master. That master will be in STANDBY mode, and after several attempts the SparkContext will kill itself (even though the secondary master was alive and healthy). This change picks up the spark.master property from the request and uses it to launch the driver process; as a result, the driver process has both masters in its -Dspark.master property. Even if the masters switch, the SparkContext can still connect to the ALIVE master and work correctly. ## How was this patch tested? This patch was manually tested on a standalone cluster running 2.2.1. It was rebased on current master and all tests were executed. I have added a unit test for this change (but since I am new I hope I have covered all). Closes #21816 from bsikander/rest_driver_fix. 
Authored-by: Behroz Sikander Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/002f9c16 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/002f9c16 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/002f9c16 Branch: refs/heads/master Commit: 002f9c169eb20b0d71b6d0296595f343c7f5bab2 Parents: 65c653f Author: Behroz Sikander Authored: Thu Oct 25 08:36:44 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 08:36:44 2018 -0500 -- .../deploy/rest/StandaloneRestServer.scala | 12 +++- .../deploy/rest/StandaloneRestSubmitSuite.scala | 20 2 files changed, 31 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/002f9c16/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala index 22b65ab..afa1a5f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala @@ -138,6 +138,16 @@ private[rest] class StandaloneSubmitRequestServlet( val driverExtraClassPath = sparkProperties.get("spark.driver.extraClassPath") val driverExtraLibraryPath = sparkProperties.get("spark.driver.extraLibraryPath") val superviseDriver = sparkProperties.get("spark.driver.supervise") +// The semantics of "spark.master" and the masterUrl are different. While the +// property "spark.master" could contain all registered masters, masterUrl +// contains only the active master. To make sure a Spark driver can recover +// in a multi-master setup, we use the "spark.master" property while submitting +// the driver. 
+val masters = sparkProperties.get("spark.master") +val (_, masterPort) = Utils.extractHostPortFromSparkUrl(masterUrl) +val masterRestPort = this.conf.getInt("spark.master.rest.port", 6066) +val updatedMasters = masters.map( + _.replace(s":$masterRestPort", s":$masterPort")).getOrElse(masterUrl) val appArgs = request.appArgs // Filter SPARK_LOCAL_(IP|HOSTNAME) environment variables from being set on the remote system. val environmentVariables = @@ -146,7 +156,7 @@ private[rest] class StandaloneSubmitRequestServlet( // Construct driver description val conf = new SparkConf(false) .setAll(sparkProperties) - .set("spark.master", masterUrl) + .set("spark.master", updatedMasters) val extraClassPath = driverExtraClassPath.toSeq.flatMap(_.split(File.pathSeparator)) val extraLibraryPath = driverExtraLibraryPath.toSeq.flatMap(_.split(File.pathSeparator)) val extraJavaOpts = driverExtraJavaOptions.map(Utils.splitCommandString).getOrElse(Seq.empty) http://git-wip-us.apache.org/repos/asf/spark/blob/002f9c16/core/src/test/scala/org/apache/spark/deploy/rest/StandaloneRestSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/rest/StandaloneRestSubmitSuite.scala
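The URL rewrite in the diff above can be illustrated in isolation. This is a hedged Scala sketch with invented example values (hosts and the multi-master list are made up for the demo); the ports used are Spark's defaults — 6066 for REST submission and 7077 for the master RPC endpoint:

```scala
// "spark.master" as submitted over REST may list every registered master
// using the REST port, but the driver must dial the RPC port that the
// active master (masterUrl) was contacted on. So each REST port in the
// list is rewritten to the RPC port, preserving ALL masters for failover.
val masterRestPort = 6066                      // spark.master.rest.port default
val masterPort     = 7077                      // extracted from the active masterUrl
val masterUrl      = s"spark://master1:$masterPort"
val masters: Option[String] = Some("spark://master1:6066,master2:6066")

val updatedMasters = masters
  .map(_.replace(s":$masterRestPort", s":$masterPort"))
  .getOrElse(masterUrl)
// updatedMasters == "spark://master1:7077,master2:7077"
```

Because both masters survive the rewrite, a supervised driver that is relaunched can still reach whichever master is currently ALIVE instead of retrying a STANDBY one.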
spark git commit: [BUILD] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 3123c7f48 -> 65c653fb4 [BUILD] Close stale PRs Closes #22567 Closes #18457 Closes #21517 Closes #21858 Closes #22383 Closes #19219 Closes #22401 Closes #22811 Closes #20405 Closes #21933 Closes #22819 from srowen/ClosePRs. Authored-by: Sean Owen Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65c653fb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65c653fb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65c653fb Branch: refs/heads/master Commit: 65c653fb455336948c5af2e0f381d1d8f5640874 Parents: 3123c7f Author: Sean Owen Authored: Thu Oct 25 08:35:27 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 08:35:27 2018 -0500 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25808][BUILD] Upgrade jsr305 version from 1.3.9 to 3.0.0
Repository: spark Updated Branches: refs/heads/master cb5ea201d -> 3123c7f48

[SPARK-25808][BUILD] Upgrade jsr305 version from 1.3.9 to 3.0.0

## What changes were proposed in this pull request?

We found the warnings below when building the Spark project:

```
[warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9
[warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0)
[warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
[warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
[warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
```

So we upgrade the jsr305 dependency from 1.3.9 to 3.0.0 to resolve the conflict and fix this warning.

## How was this patch tested?

sbt "core/testOnly"
sbt "sql/testOnly"

Closes #22803 from daviddingly/master.

Authored-by: xiaoding Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3123c7f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3123c7f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3123c7f4 Branch: refs/heads/master Commit: 3123c7f4881225e912842e9897b0e0e6bd0f4b20 Parents: cb5ea20 Author: xiaoding Authored: Thu Oct 25 07:06:17 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 07:06:17 2018 -0500

--
 dev/deps/spark-deps-hadoop-2.7 | 2 +-
 dev/deps/spark-deps-hadoop-3.1 | 2 +-
 pom.xml                        | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 06173f7..537831e 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -127,7 +127,7 @@ json4s-core_2.11-3.5.3.jar
 json4s-jackson_2.11-3.5.3.jar
 json4s-scalap_2.11-3.5.3.jar
 jsp-api-2.1.jar
-jsr305-1.3.9.jar
+jsr305-3.0.0.jar
 jta-1.1.jar
 jtransforms-2.4.0.jar
 jul-to-slf4j-1.7.16.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/dev/deps/spark-deps-hadoop-3.1
--
diff --git a/dev/deps/spark-deps-hadoop-3.1 b/dev/deps/spark-deps-hadoop-3.1
index 62fddf0..bc4ef31 100644
--- a/dev/deps/spark-deps-hadoop-3.1
+++ b/dev/deps/spark-deps-hadoop-3.1
@@ -128,7 +128,7 @@ json4s-core_2.11-3.5.3.jar
 json4s-jackson_2.11-3.5.3.jar
 json4s-scalap_2.11-3.5.3.jar
 jsp-api-2.1.jar
-jsr305-1.3.9.jar
+jsr305-3.0.0.jar
 jta-1.1.jar
 jtransforms-2.4.0.jar
 jul-to-slf4j-1.7.16.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/pom.xml
--
diff --git a/pom.xml b/pom.xml
index b1f0a53..92934c1 100644
--- a/pom.xml
+++ b/pom.xml
@@ -172,7 +172,7 @@
     2.22.2
     2.9.3
     3.5.2
-    1.3.9
+    3.0.0
     0.9.3
     4.7
     1.1
spark git commit: [SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag
Repository: spark Updated Branches: refs/heads/master ddd1b1e8a -> cb5ea201d

[SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag

## What changes were proposed in this pull request?

This was inspired while implementing #21732. Currently, `ScalaReflection` needs to consider how `ExpressionEncoder` uses generated serializers and deserializers, and `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems better to refactor `ExpressionEncoder`. It should also make SPARK-24762 easier to do.

To summarize the proposed changes:

1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from its Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row. In other words, they return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value. A serializer returned by `serializerForType` will serialize an object at `row(0)` to a corresponding Spark SQL representation, e.g. primitive type, array, map, struct. A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object of the given type.
3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type T and uses them to create the serializer and deserializer for T <-> row serialization. `ExpressionEncoder` no longer needs to remember whether a serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer.

## How was this patch tested?

Existing tests.

Closes #22749 from viirya/SPARK-24762-refactor.
Authored-by: Liang-Chi Hsieh Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb5ea201 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb5ea201 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb5ea201 Branch: refs/heads/master Commit: cb5ea201df5fae8aacb653ffb4147b9288bca1e9 Parents: ddd1b1e Author: Liang-Chi Hsieh Authored: Thu Oct 25 19:27:45 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 19:27:45 2018 +0800

--
 .../scala/org/apache/spark/sql/Encoders.scala   |   8 +-
 .../spark/sql/catalyst/JavaTypeInference.scala  |  78 ---
 .../spark/sql/catalyst/ScalaReflection.scala    | 182 -
 .../catalyst/encoders/ExpressionEncoder.scala   | 201 +++
 .../sql/catalyst/encoders/RowEncoder.scala      |  16 +-
 .../sql/catalyst/ScalaReflectionSuite.scala     |  70 ---
 .../encoders/ExpressionEncoderSuite.scala       |   6 +-
 .../sql/catalyst/encoders/RowEncoderSuite.scala |   2 +-
 .../scala/org/apache/spark/sql/Dataset.scala    |  10 +-
 .../spark/sql/KeyValueGroupedDataset.scala      |   2 +-
 .../aggregate/TypedAggregateExpression.scala    |  12 +-
 .../org/apache/spark/sql/DatasetSuite.scala     |   2 +-
 12 files changed, 304 insertions(+), 285 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cb5ea201/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
index b47ec0b..8a30c81 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
@@ -203,12 +203,10 @@ object Encoders {
     validatePublicClass[T]()

     ExpressionEncoder[T](
-      schema = new StructType().add("value", BinaryType),
-      flat = true,
-      serializer = Seq(
+      objSerializer =
         EncodeUsingSerializer(
-          BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo)),
-      deserializer =
+          BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo),
+      objDeserializer =
         DecodeUsingSerializer[T](
           Cast(GetColumnByOrdinal(0, BinaryType), BinaryType),
           classTag[T],

http://git-wip-us.apache.org/repos/asf/spark/blob/cb5ea201/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
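Stepping back from the diff, the shape of the refactoring can be pictured language-agnostically: an encoder is constructed from a single object-level serializer/deserializer pair that assumes the value sits at ordinal 0, and the row-level forms are derived from that pair instead of being tracked with a `flat` flag. A toy Python sketch of that design — the class and method names are illustrative, not Spark's actual API:

```python
class ToyEncoder:
    """Toy model of the refactored design: the encoder is built from one
    object-level (de)serializer pair; row-level forms are derived from it."""

    def __init__(self, obj_serializer, obj_deserializer):
        self.obj_serializer = obj_serializer      # object -> SQL value
        self.obj_deserializer = obj_deserializer  # SQL value -> object

    def to_row(self, obj):
        # Row-level serializer derived from the atomic one: the serialized
        # value is placed at ordinal 0 of the row.
        return [self.obj_serializer(obj)]

    def from_row(self, row):
        # Row-level deserializer likewise reads the value at ordinal 0.
        return self.obj_deserializer(row[0])

# A string <-> int "encoder": serialize parses to int, deserialize formats back.
enc = ToyEncoder(obj_serializer=int, obj_deserializer=str)
row = enc.to_row("42")       # [42]
value = enc.from_row(row)    # "42"
```

The point of the design is visible even in the toy: building a new encoder on top of an existing one only requires changing where in the row the atomic serializer/deserializer reads and writes, with no per-encoder flag to keep consistent.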
svn commit: r30404 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_00_03-ddd1b1e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 07:17:50 2018 New Revision: 30404 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_00_03-ddd1b1e docs [This commit notification would consist of 1473 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
spark git commit: [SPARK-24572][SPARKR] "eager execution" for R shell, IDE
Repository: spark Updated Branches: refs/heads/master 19ada15d1 -> ddd1b1e8a

[SPARK-24572][SPARKR] "eager execution" for R shell, IDE

## What changes were proposed in this pull request?

Check the `spark.sql.repl.eagerEval.enabled` configuration property in the SparkDataFrame `show()` method. If the `SparkSession` has eager execution enabled, the data will be returned to the R client when the data frame is created. So instead of seeing this

```
> df <- createDataFrame(faithful)
> df
SparkDataFrame[eruptions:double, waiting:double]
```

you will see

```
> df <- createDataFrame(faithful)
> df
+---------+-------+
|eruptions|waiting|
+---------+-------+
|      3.6|   79.0|
|      1.8|   54.0|
|    3.333|   74.0|
|    2.283|   62.0|
|    4.533|   85.0|
|    2.883|   55.0|
|      4.7|   88.0|
|      3.6|   85.0|
|     1.95|   51.0|
|     4.35|   85.0|
|    1.833|   54.0|
|    3.917|   84.0|
|      4.2|   78.0|
|     1.75|   47.0|
|      4.7|   83.0|
|    2.167|   52.0|
|     1.75|   62.0|
|      4.8|   84.0|
|      1.6|   52.0|
|     4.25|   79.0|
+---------+-------+
only showing top 20 rows
```

## How was this patch tested?

Manual tests as well as unit tests (one new test case is added).

Author: adrian555

Closes #22455 from adrian555/eager_execution.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ddd1b1e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ddd1b1e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ddd1b1e8 Branch: refs/heads/master Commit: ddd1b1e8aec023e61b186c494ccbc182db2eb3ca Parents: 19ada15 Author: adrian555 Authored: Wed Oct 24 23:42:06 2018 -0700 Committer: Felix Cheung Committed: Wed Oct 24 23:42:06 2018 -0700 -- R/pkg/R/DataFrame.R | 36 -- R/pkg/tests/fulltests/test_sparkSQL_eager.R | 72 docs/sparkr.md | 42 .../org/apache/spark/sql/internal/SQLConf.scala | 7 +- 4 files changed, 148 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ddd1b1e8/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 3469188..bf82d0c 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -226,7 +226,9 @@ setMethod("showDF", #' show #' -#' Print class and type information of a Spark object. +#' If eager evaluation is enabled and the Spark object is a SparkDataFrame, evaluate the +#' SparkDataFrame and print top rows of the SparkDataFrame, otherwise, print the class +#' and type information of the Spark object. #' #' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, WindowSpec. 
 #'
@@ -244,11 +246,33 @@ setMethod("showDF",
 #' @note show(SparkDataFrame) since 1.4.0
 setMethod("show", "SparkDataFrame",
           function(object) {
-            cols <- lapply(dtypes(object), function(l) {
-              paste(l, collapse = ":")
-            })
-            s <- paste(cols, collapse = ", ")
-            cat(paste(class(object), "[", s, "]\n", sep = ""))
+            allConf <- sparkR.conf()
+            prop <- allConf[["spark.sql.repl.eagerEval.enabled"]]
+            if (!is.null(prop) && identical(prop, "true")) {
+              argsList <- list()
+              argsList$x <- object
+              prop <- allConf[["spark.sql.repl.eagerEval.maxNumRows"]]
+              if (!is.null(prop)) {
+                numRows <- as.integer(prop)
+                if (numRows > 0) {
+                  argsList$numRows <- numRows
+                }
+              }
+              prop <- allConf[["spark.sql.repl.eagerEval.truncate"]]
+              if (!is.null(prop)) {
+                truncate <- as.integer(prop)
+                if (truncate > 0) {
+                  argsList$truncate <- truncate
+                }
+              }
+              do.call(showDF, argsList)
+            } else {
+              cols <- lapply(dtypes(object), function(l) {
+                paste(l, collapse = ":")
+              })
+              s <- paste(cols, collapse = ", ")
+              cat(paste(class(object), "[", s, "]\n", sep = ""))
+            }
           })

 #' DataTypes

http://git-wip-us.apache.org/repos/asf/spark/blob/ddd1b1e8/R/pkg/tests/fulltests/test_sparkSQL_eager.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL_eager.R b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
new file mode 100644
index 000..df7354f
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
@@ -0,0 +1,72 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+#
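The `show()` dispatch in the R diff boils down to: read the eager-eval properties from the session config, and only pass `numRows`/`truncate` along when they are set and positive. A hedged Python sketch of that argument-building logic — the config keys are the real Spark properties, but the function itself is illustrative:

```python
def build_show_args(conf):
    """Return showDF-style keyword arguments when eager evaluation is
    enabled, or None when the plain schema string should be printed."""
    if conf.get("spark.sql.repl.eagerEval.enabled") != "true":
        return None
    args = {}
    # Only forward maxNumRows / truncate when present and positive,
    # mirroring the if (!is.null(prop)) / if (x > 0) checks in the R code.
    max_rows = int(conf.get("spark.sql.repl.eagerEval.maxNumRows", 0))
    if max_rows > 0:
        args["numRows"] = max_rows
    truncate = int(conf.get("spark.sql.repl.eagerEval.truncate", 0))
    if truncate > 0:
        args["truncate"] = truncate
    return args

conf = {"spark.sql.repl.eagerEval.enabled": "true",
        "spark.sql.repl.eagerEval.maxNumRows": "20"}
print(build_show_args(conf))  # {'numRows': 20}
```

Note the design choice this mirrors: unset or non-positive values are simply omitted, so `showDF` falls back to its own defaults rather than receiving an explicit zero.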
spark git commit: [SPARK-24516][K8S] Change Python default to Python3
Repository: spark Updated Branches: refs/heads/master b2e325625 -> 19ada15d1 [SPARK-24516][K8S] Change Python default to Python3 ## What changes were proposed in this pull request? As this is targeted for 3.0.0 and Python2 will be deprecated by Jan 1st, 2020, I feel it is appropriate to change the default to Python3. Especially as these projects [found here](https://python3statement.org/) are deprecating their support. ## How was this patch tested? Unit and Integration tests Author: Ilan Filonenko Closes #22810 from ifilonenko/SPARK-24516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/19ada15d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/19ada15d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/19ada15d Branch: refs/heads/master Commit: 19ada15d1b15256de4e3bf2f4b17d87ea0d65cc3 Parents: b2e3256 Author: Ilan Filonenko Authored: Wed Oct 24 23:29:47 2018 -0700 Committer: Felix Cheung Committed: Wed Oct 24 23:29:47 2018 -0700 -- docs/running-on-kubernetes.md | 2 +- .../core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/19ada15d/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index d629ed3..60c9279 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -816,7 +816,7 @@ specific to Spark on Kubernetes. spark.kubernetes.pyspark.pythonVersion - "2" + "3" This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3. 
http://git-wip-us.apache.org/repos/asf/spark/blob/19ada15d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
--
diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
index c2ad80c..fff8fa4 100644
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
@@ -223,7 +223,7 @@ private[spark] object Config extends Logging {
     .stringConf
     .checkValue(pv => List("2", "3").contains(pv),
       "Ensure that major Python version is either Python2 or Python3")
-    .createWithDefault("2")
+    .createWithDefault("3")

   val KUBERNETES_KERBEROS_KRB5_FILE =
     ConfigBuilder("spark.kubernetes.kerberos.krb5.path")
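The `ConfigBuilder` change above pairs a validated domain ({"2", "3"}) with a new default of "3". A small self-contained sketch of that check-then-default pattern — pure Python, not Spark's `ConfigBuilder` API; the function name is illustrative:

```python
ALLOWED_PYTHON_VERSIONS = ("2", "3")

def resolve_python_version(conf):
    """Return spark.kubernetes.pyspark.pythonVersion, defaulting to "3"
    (the new default) and rejecting unsupported major versions."""
    value = conf.get("spark.kubernetes.pyspark.pythonVersion", "3")
    if value not in ALLOWED_PYTHON_VERSIONS:
        # Same error message as the checkValue call in the diff above.
        raise ValueError(
            "Ensure that major Python version is either Python2 or Python3")
    return value

default = resolve_python_version({})  # "3" under the new default
explicit = resolve_python_version(
    {"spark.kubernetes.pyspark.pythonVersion": "2"})  # "2" still allowed
```

Validation runs on the explicit value as well as the default, so a typo like "4" fails fast at configuration time instead of surfacing later inside the driver container.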