svn commit: r30417 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_22_02-eff1c50-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 05:19:04 2018 New Revision: 30417 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_22_02-eff1c50 docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50, so it was shortened to a summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608
Repository: spark Updated Branches: refs/heads/branch-2.4 eff1c5016 -> f37bceadf [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts https://github.com/apache/spark/commit/8acb51f08b448628b65e90af3b268994f9550e45 which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin Signed-off-by: Wenchen Fan (cherry picked from commit 89d748b33c8636a1b1411c505921b0a585e1e6cb) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f37bcead Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f37bcead Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f37bcead Branch: refs/heads/branch-2.4 Commit: f37bceadf2135348c006c3d37ab7d6101cfe2267 Parents: eff1c50 Author: Reynold Xin Authored: Fri Oct 26 13:17:24 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 13:17:50 2018 +0800 -- python/pyspark/sql/functions.py | 30 - python/pyspark/sql/window.py| 70 +--- .../apache/spark/sql/expressions/Window.scala | 46 + .../spark/sql/expressions/WindowSpec.scala | 45 + .../scala/org/apache/spark/sql/functions.scala | 12 ++-- 5 files changed, 28 insertions(+), 175 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f37bcead/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 785e55e..9485c28 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -855,36 +855,6 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) -@since(2.4) -def unboundedPreceding(): -""" -Window function: returns the special frame boundary that represents the first row -in the window 
partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedPreceding()) - - -@since(2.4) -def unboundedFollowing(): -""" -Window function: returns the special frame boundary that represents the last row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedFollowing()) - - -@since(2.4) -def currentRow(): -""" -Window function: returns the special frame boundary that represents the current row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.currentRow()) - - # -- Date/Timestamp functions -- @since(1.5) http://git-wip-us.apache.org/repos/asf/spark/blob/f37bcead/python/pyspark/sql/window.py -- diff --git a/python/pyspark/sql/window.py b/python/pyspark/sql/window.py index d19ced9..e76563d 100644 --- a/python/pyspark/sql/window.py +++ b/python/pyspark/sql/window.py @@ -16,11 +16,9 @@ # import sys -if sys.version >= '3': -long = int from pyspark import since, SparkContext -from pyspark.sql.column import Column, _to_seq, _to_java_column +from pyspark.sql.column import _to_seq, _to_java_column __all__ = ["Window", "WindowSpec"] @@ -126,45 +124,20 @@ class Window(object): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral +values directly. :param start: boundary start, inclusive. 
- The frame is unbounded if this is ``Window.unboundedPreceding``, - a column returned by ``pyspark.sql.functions.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, -a column returned by ``pyspark.sql.functions.unboundedFollowing``, or +
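For readers following the deprecation above: the recommended `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` constants are simply extreme integer sentinels, as the docstring's `max(-sys.maxsize, -9223372036854775808)` wording suggests. Below is a minimal pure-Python sketch of the rows-between frame semantics those sentinels express — all names are local stand-ins for illustration, not the PySpark API:

```python
import sys

# Hypothetical local stand-ins for the special boundary values (not the real
# Window attributes): anything <= -sys.maxsize means "unbounded preceding",
# anything >= sys.maxsize means "unbounded following", 0 is the current row.
UNBOUNDED_PRECEDING = -sys.maxsize - 1
UNBOUNDED_FOLLOWING = sys.maxsize
CURRENT_ROW = 0

def rows_between(values, start, end):
    """Toy rowsBetween: for each row i, sum the rows in [i + start, i + end]."""
    n = len(values)
    out = []
    for i in range(n):
        # Clamp the frame to the partition when a boundary is "unbounded".
        lo = 0 if start <= -sys.maxsize else max(0, i + start)
        hi = n - 1 if end >= sys.maxsize else min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]))
    return out

# Running sum from the start of the partition up to the current row:
print(rows_between([1, 2, 3, 4], UNBOUNDED_PRECEDING, CURRENT_ROW))  # [1, 3, 6, 10]
```

Using the class-level constants keeps the intent explicit instead of scattering magic numbers like `-9223372036854775808` through user code, which is the point of the recommendation in the patched docstring.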
svn commit: r30416 - in /dev/spark/2.3.3-SNAPSHOT-2018_10_25_22_02-0a05cf9-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 05:17:42 2018 New Revision: 30416 Log: Apache Spark 2.3.3-SNAPSHOT-2018_10_25_22_02-0a05cf9 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50, so it was shortened to a summary.]
spark git commit: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608
Repository: spark Updated Branches: refs/heads/master 86d469aea -> 89d748b33 [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts https://github.com/apache/spark/commit/8acb51f08b448628b65e90af3b268994f9550e45 which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/89d748b3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/89d748b3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/89d748b3 Branch: refs/heads/master Commit: 89d748b33c8636a1b1411c505921b0a585e1e6cb Parents: 86d469a Author: Reynold Xin Authored: Fri Oct 26 13:17:24 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 13:17:24 2018 +0800 -- python/pyspark/sql/functions.py | 30 - python/pyspark/sql/window.py| 70 +--- .../apache/spark/sql/expressions/Window.scala | 46 + .../spark/sql/expressions/WindowSpec.scala | 45 + .../scala/org/apache/spark/sql/functions.scala | 12 ++-- 5 files changed, 28 insertions(+), 175 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/89d748b3/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 8b2e423..739496b 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -858,36 +858,6 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) -@since(2.4) -def unboundedPreceding(): -""" -Window function: returns the special frame boundary that represents the first row -in the window partition. 
-""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedPreceding()) - - -@since(2.4) -def unboundedFollowing(): -""" -Window function: returns the special frame boundary that represents the last row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.unboundedFollowing()) - - -@since(2.4) -def currentRow(): -""" -Window function: returns the special frame boundary that represents the current row -in the window partition. -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.currentRow()) - - # -- Date/Timestamp functions -- @since(1.5) http://git-wip-us.apache.org/repos/asf/spark/blob/89d748b3/python/pyspark/sql/window.py -- diff --git a/python/pyspark/sql/window.py b/python/pyspark/sql/window.py index d19ced9..e76563d 100644 --- a/python/pyspark/sql/window.py +++ b/python/pyspark/sql/window.py @@ -16,11 +16,9 @@ # import sys -if sys.version >= '3': -long = int from pyspark import since, SparkContext -from pyspark.sql.column import Column, _to_seq, _to_java_column +from pyspark.sql.column import _to_seq, _to_java_column __all__ = ["Window", "WindowSpec"] @@ -126,45 +124,20 @@ class Window(object): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral +values directly. :param start: boundary start, inclusive. 
- The frame is unbounded if this is ``Window.unboundedPreceding``, - a column returned by ``pyspark.sql.functions.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, -a column returned by ``pyspark.sql.functions.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, or any value
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/branch-2.3 8fbf3ee91 -> 0a05cf917 [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN (cherry picked from commit 86d469aeaa492c0642db09b27bb0879ead5d7166) Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a05cf91 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a05cf91 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a05cf91 Branch: refs/heads/branch-2.3 Commit: 0a05cf917a2f701def38968e59217a98b6da8d8b Parents: 8fbf3ee Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:54:55 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0a05cf91/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 754a654..f7d1461 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -84,15 +84,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("SPARK_REUSE_WORKER", "1") } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -109,7 +111,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -126,7 +128,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -272,7 +274,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -343,9 +345,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker &&
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/branch-2.4 adfd1057d -> eff1c5016 [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN (cherry picked from commit 86d469aeaa492c0642db09b27bb0879ead5d7166) Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eff1c501 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eff1c501 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eff1c501 Branch: refs/heads/branch-2.4 Commit: eff1c5016cc78ad3d26e5e83cd0d72c56c35cf0a Parents: adfd105 Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:54:14 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eff1c501/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 6e53a04..f73e95e 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -106,15 +106,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("PYSPARK_EXECUTOR_MEMORY_MB", memoryMb.get.toString) } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener[Unit] { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -131,7 +133,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -148,7 +150,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -392,7 +394,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -463,9 +465,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker &&
spark git commit: [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker
Repository: spark Updated Branches: refs/heads/master 24e8c27df -> 86d469aea [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while "handleEndOfDataSection" can still put the worker back into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue. A user also reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. 
Authored-by: Shixiong Zhu Signed-off-by: Takuya UESHIN Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86d469ae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86d469ae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86d469ae Branch: refs/heads/master Commit: 86d469aeaa492c0642db09b27bb0879ead5d7166 Parents: 24e8c27 Author: Shixiong Zhu Authored: Fri Oct 26 13:53:51 2018 +0900 Committer: Takuya UESHIN Committed: Fri Oct 26 13:53:51 2018 +0900 -- .../apache/spark/api/python/PythonRunner.scala | 21 ++-- .../execution/python/ArrowPythonRunner.scala| 4 ++-- .../sql/execution/python/PythonUDFRunner.scala | 4 ++-- 3 files changed, 15 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/86d469ae/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 6e53a04..f73e95e 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -106,15 +106,17 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( envVars.put("PYSPARK_EXECUTOR_MEMORY_MB", memoryMb.get.toString) } val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap) -// Whether is the worker released into idle pool -val released = new AtomicBoolean(false) +// Whether is the worker released into idle pool or closed. When any codes try to release or +// close a worker, they should use `releasedOrClosed.compareAndSet` to flip the state to make +// sure there is only one winner that is going to release or close the worker. 
+val releasedOrClosed = new AtomicBoolean(false) // Start a thread to feed the process input from our parent's iterator val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context) context.addTaskCompletionListener[Unit] { _ => writerThread.shutdownOnTaskCompletion() - if (!reuseWorker || !released.get) { + if (!reuseWorker || releasedOrClosed.compareAndSet(false, true)) { try { worker.close() } catch { @@ -131,7 +133,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize)) val stdoutIterator = newReaderIterator( - stream, writerThread, startTime, env, worker, released, context) + stream, writerThread, startTime, env, worker, releasedOrClosed, context) new InterruptibleIterator(context, stdoutIterator) } @@ -148,7 +150,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext): Iterator[OUT] /** @@ -392,7 +394,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( startTime: Long, env: SparkEnv, worker: Socket, - released: AtomicBoolean, + releasedOrClosed: AtomicBoolean, context: TaskContext) extends Iterator[OUT] { @@ -463,9 +465,8 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( } // Check whether the worker is ready to be re-used. if (stream.readInt() == SpecialLengths.END_OF_STREAM) { -if (reuseWorker) { +if (reuseWorker && releasedOrClosed.compareAndSet(false, true)) { env.releasePythonWorker(pythonExec,
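The fix above hinges on `AtomicBoolean.compareAndSet`: of the two code paths racing to dispose of the worker (the task-completion listener closing it, and `handleEndOfDataSection` releasing it back to the pool), only one can win the `false -> true` transition, so a closed worker can never also be returned for reuse. Here is a minimal pure-Python sketch of that pattern — a lock-based stand-in for the JVM class, for illustration only:

```python
import threading

class AtomicBoolean:
    """Minimal stand-in for java.util.concurrent.atomic.AtomicBoolean."""
    def __init__(self, value=False):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expect, update):
        # Atomically flip the flag only if it still holds `expect`;
        # exactly one caller can win the False -> True transition.
        with self._lock:
            if self._value == expect:
                self._value = update
                return True
            return False

released_or_closed = AtomicBoolean(False)
winners = []

def release_or_close(actor):
    # Both the task-completion listener and handleEndOfDataSection race here;
    # only the winner actually closes or releases the worker.
    if released_or_closed.compare_and_set(False, True):
        winners.append(actor)

threads = [threading.Thread(target=release_or_close, args=(name,))
           for name in ("completion_listener", "end_of_data_section")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(winners)  # exactly one element, whichever path won the race
```

The earlier `!released.get` check was a read followed by a separate close, which left a window between the read and the action; `compareAndSet` collapses the check and the state change into one atomic step.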
spark git commit: [SPARK-25819][SQL] Support parse mode option for the function `from_avro`
Repository: spark Updated Branches: refs/heads/master 79f3babcc -> 24e8c27df [SPARK-25819][SQL] Support parse mode option for the function `from_avro` ## What changes were proposed in this pull request? Currently, the function `from_avro` throws an exception when reading corrupt records. In practice, there could be various causes of data corruption. It would be good to support `PERMISSIVE` mode and allow the function from_avro to process all the input files/streams, which is consistent with from_json and from_csv. There is no obvious downside to supporting `PERMISSIVE` mode. Unlike `from_csv` and `from_json`, the default parse mode is `FAILFAST`, for the following reasons: 1. Since Avro is a structured data format, input data can usually be parsed with a certain schema. In that case, exposing the problems in the input data to users is better than hiding them. 2. For `PERMISSIVE` mode, we have to force the data schema to be fully nullable. This seems quite unnecessary for Avro. Preserving a non-null schema might enable more performance optimizations in Spark. 3. To be consistent with the behavior in Spark 2.4. ## How was this patch tested? Unit tests. Manual preview of the generated HTML for the Avro data source doc: ![image](https://user-images.githubusercontent.com/1097932/47510100-02558880-d8aa-11e8-9d57-a43daee4c6b9.png) Closes #22814 from gengliangwang/improve_from_avro. 
Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/24e8c27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/24e8c27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/24e8c27d Branch: refs/heads/master Commit: 24e8c27dfe31e6e0a53c89e6ddc36327e537931b Parents: 79f3bab Author: Gengliang Wang Authored: Fri Oct 26 11:39:38 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 26 11:39:38 2018 +0800 -- docs/sql-data-sources-avro.md | 18 +++- .../spark/sql/avro/AvroDataToCatalyst.scala | 90 +--- .../org/apache/spark/sql/avro/AvroOptions.scala | 16 +++- .../org/apache/spark/sql/avro/package.scala | 28 +- .../avro/AvroCatalystDataConversionSuite.scala | 58 +++-- .../spark/sql/avro/AvroFunctionsSuite.scala | 36 +++- 6 files changed, 219 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/docs/sql-data-sources-avro.md -- diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index d3b81f0..bfe641d 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -142,7 +142,10 @@ StreamingQuery query = output ## Data Source Option -Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`. +Data source options of Avro can be set via: + * the `.option` method on `DataFrameReader` or `DataFrameWriter`. + * the `options` parameter in function `from_avro`. + Property NameDefaultMeaningScope @@ -177,6 +180,19 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. write + +mode +FAILFAST +The mode option allows to specify parse mode for function from_avro. 
+ Currently supported modes are: + +FAILFAST: Throws an exception on processing corrupted record. +PERMISSIVE: Corrupt records are processed as null result. Therefore, the +data schema is forced to be fully nullable, which might be different from the one user provided. + + +function from_avro + ## Configuration http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..43d3f6e 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -17,20 +17,37 @@ package org.apache.spark.sql.avro +import scala.util.control.NonFatal + import org.apache.avro.Schema import org.apache.avro.generic.GenericDatumReader import org.apache.avro.io.{BinaryDecoder, DecoderFactory} -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import
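To make the two parse modes in the doc change above concrete, here is a toy pure-Python sketch of the FAILFAST/PERMISSIVE contract — `parse_record` and its integer "decoding" are hypothetical stand-ins for illustration, not the real `from_avro` implementation:

```python
def parse_record(raw, mode="FAILFAST"):
    """Toy parser illustrating the two parse modes: FAILFAST raises on a
    corrupt record, PERMISSIVE turns it into a null (None) result."""
    try:
        return int(raw)  # stand-in for Avro binary decoding
    except ValueError:
        if mode == "FAILFAST":
            raise ValueError(f"Malformed record: {raw!r}")
        if mode == "PERMISSIVE":
            return None  # corrupt records become null results
        raise ValueError(f"Unknown parse mode: {mode}")

# PERMISSIVE keeps the pipeline running and nulls out the bad record:
print([parse_record(r, mode="PERMISSIVE") for r in ["1", "oops", "3"]])  # [1, None, 3]
```

Note how PERMISSIVE forces every result to be nullable — exactly the trade-off cited in the commit message for defaulting to FAILFAST.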
spark git commit: [MINOR][TEST][BRANCH-2.4] Regenerate golden file `datetime.sql.out`
Repository: spark Updated Branches: refs/heads/branch-2.4 b739fb0d7 -> adfd1057d [MINOR][TEST][BRANCH-2.4] Regenerate golden file `datetime.sql.out` ## What changes were proposed in this pull request? `datetime.sql.out` is a generated golden file, but it was slightly broken during a manual [revert](https://github.com/dongjoon-hyun/spark/commit/5d744499667fcd08825bca0ac6d5d90d6e110ebc#diff-79dd276be45ede6f34e24ad7005b0a7cR87). This doesn't cause a test failure because the difference is only in comments and blank lines. We had better fix this minor issue before RC5. ## How was this patch tested? Passes Jenkins. Closes #22837 from dongjoon-hyun/fix_datetime_sql_out. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/adfd1057 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/adfd1057 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/adfd1057 Branch: refs/heads/branch-2.4 Commit: adfd1057dae3b48c05d7443e3aee23157965e6d1 Parents: b739fb0 Author: Dongjoon Hyun Authored: Thu Oct 25 20:37:07 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:37:07 2018 -0700 -- sql/core/src/test/resources/sql-tests/results/datetime.sql.out | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/adfd1057/sql/core/src/test/resources/sql-tests/results/datetime.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out index 4e1cfa6..63aa004 100644 --- a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out @@ -82,9 +82,10 @@ struct 1 2 2 3 + -- !query 9 select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15') --- !query 3 schema +-- !query 9 schema struct --- 
!query 3 output +-- !query 9 output 5 3 5 NULL4
spark git commit: [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary
Repository: spark Updated Branches: refs/heads/master dc9b32080 -> 79f3babcc [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary ## What changes were proposed in this pull request? We vote for the artifacts. All releases are in the form of the source materials needed to make changes to the software being released. (http://www.apache.org/legal/release-policy.html#artifacts) From Spark 2.4.0, the source artifact and binary artifact start to contain their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. However, unfortunately, `dev/make-distribution.sh` inside the source artifact starts to fail because it expects `LICENSE-binary`, while the source artifact has only the LICENSE file. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz `dev/make-distribution.sh` is used during the voting phase because we vote on that source artifact instead of the GitHub repository. Individual contributors usually don't have the downstream repository and try to build the voted-on source artifact to help verify it during the voting phase. (Personally, I have done so before.) This PR aims to make that script work in either case. It doesn't aim to make source artifacts reproduce the compiled artifacts. ## How was this patch tested? Manual. ``` $ rm LICENSE-binary $ dev/make-distribution.sh ``` Closes #22840 from dongjoon-hyun/SPARK-25840. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79f3babc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79f3babc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79f3babc Branch: refs/heads/master Commit: 79f3babcc6e189d7405464b9ac1eb1c017e51f5d Parents: dc9b320 Author: Dongjoon Hyun Authored: Thu Oct 25 20:26:13 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:26:13 2018 -0700 -- dev/make-distribution.sh | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79f3babc/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index 668682f..84f4ae9 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -212,9 +212,13 @@ mkdir -p "$DISTDIR/examples/src/main" cp -r "$SPARK_HOME/examples/src/main" "$DISTDIR/examples/src/" # Copy license and ASF files -cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" -cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" -cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +if [ -e "$SPARK_HOME/LICENSE-binary" ]; then + cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" + cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" + cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +else + echo "Skipping copying LICENSE files" +fi if [ -e "$SPARK_HOME/CHANGES.txt" ]; then cp "$SPARK_HOME/CHANGES.txt" "$DISTDIR" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary
Repository: spark Updated Branches: refs/heads/branch-2.4 39e108f16 -> b739fb0d7 [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary ## What changes were proposed in this pull request? We vote for the artifacts. All releases are in the form of the source materials needed to make changes to the software being released. (http://www.apache.org/legal/release-policy.html#artifacts) From Spark 2.4.0, the source and binary artifacts contain their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. However, unfortunately, `dev/make-distribution.sh` inside the source artifact starts to fail because it expects `LICENSE-binary`, while the source artifact has only the LICENSE file. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz `dev/make-distribution.sh` is used during the voting phase because we vote on the source artifact instead of the GitHub repository. Individual contributors usually don't have the downstream repository and try to build the voted source artifact to help verify it during the voting phase. (Personally, I did before.) This PR aims to make that script work in either case. It does not aim to make source artifacts reproduce the compiled artifacts. ## How was this patch tested? Manual. ``` $ rm LICENSE-binary $ dev/make-distribution.sh ``` Closes #22840 from dongjoon-hyun/SPARK-25840. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 79f3babcc6e189d7405464b9ac1eb1c017e51f5d) Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b739fb0d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b739fb0d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b739fb0d Branch: refs/heads/branch-2.4 Commit: b739fb0d783adad68e7197caaa931a83eb1725bd Parents: 39e108f Author: Dongjoon Hyun Authored: Thu Oct 25 20:26:13 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 20:26:26 2018 -0700 -- dev/make-distribution.sh | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b739fb0d/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index 668682f..84f4ae9 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -212,9 +212,13 @@ mkdir -p "$DISTDIR/examples/src/main" cp -r "$SPARK_HOME/examples/src/main" "$DISTDIR/examples/src/" # Copy license and ASF files -cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" -cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" -cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +if [ -e "$SPARK_HOME/LICENSE-binary" ]; then + cp "$SPARK_HOME/LICENSE-binary" "$DISTDIR/LICENSE" + cp -r "$SPARK_HOME/licenses-binary" "$DISTDIR/licenses" + cp "$SPARK_HOME/NOTICE-binary" "$DISTDIR/NOTICE" +else + echo "Skipping copying LICENSE files" +fi if [ -e "$SPARK_HOME/CHANGES.txt" ]; then cp "$SPARK_HOME/CHANGES.txt" "$DISTDIR" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30415 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_20_03-72a23a6-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 03:17:26 2018 New Revision: 30415 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_20_03-72a23a6 docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0
Repository: spark Updated Branches: refs/heads/branch-2.4 a9f200e11 -> 39e108f16 [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0 ## What changes were proposed in this pull request? The following code in BisectingKMeansModel.load calls the wrong version of load. ``` case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV1_0.load(sc, path) ``` Closes #22790 from huaxingao/spark-25793. Authored-by: Huaxin Gao Signed-off-by: Wenchen Fan (cherry picked from commit dc9b320807881403ca9f1e2e6d01de4b52db3975) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39e108f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39e108f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39e108f1 Branch: refs/heads/branch-2.4 Commit: 39e108f168128abc5e0f369645b699a87ce6c91b Parents: a9f200e Author: Huaxin Gao Authored: Fri Oct 26 11:07:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 11:08:51 2018 +0800 -- .../apache/spark/mllib/clustering/BisectingKMeansModel.scala | 6 +++--- .../apache/spark/mllib/clustering/BisectingKMeansSuite.scala | 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/39e108f1/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala index 9d115af..4c5794f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala @@ -109,10 +109,10 @@ class BisectingKMeansModel private[clustering] ( @Since("2.0.0") override def save(sc: SparkContext, path: String): Unit = { -BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path) 
+BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path) } - override protected def formatVersion: String = "1.0" + override protected def formatVersion: String = "2.0" } @Since("2.0.0") @@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { val model = SaveLoadV1_0.load(sc, path) model case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => -val model = SaveLoadV1_0.load(sc, path) +val model = SaveLoadV2_0.load(sc, path) model case _ => throw new Exception( s"BisectingKMeansModel.load did not recognize model with (className, format version):" + http://git-wip-us.apache.org/repos/asf/spark/blob/39e108f1/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala index 35f7932..4a4d8b5 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala @@ -187,11 +187,12 @@ class BisectingKMeansSuite extends SparkFunSuite with MLlibTestSparkContext { val points = (1 until 8).map(i => Vectors.dense(i)) val data = sc.parallelize(points, 2) -val model = new BisectingKMeans().run(data) +val model = new BisectingKMeans().setDistanceMeasure(DistanceMeasure.COSINE).run(data) try { model.save(sc, path) val sameModel = BisectingKMeansModel.load(sc, path) assert(model.k === sameModel.k) + assert(model.distanceMeasure === sameModel.distanceMeasure) model.clusterCenters.zip(sameModel.clusterCenters).foreach(c => c._1 === c._2) } finally { Utils.deleteRecursively(tempDir) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
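The bug fixed here is a classic copy-paste hazard in versioned persistence: the case matching `(SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion)` invoked the V1 loader, silently dropping V2-only state such as `distanceMeasure`. A minimal Python sketch of the dispatch pattern (class names echo the Scala code; the loader bodies and registry are illustrative toys, not MLlib's implementation):

```python
class SaveLoadV1_0:
    this_format_version = "1.0"
    @staticmethod
    def load(path):
        return {"version": "1.0", "path": path}

class SaveLoadV2_0:
    this_format_version = "2.0"
    @staticmethod
    def load(path):
        # V2 restores a field that V1 never wrote -- analogous to
        # distanceMeasure in BisectingKMeansModel.
        return {"version": "2.0", "path": path, "distanceMeasure": "cosine"}

def load_model(path, format_version):
    # Dispatch on the stored format version; each branch must call the
    # MATCHING loader. The fixed bug was the "2.0" branch calling V1's load.
    loaders = {
        SaveLoadV1_0.this_format_version: SaveLoadV1_0.load,
        SaveLoadV2_0.this_format_version: SaveLoadV2_0.load,
    }
    try:
        return loaders[format_version](path)
    except KeyError:
        raise ValueError(f"unrecognized format version: {format_version}")

assert load_model("/tmp/model", "2.0")["distanceMeasure"] == "cosine"
```

The added round-trip assertion in `BisectingKMeansSuite` (`model.distanceMeasure === sameModel.distanceMeasure`) is what catches this class of bug: load through the wrong version and a V2-only field comes back missing or defaulted.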
spark git commit: [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0
Repository: spark Updated Branches: refs/heads/master 72a23a6c4 -> dc9b32080 [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0 ## What changes were proposed in this pull request? The following code in BisectingKMeansModel.load calls the wrong version of load. ``` case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV1_0.load(sc, path) ``` Closes #22790 from huaxingao/spark-25793. Authored-by: Huaxin Gao Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc9b3208 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc9b3208 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc9b3208 Branch: refs/heads/master Commit: dc9b320807881403ca9f1e2e6d01de4b52db3975 Parents: 72a23a6 Author: Huaxin Gao Authored: Fri Oct 26 11:07:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 11:07:55 2018 +0800 -- .../apache/spark/mllib/clustering/BisectingKMeansModel.scala | 6 +++--- .../apache/spark/mllib/clustering/BisectingKMeansSuite.scala | 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc9b3208/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala index 9d115af..4c5794f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala @@ -109,10 +109,10 @@ class BisectingKMeansModel private[clustering] ( @Since("2.0.0") override def save(sc: SparkContext, path: String): Unit = { -BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path) +BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path) } - override protected def formatVersion: String = "1.0" + 
override protected def formatVersion: String = "2.0" } @Since("2.0.0") @@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { val model = SaveLoadV1_0.load(sc, path) model case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => -val model = SaveLoadV1_0.load(sc, path) +val model = SaveLoadV2_0.load(sc, path) model case _ => throw new Exception( s"BisectingKMeansModel.load did not recognize model with (className, format version):" + http://git-wip-us.apache.org/repos/asf/spark/blob/dc9b3208/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala index 35f7932..4a4d8b5 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/BisectingKMeansSuite.scala @@ -187,11 +187,12 @@ class BisectingKMeansSuite extends SparkFunSuite with MLlibTestSparkContext { val points = (1 until 8).map(i => Vectors.dense(i)) val data = sc.parallelize(points, 2) -val model = new BisectingKMeans().run(data) +val model = new BisectingKMeans().setDistanceMeasure(DistanceMeasure.COSINE).run(data) try { model.save(sc, path) val sameModel = BisectingKMeansModel.load(sc, path) assert(model.k === sameModel.k) + assert(model.distanceMeasure === sameModel.distanceMeasure) model.clusterCenters.zip(sameModel.clusterCenters).foreach(c => c._1 === c._2) } finally { Utils.deleteRecursively(tempDir) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25772][SQL][FOLLOWUP] remove GetArrayFromMap
Repository: spark Updated Branches: refs/heads/master 46d2d2c74 -> 72a23a6c4 [SPARK-25772][SQL][FOLLOWUP] remove GetArrayFromMap ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/22745 we introduced the `GetArrayFromMap` expression. Later on I realized this is duplicated as we already have `MapKeys` and `MapValues`. This PR removes `GetArrayFromMap` ## How was this patch tested? existing tests Closes #22825 from cloud-fan/minor. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72a23a6c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72a23a6c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72a23a6c Branch: refs/heads/master Commit: 72a23a6c43fe1b5a6583ea6b35b4fbb08474abbe Parents: 46d2d2c Author: Wenchen Fan Authored: Fri Oct 26 10:19:35 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 10:19:35 2018 +0800 -- .../spark/sql/catalyst/JavaTypeInference.scala | 6 +- .../catalyst/expressions/objects/objects.scala | 75 2 files changed, 3 insertions(+), 78 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72a23a6c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala index f32e080..8ef8b2b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala @@ -26,7 +26,7 @@ import scala.language.existentials import com.google.common.reflect.TypeToken -import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedAttribute, UnresolvedExtractValue} +import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedExtractValue} 
import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.objects._ import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData} @@ -280,7 +280,7 @@ object JavaTypeInference { Invoke( UnresolvedMapObjects( p => deserializerFor(keyType, p), - GetKeyArrayFromMap(path)), + MapKeys(path)), "array", ObjectType(classOf[Array[Any]])) @@ -288,7 +288,7 @@ object JavaTypeInference { Invoke( UnresolvedMapObjects( p => deserializerFor(valueType, p), - GetValueArrayFromMap(path)), + MapValues(path)), "array", ObjectType(classOf[Array[Any]])) http://git-wip-us.apache.org/repos/asf/spark/blob/72a23a6c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index 5bfa485..b6f9b47 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -1788,78 +1788,3 @@ case class ValidateExternalType(child: Expression, expected: DataType) ev.copy(code = code, isNull = input.isNull) } } - -object GetKeyArrayFromMap { - - /** - * Construct an instance of GetArrayFromMap case class - * extracting a key array from a Map expression. - * - * @param child a Map expression to extract a key array from - */ - def apply(child: Expression): Expression = { -GetArrayFromMap( - child, - "keyArray", - _.keyArray(), - { case MapType(kt, _, _) => kt }) - } -} - -object GetValueArrayFromMap { - - /** - * Construct an instance of GetArrayFromMap case class - * extracting a value array from a Map expression. 
- * - * @param child a Map expression to extract a value array from - */ - def apply(child: Expression): Expression = { -GetArrayFromMap( - child, - "valueArray", - _.valueArray(), - { case MapType(_, vt, _) => vt }) - } -} - -/** - * Extracts a key/value array from a Map expression. - * - * @param child a Map expression to extract an array from - * @param functionName name of the function that is invoked to extract an array - * @param arrayGetter function extracting `ArrayData` from `MapData` - * @param elementTypeGetter function
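The cleanup above replaces two special-purpose wrappers (`GetKeyArrayFromMap`, `GetValueArrayFromMap`) with the pre-existing generic `MapKeys`/`MapValues` expressions. The same deduplication idea in a minimal Python sketch (a toy expression tree; the class names echo Catalyst's but the evaluation model is purely illustrative):

```python
class MapLiteral:
    """Leaf expression holding map data, standing in for a resolved map column."""
    def __init__(self, data):
        self.data = data
    def eval(self):
        return self.data

class MapKeys:
    """Generic expression: the key array of ANY child map expression."""
    def __init__(self, child):
        self.child = child
    def eval(self):
        return list(self.child.eval().keys())

class MapValues:
    """Generic expression: the value array of ANY child map expression."""
    def __init__(self, child):
        self.child = child
    def eval(self):
        return list(self.child.eval().values())

# Instead of bespoke GetKeyArrayFromMap/GetValueArrayFromMap wrappers, the
# deserializer composes the generic expressions directly, as in the diff
# (GetKeyArrayFromMap(path) -> MapKeys(path)).
m = MapLiteral({"a": 1, "b": 2})
assert MapKeys(m).eval() == ["a", "b"]
assert MapValues(m).eval() == [1, 2]
```

Removing the wrappers also removes their duplicated type-extraction logic, since `MapKeys`/`MapValues` already know their element types.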
svn commit: r30414 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_18_02-a9f200e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri Oct 26 01:16:58 2018 New Revision: 30414 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_18_02-a9f200e docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions
Repository: spark Updated Branches: refs/heads/branch-2.4 db121a2a1 -> a9f200e11 [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions ## What changes were proposed in this pull request? - Revert [SPARK-23935][SQL] Adding map_entries function: https://github.com/apache/spark/pull/21236 - Revert [SPARK-23937][SQL] Add map_filter SQL function: https://github.com/apache/spark/pull/21986 - Revert [SPARK-23940][SQL] Add transform_values SQL function: https://github.com/apache/spark/pull/22045 - Revert [SPARK-23939][SQL] Add transform_keys function: https://github.com/apache/spark/pull/22013 - Revert [SPARK-23938][SQL] Add map_zip_with function: https://github.com/apache/spark/pull/22017 - Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: https://github.com/apache/spark/pull/21434/ ## How was this patch tested? The existing tests. Closes #22827 from gatorsmile/revertMap2.4. Authored-by: gatorsmile Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9f200e1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9f200e1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9f200e1 Branch: refs/heads/branch-2.4 Commit: a9f200e11da3d26158f9f75e48756d47d61bfacb Parents: db121a2 Author: gatorsmile Authored: Fri Oct 26 07:38:55 2018 +0800 Committer: Wenchen Fan Committed: Fri Oct 26 07:38:55 2018 +0800 -- R/pkg/NAMESPACE | 1 - R/pkg/R/functions.R | 15 +- R/pkg/R/generics.R | 4 - R/pkg/tests/fulltests/test_sparkSQL.R | 7 +- python/pyspark/sql/functions.py | 20 - .../catalyst/analysis/FunctionRegistry.scala| 5 - .../sql/catalyst/analysis/TypeCoercion.scala| 25 -- .../expressions/collectionOperations.scala | 168 .../expressions/higherOrderFunctions.scala | 330 -- .../CollectionExpressionsSuite.scala| 24 -- .../expressions/HigherOrderFunctionsSuite.scala | 315 -- 
.../scala/org/apache/spark/sql/functions.scala | 7 - .../sql-tests/inputs/higher-order-functions.sql | 23 - .../inputs/typeCoercion/native/mapZipWith.sql | 78 .../results/higher-order-functions.sql.out | 66 +-- .../typeCoercion/native/mapZipWith.sql.out | 179 .../spark/sql/DataFrameFunctionsSuite.scala | 425 --- 17 files changed, 3 insertions(+), 1689 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 96ff389..d77c62a 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -313,7 +313,6 @@ exportMethods("%<=>%", "lower", "lpad", "ltrim", - "map_entries", "map_from_arrays", "map_keys", "map_values", http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 63bd427..1e70244 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -219,7 +219,7 @@ NULL #' head(select(tmp, sort_array(tmp$v1))) #' head(select(tmp, sort_array(tmp$v1, asc = FALSE))) #' tmp3 <- mutate(df, v3 = create_map(df$model, df$cyl)) -#' head(select(tmp3, map_entries(tmp3$v3), map_keys(tmp3$v3), map_values(tmp3$v3))) +#' head(select(tmp3, map_keys(tmp3$v3), map_values(tmp3$v3))) #' head(select(tmp3, element_at(tmp3$v3, "Valiant"))) #' tmp4 <- mutate(df, v4 = create_array(df$mpg, df$cyl), v5 = create_array(df$cyl, df$hp)) #' head(select(tmp4, concat(tmp4$v4, tmp4$v5), arrays_overlap(tmp4$v4, tmp4$v5))) @@ -3253,19 +3253,6 @@ setMethod("flatten", }) #' @details -#' \code{map_entries}: Returns an unordered array of all entries in the given map. -#' -#' @rdname column_collection_functions -#' @aliases map_entries map_entries,Column-method -#' @note map_entries since 2.4.0 -setMethod("map_entries", - signature(x = "Column"), - function(x) { -jc <- callJStatic("org.apache.spark.sql.functions", "map_entries", x@jc) -column(jc) - }) - -#' @details #' \code{map_from_arrays}: Creates a new map column. 
The array in the first column is used for #' keys. The array in the second column is used for values. All elements in the array for key #' should not be null. http://git-wip-us.apache.org/repos/asf/spark/blob/a9f200e1/R/pkg/R/generics.R -- diff
svn commit: r30413 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_16_03-46d2d2c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 23:17:26 2018 New Revision: 30413 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_16_03-46d2d2c docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30411 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_14_02-45ed76d-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 21:17:10 2018 New Revision: 30411 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_14_02-45ed76d docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options
Repository: spark Updated Branches: refs/heads/branch-2.4 1b075f26f -> db121a2a1 [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options ## What changes were proposed in this pull request? Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](https://github.com/apache/spark/pull/22622#discussion_r222911529), this PR aims to add more detailed information and examples. This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2. ## How was this patch tested? Manual. Closes #22839 from dongjoon-hyun/SPARK-25656-2.4. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db121a2a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db121a2a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db121a2a Branch: refs/heads/branch-2.4 Commit: db121a2a1fde96fe77eedff18706df5c8e2e731d Parents: 1b075f2 Author: Dongjoon Hyun Authored: Thu Oct 25 14:15:03 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 14:15:03 2018 -0700 -- docs/sql-data-sources-load-save-functions.md| 43 +++ .../examples/sql/JavaSQLDataSourceExample.java | 6 +++ examples/src/main/python/sql/datasource.py | 8 examples/src/main/r/RSparkSQLExample.R | 6 ++- examples/src/main/resources/users.orc | Bin 0 -> 547 bytes .../examples/sql/SQLDataSourceExample.scala | 6 +++ 6 files changed, 68 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/docs/sql-data-sources-load-save-functions.md -- diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md index e1dd0a3..a3191b2 100644 --- a/docs/sql-data-sources-load-save-functions.md +++ b/docs/sql-data-sources-load-save-functions.md @@ -82,6 +82,49 @@ To load a CSV file 
you can use: +The extra options are also used during write operation. +For example, you can control bloom filters and dictionary encodings for ORC data sources. +The following ORC example will create bloom filter on `favorite_color` and use dictionary encoding for `name` and `favorite_color`. +For Parquet, there exists `parquet.enable.dictionary`, too. +To find more detailed information about the extra ORC/Parquet options, +visit the official Apache ORC/Parquet websites. + + + + +{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_save_options_orc python/sql/datasource.py %} + + + +{% include_example manual_save_options_orc r/RSparkSQLExample.R %} + + + + +{% highlight sql %} +CREATE TABLE users_with_options ( + name STRING, + favorite_color STRING, + favorite_numbers array +) USING ORC +OPTIONS ( + orc.bloom.filter.columns 'favorite_color', + orc.dictionary.key.threshold '1.0' +) +{% endhighlight %} + + + + + ### Run SQL on files directly Instead of using read API to load a file into DataFrame and query it, you can also query that http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java index ef3c904..97e9ca3 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java @@ -123,6 +123,12 @@ public class JavaSQLDataSourceExample { .option("header", "true") .load("examples/src/main/resources/people.csv"); // $example off:manual_load_options_csv$ +// $example 
on:manual_save_options_orc$ +usersDF.write().format("orc") + .option("orc.bloom.filter.columns", "favorite_color") + .option("orc.dictionary.key.threshold", "1.0") + .save("users_with_options.orc"); +// $example off:manual_save_options_orc$ // $example on:direct_sql$ Dataset sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"); http://git-wip-us.apache.org/repos/asf/spark/blob/db121a2a/examples/src/main/python/sql/datasource.py -- diff --git
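The doc change above describes pass-through options: anything Spark itself does not recognize (e.g. `orc.bloom.filter.columns`) is forwarded untouched to the underlying data source. A minimal Python sketch of that builder pattern (the `FrameWriter` class is a toy model of the API shape, not Spark's `DataFrameWriter`):

```python
class FrameWriter:
    """Toy DataFrameWriter-style builder: format()/option() accumulate
    state, and unrecognized options are passed through to the source."""
    def __init__(self):
        self._format = None
        self._options = {}

    def format(self, fmt):
        self._format = fmt
        return self  # chainable, like Spark's builder API

    def option(self, key, value):
        self._options[key] = str(value)  # option values travel as strings
        return self

    def save(self, path):
        # In Spark, this options dict reaches the data source implementation
        # as-is; here we just return the assembled write "plan".
        return {"format": self._format, "path": path,
                "options": dict(self._options)}

plan = (FrameWriter()
        .format("orc")
        .option("orc.bloom.filter.columns", "favorite_color")
        .option("orc.dictionary.key.threshold", "1.0")
        .save("users_with_options.orc"))
assert plan["options"]["orc.bloom.filter.columns"] == "favorite_color"
```

This is why the SQL form in the doc can list `orc.bloom.filter.columns` inside `OPTIONS (...)`: both surfaces feed the same key-value map to the ORC writer.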
spark git commit: [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs
Repository: spark Updated Branches: refs/heads/branch-2.4 45ed76d6f -> 1b075f26f [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs ## What changes were proposed in this pull request? `hsync` was added as part of SPARK-19531 to get the latest data in the history server UI, but it causes performance overhead and leads to dropping many history log events. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens along the data pipeline; it is a costly operation that adds overhead to the application and drops events. Getting the latest data in the history server can be done in a different way, with no impact on the application while it writes events: there is an API, `DFSInputStream.getFileLength()`, which gives the file length including the `lastBlockBeingWrittenLength` (different from `FileStatus.getLen()`). When the file status length and the previously cached length are equal, this API can be used to verify whether any new data has been written; if the data length has grown, the history server can update the in-progress history log. I also made this change configurable, with a default value of false; it can be enabled for the history server if users want to see updated data in the UI. ## How was this patch tested? Added a new test and verified manually. With the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last-block data being written and updates the Web UI with the latest data. Closes #22752 from devaraj-kavali/SPARK-24787. 
Authored-by: Devaraj K Signed-off-by: Marcelo Vanzin (cherry picked from commit 46d2d2c74d9aaf30e158aeda58a189f6c8e48b9c) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b075f26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b075f26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b075f26 Branch: refs/heads/branch-2.4 Commit: 1b075f26f2aa3db93e168c8b8bb5d67e96ffb490 Parents: 45ed76d Author: Devaraj K Authored: Thu Oct 25 13:16:08 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:16:18 2018 -0700 -- .../deploy/history/FsHistoryProvider.scala | 22 ++-- .../spark/scheduler/EventLoggingListener.scala | 8 + .../deploy/history/FsHistoryProviderSuite.scala | 37 ++-- 3 files changed, 56 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1b075f26/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index c23a659..c4517d3 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -34,7 +34,7 @@ import com.fasterxml.jackson.annotation.JsonIgnore import com.google.common.io.ByteStreams import com.google.common.util.concurrent.MoreExecutors import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} -import org.apache.hadoop.hdfs.DistributedFileSystem +import org.apache.hadoop.hdfs.{DFSInputStream, DistributedFileSystem} import org.apache.hadoop.hdfs.protocol.HdfsConstants import org.apache.hadoop.security.AccessControlException import org.fusesource.leveldbjni.internal.NativeDB @@ -449,7 +449,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) listing.write(info.copy(lastProcessed = 
newLastScanTime, fileSize = entry.getLen())) } -if (info.fileSize < entry.getLen()) { +if (shouldReloadLog(info, entry)) { if (info.appId.isDefined && fastInProgressParsing) { // When fast in-progress parsing is on, we don't need to re-parse when the // size changes, but we do need to invalidate any existing UIs. @@ -541,6 +541,24 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } } + private[history] def shouldReloadLog(info: LogInfo, entry: FileStatus): Boolean = { +var result = info.fileSize < entry.getLen +if (!result && info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) { + try { +result = Utils.tryWithResource(fs.open(entry.getPath)) { in => + in.getWrappedStream match { +case dfsIn: DFSInputStream => info.fileSize < dfsIn.getFileLength +case _ => false + } +} + } catch { +case e: Exception =>
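The core of the new `shouldReloadLog` is: reload when the reported file size grew, and for in-progress logs additionally consult the absolute (last-block-inclusive) length when the cheap `FileStatus` check says nothing changed. A minimal Python sketch of that decision (the `absolute_length` callback stands in for `DFSInputStream.getFileLength()`; names are illustrative, not Spark's API):

```python
def should_reload_log(cached_size, status_len, in_progress,
                      absolute_length=None):
    """Return True if the history server should re-read this event log.

    cached_size     -- size recorded at the last scan (info.fileSize)
    status_len      -- FileStatus.getLen(), which may lag for open files
    in_progress     -- whether the log path ends with .inprogress
    absolute_length -- callable standing in for DFSInputStream.getFileLength(),
                       which includes the last block being written
    """
    result = cached_size < status_len
    if not result and in_progress and absolute_length is not None:
        try:
            result = cached_size < absolute_length()
        except Exception:
            # Mirror the Scala code: a failure here is only logged,
            # and the log is treated as unchanged.
            result = False
    return result

# FileStatus says nothing changed, but the open last block has new bytes.
assert should_reload_log(100, 100, True, lambda: 150) is True
```

Note the ordering: the expensive open-and-check against the wrapped `DFSInputStream` only happens when the cheap length comparison is inconclusive and the log is still in progress, which keeps the common completed-log path fast.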
spark git commit: [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs
Repository: spark Updated Branches: refs/heads/master 9b98d9166 -> 46d2d2c74 [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs ## What changes were proposed in this pull request? `hsync` was added as part of SPARK-19531 to get the latest data in the history server UI, but it causes performance overhead and leads to dropping many history log events. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens along the data pipeline; it is a costly operation that adds overhead to the application and drops events. Getting the latest data in the history server can be done in a different way, with no impact on the application while it writes events: there is an API, `DFSInputStream.getFileLength()`, which gives the file length including the `lastBlockBeingWrittenLength` (different from `FileStatus.getLen()`). When the file status length and the previously cached length are equal, this API can be used to verify whether any new data has been written; if the data length has grown, the history server can update the in-progress history log. I also made this change configurable, with a default value of false; it can be enabled for the history server if users want to see updated data in the UI. ## How was this patch tested? Added a new test and verified manually. With the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last-block data being written and updates the Web UI with the latest data. Closes #22752 from devaraj-kavali/SPARK-24787. 
Authored-by: Devaraj K Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46d2d2c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46d2d2c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46d2d2c7 Branch: refs/heads/master Commit: 46d2d2c74d9aaf30e158aeda58a189f6c8e48b9c Parents: 9b98d91 Author: Devaraj K Authored: Thu Oct 25 13:16:08 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:16:08 2018 -0700 -- .../deploy/history/FsHistoryProvider.scala | 22 ++-- .../spark/scheduler/EventLoggingListener.scala | 8 + .../deploy/history/FsHistoryProviderSuite.scala | 37 ++-- 3 files changed, 56 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46d2d2c7/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index c23a659..c4517d3 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -34,7 +34,7 @@ import com.fasterxml.jackson.annotation.JsonIgnore import com.google.common.io.ByteStreams import com.google.common.util.concurrent.MoreExecutors import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} -import org.apache.hadoop.hdfs.DistributedFileSystem +import org.apache.hadoop.hdfs.{DFSInputStream, DistributedFileSystem} import org.apache.hadoop.hdfs.protocol.HdfsConstants import org.apache.hadoop.security.AccessControlException import org.fusesource.leveldbjni.internal.NativeDB @@ -449,7 +449,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen())) } -if (info.fileSize < entry.getLen()) { +if 
(info.fileSize < entry.getLen()) { +if (shouldReloadLog(info, entry)) { if (info.appId.isDefined && fastInProgressParsing) { // When fast in-progress parsing is on, we don't need to re-parse when the // size changes, but we do need to invalidate any existing UIs. @@ -541,6 +541,24 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } } + private[history] def shouldReloadLog(info: LogInfo, entry: FileStatus): Boolean = { +var result = info.fileSize < entry.getLen +if (!result && info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) { + try { +result = Utils.tryWithResource(fs.open(entry.getPath)) { in => + in.getWrappedStream match { +case dfsIn: DFSInputStream => info.fileSize < dfsIn.getFileLength +case _ => false + } +} + } catch { +case e: Exception => + logDebug(s"Failed to check the length for the file : ${info.logPath}", e) + } +} +
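The key observation in the diff above — `FileStatus.getLen()` can lag behind the true length of a file that is still being written, while `DFSInputStream.getFileLength` includes the last block being written — can be sketched as a standalone helper. This is a hedged illustration, not the provider code itself: it assumes a Hadoop/HDFS client on the classpath, and the helper name `hasNewData` is invented here.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DFSInputStream

// Returns true if the file at `path` has grown past `cachedLen`. For a
// file still being written, FileStatus.getLen() may not reflect the last
// (in-progress) block, so we open the file and ask the wrapped
// DFSInputStream for its length, which does include that block.
def hasNewData(fs: FileSystem, path: Path, cachedLen: Long): Boolean = {
  val in = fs.open(path)
  try {
    in.getWrappedStream match {
      case dfsIn: DFSInputStream => cachedLen < dfsIn.getFileLength
      case _ => false // not HDFS: no in-progress block length to consult
    }
  } finally {
    in.close()
  }
}
```

Because the check only opens the stream when the cached and reported lengths are equal, the application writing the log is never touched — which is what lets this patch revert the costly `hsync` call.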
spark git commit: [SPARK-25803][K8S] Fix docker-image-tool.sh -n option
Repository: spark Updated Branches: refs/heads/branch-2.4 a20660b6f -> 45ed76d6f [SPARK-25803][K8S] Fix docker-image-tool.sh -n option ## What changes were proposed in this pull request? docker-image-tool.sh uses getopts in which a colon signifies that an option takes an argument. Since -n does not take an argument it should not have a colon. ## How was this patch tested? Following the reproduction in [JIRA](https://issues.apache.org/jira/browse/SPARK-25803):- 0. Created a custom Dockerfile to use for the spark-r container image. In each of the steps below the path to this Dockerfile is passed with the '-R' option. (spark-r is used here simply as an example, the bug applies to all options) 1. Built container images without '-n'. The [result](https://gist.github.com/sel/59f0911bb1a6a485c2487cf7ca770f9d) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. 2. Built container images with '-n' to reproduce the issue The [result](https://gist.github.com/sel/e5cabb9f3bdad5d087349e7fbed75141) is that the '-R' option is ignored and the default container image for spark-r is built 3. Applied the patch and re-built container images with '-n' and did not reproduce the issue The [result](https://gist.github.com/sel/6af14b95012ba8ff267a4fce6e3bd3bf) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. Closes #22798 from sel/fix-docker-image-tool-nocache. 
Authored-by: Steve Signed-off-by: Marcelo Vanzin (cherry picked from commit 9b98d9166ee2c130ba38a09e8c0aa12e29676b76) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45ed76d6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45ed76d6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45ed76d6 Branch: refs/heads/branch-2.4 Commit: 45ed76d6f66d58c99c20d9e757aa2177d9707968 Parents: a20660b Author: Steve Authored: Thu Oct 25 13:00:59 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:01:10 2018 -0700 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45ed76d6/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 228494d..5e8eaff 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -145,7 +145,7 @@ PYDOCKERFILE= RDOCKERFILE= NOCACHEARG= BUILD_PARAMS= -while getopts f:p:R:mr:t:n:b: option +while getopts f:p:R:mr:t:nb: option do case "${option}" in - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
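The one-character fix above is easy to misread, so the difference between `n:` and `n` in a getopts option string can be demonstrated with a few lines of shell. This is an illustrative sketch — the `parse` helper and file name are invented for the demo; only the option-string behavior mirrors `docker-image-tool.sh`:

```shell
# In a getopts option string, a trailing colon means "this option takes an
# argument". With "n:", getopts consumes the NEXT word as -n's argument,
# silently swallowing whatever option followed -n; with plain "n", -n is a
# boolean flag.
parse() {
  optstring=$1; shift
  NOCACHEARG=; RDOCKERFILE=
  OPTIND=1
  while getopts "$optstring" option; do
    case "$option" in
      n) NOCACHEARG="--no-cache" ;;
      R) RDOCKERFILE=$OPTARG ;;
    esac
  done
}

# Buggy pattern ("n:R:"), as before the fix: -n eats "-R" as its argument,
# so the custom Dockerfile path is silently dropped.
parse "n:R:" -n -R Dockerfile.r
echo "buggy: NOCACHEARG=$NOCACHEARG RDOCKERFILE=$RDOCKERFILE"

# Fixed pattern ("nR:"): -n is a plain flag and -R keeps its argument.
parse "nR:" -n -R Dockerfile.r
echo "fixed: NOCACHEARG=$NOCACHEARG RDOCKERFILE=$RDOCKERFILE"
```

With `n:`, the word after `-n` (here `-R`) is consumed as an option argument — exactly why `-R` appeared to be ignored in the reproduction steps above.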
spark git commit: [SPARK-25803][K8S] Fix docker-image-tool.sh -n option
Repository: spark Updated Branches: refs/heads/master ccd07b736 -> 9b98d9166 [SPARK-25803][K8S] Fix docker-image-tool.sh -n option ## What changes were proposed in this pull request? docker-image-tool.sh uses getopts in which a colon signifies that an option takes an argument. Since -n does not take an argument it should not have a colon. ## How was this patch tested? Following the reproduction in [JIRA](https://issues.apache.org/jira/browse/SPARK-25803):- 0. Created a custom Dockerfile to use for the spark-r container image. In each of the steps below the path to this Dockerfile is passed with the '-R' option. (spark-r is used here simply as an example, the bug applies to all options) 1. Built container images without '-n'. The [result](https://gist.github.com/sel/59f0911bb1a6a485c2487cf7ca770f9d) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. 2. Built container images with '-n' to reproduce the issue The [result](https://gist.github.com/sel/e5cabb9f3bdad5d087349e7fbed75141) is that the '-R' option is ignored and the default container image for spark-r is built 3. Applied the patch and re-built container images with '-n' and did not reproduce the issue The [result](https://gist.github.com/sel/6af14b95012ba8ff267a4fce6e3bd3bf) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. Closes #22798 from sel/fix-docker-image-tool-nocache. 
Authored-by: Steve Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b98d916 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b98d916 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b98d916 Branch: refs/heads/master Commit: 9b98d9166ee2c130ba38a09e8c0aa12e29676b76 Parents: ccd07b7 Author: Steve Authored: Thu Oct 25 13:00:59 2018 -0700 Committer: Marcelo Vanzin Committed: Thu Oct 25 13:00:59 2018 -0700 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b98d916/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 7256355..61959ca 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -182,7 +182,7 @@ PYDOCKERFILE= RDOCKERFILE= NOCACHEARG= BUILD_PARAMS= -while getopts f:p:R:mr:t:n:b: option +while getopts f:p:R:mr:t:nb: option do case "${option}" in - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to…
Repository: spark Updated Branches: refs/heads/master 6540c2f8f -> ccd07b736 [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to… ## What changes were proposed in this pull request? Refactor ObjectHashAggregateExecBenchmark to use main method ## How was this patch tested? Manually tested: ``` bin/spark-submit --class org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark --jars sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar,core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT.jar --packages org.spark-project.hive:hive-exec:1.2.1.spark2 sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT-tests.jar ``` Generated results with: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark" ``` Closes #22804 from peter-toth/SPARK-25665. Lead-authored-by: Peter Toth Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccd07b73 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccd07b73 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccd07b73 Branch: refs/heads/master Commit: ccd07b736640c87ac6980a1c7c2d706ef3bab1bf Parents: 6540c2f Author: Peter Toth Authored: Thu Oct 25 12:42:31 2018 -0700 Committer: Dongjoon Hyun Committed: Thu Oct 25 12:42:31 2018 -0700 -- ...ObjectHashAggregateExecBenchmark-results.txt | 45 .../ObjectHashAggregateExecBenchmark.scala | 218 +-- 2 files changed, 152 insertions(+), 111 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ccd07b73/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt -- diff --git a/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt b/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt new file mode 100644 index 000..f3044da --- /dev/null +++ 
b/sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt @@ -0,0 +1,45 @@ + +Hive UDAF vs Spark AF + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +hive udaf vs spark af: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +hive udaf w/o group by6370 / 6400 0.0 97193.6 1.0X +spark af w/o group by 54 / 63 1.2 820.8 118.4X +hive udaf w/ group by 4492 / 4507 0.0 68539.5 1.4X +spark af w/ group by w/o fallback 58 / 64 1.1 881.7 110.2X +spark af w/ group by w/ fallback 136 / 142 0.5 2075.0 46.8X + + + +ObjectHashAggregateExec vs SortAggregateExec - typed_count + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +object agg v.s. sort agg:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +sort agg w/ group by41500 / 41630 2.5 395.8 1.0X +object agg w/ group by w/o fallback 10075 / 10122 10.4 96.1 4.1X +object agg w/ group by w/ fallback 28131 / 28205 3.7 268.3 1.5X +sort agg w/o group by 6182 / 6221 17.0 59.0 6.7X +object agg w/o group by w/o fallback 5435 / 5468 19.3 51.8 7.6X + + + +ObjectHashAggregateExec vs SortAggregateExec - percentile_approx + + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +object agg v.s. sort agg:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative
svn commit: r30410 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_12_02-6540c2f-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 19:17:13 2018 New Revision: 30410 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_12_02-6540c2f docs [This commit notification would consist of 1474 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark-website git commit: Use Heilmeier Catechism for SPIP template.
Repository: spark-website Updated Branches: refs/heads/asf-site e4b87718d -> 005a2a0d1 Use Heilmeier Catechism for SPIP template. Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/005a2a0d Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/005a2a0d Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/005a2a0d Branch: refs/heads/asf-site Commit: 005a2a0d1d88c893518d98cddcb7d373a562b339 Parents: e4b8771 Author: Reynold Xin Authored: Wed Oct 24 11:51:43 2018 -0700 Committer: Reynold Xin Committed: Thu Oct 25 11:25:30 2018 -0700 -- improvement-proposals.md| 34 ++ site/improvement-proposals.html | 32 2 files changed, 42 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/005a2a0d/improvement-proposals.md -- diff --git a/improvement-proposals.md b/improvement-proposals.md index 8fab696..55d57d9 100644 --- a/improvement-proposals.md +++ b/improvement-proposals.md @@ -11,7 +11,7 @@ navigation: The purpose of an SPIP is to inform and involve the user community in major improvements to the Spark codebase throughout the development process, to increase the likelihood that user needs are met. -SPIPs should be used for significant user-facing or cross-cutting changes, not small incremental improvements. When in doubt, if a committer thinks a change needs an SPIP, it does. +SPIPs should be used for significant user-facing or cross-cutting changes, not small incremental improvements. When in doubt, if a committer thinks a change needs an SPIP, it does. What is a SPIP? @@ -48,30 +48,40 @@ Any community member can help by discussing whether an SPIP is SPIP Process Proposing an SPIP -Anyone may propose an SPIP, using the template below. Please only submit an SPIP if you are willing to help, at least with discussion. +Anyone may propose an SPIP, using the document template below. 
Please only submit an SPIP if you are willing to help, at least with discussion. After a SPIP is created, the author should email d...@spark.apache.org to notify the community of the SPIP, and discussions should ensue on the JIRA ticket. If an SPIP is too small or incremental and should have been done through the normal JIRA process, a committer should remove the SPIP label. -Template for an SPIP +SPIP Document Template - -Background and Motivation: What problem is this solving? +A SPIP document is a short document with a few questions, inspired by the Heilmeier Catechism: -Target Personas: Examples include data scientists, data engineers, library developers, devops. A single SPIP can have multiple target personas. +Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. -Goals: What must this allow users to do, that they can't currently? +Q2. What problem is this proposal NOT designed to solve? -Non-Goals: What problem is this proposal not designed to solve? +Q3. How is it done today, and what are the limits of current practice? -Proposed API Changes: Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. +Q4. What is new in your approach and why do you think it will be successful? + +Q5. Who cares? If you are successful, what difference will it make? + +Q6. What are the risks? + +Q7. How long will it take? + +Q8. What are the mid-term and final "exams" to check for success? + +Appendix A. Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. + +Appendix B. Optional Design Sketch: How are the goals going to be accomplished? Give sufficient technical detail to allow a contributor to judge whether it's likely to be feasible. Note that this is not a full design document. + +Appendix C. Optional Rejected Designs: What alternatives were considered? Why were they rejected? 
If no alternatives have been considered, the problem needs more thought. -Optional Design Sketch: How are the goals going to be accomplished? Give sufficient technical detail to allow a contributor to judge whether it's likely to be feasible. This is not a full design document. -Optional Rejected Designs: What alternatives were considered? Why were they rejected? If no alternatives have been considered, the problem needs more thought. - Discussing an SPIP http://git-wip-us.apache.org/repos/asf/spark-website/blob/005a2a0d/site/improvement-proposals.html -- diff --git a/site/improvement-proposals.html
svn commit: r30409 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_25_10_02-a20660b-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 17:17:26 2018 New Revision: 30409 Log: Apache Spark 2.4.1-SNAPSHOT-2018_10_25_10_02-a20660b docs [This commit notification would consist of 1478 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r30407 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_08_03-002f9c1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 15:17:40 2018 New Revision: 30407 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_08_03-002f9c1 docs [This commit notification would consist of 1473 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide
Repository: spark Updated Branches: refs/heads/branch-2.4 d5e694805 -> a20660b6f [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png Java: https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png Python: https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png R: https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png Closes #22675 from WeichenXu123/add_image_source_doc. Authored-by: WeichenXu Signed-off-by: Wenchen Fan (cherry picked from commit 6540c2f8f31bbde4df57e48698f46bb1815740ff) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a20660b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a20660b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a20660b6 Branch: refs/heads/branch-2.4 Commit: a20660b6fcee2d436ef182337255cea5e2eb7216 Parents: d5e6948 Author: WeichenXu Authored: Thu Oct 25 23:03:16 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 23:04:06 2018 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-datasource.md | 108 +++ .../spark/ml/source/image/ImageDataSource.scala | 17 +-- 3 files changed, 120 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index b5a6641..8e366f7 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,5 +1,7 @@ - text: Basic statistics url: ml-statistics.html +- text: Data sources + url: ml-datasource - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting 
features http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/ml-datasource.md -- diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md new file mode 100644 index 000..1508332 --- /dev/null +++ b/docs/ml-datasource.md @@ -0,0 +1,108 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via `ImageIO` in Java library. +The loaded DataFrame has one `StructType` column: "image", containing image data stored as image schema. +The schema of the `image` column is: + - origin: `StringType` (represents the file path of the image) + - height: `IntegerType` (height of the image) + - width: `IntegerType` (width of the image) + - nChannels: `IntegerType` (number of image channels) + - mode: `IntegerType` (OpenCV-compatible type) + - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. 
+ +{% highlight scala %} +scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens") +df: org.apache.spark.sql.DataFrame = [image: struct] + +scala> df.select("image.origin", "image.width", "image.height").show(truncate=false) ++---+-+--+ +|origin |width|height| ++---+-+--+ +|file:///spark/data/mllib/images/origin/kittens/54893.jpg |300 |311 | +|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg|199 |313 | +|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 | +|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg|300 |296 | ++---+-+--+ +{% endhighlight %} + + +
spark git commit: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide
Repository: spark Updated Branches: refs/heads/master 002f9c169 -> 6540c2f8f [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png Java: https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png Python: https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png R: https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png Closes #22675 from WeichenXu123/add_image_source_doc. Authored-by: WeichenXu Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6540c2f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6540c2f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6540c2f8 Branch: refs/heads/master Commit: 6540c2f8f31bbde4df57e48698f46bb1815740ff Parents: 002f9c1 Author: WeichenXu Authored: Thu Oct 25 23:03:16 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 23:03:16 2018 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-datasource.md | 108 +++ .../spark/ml/source/image/ImageDataSource.scala | 17 +-- 3 files changed, 120 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index b5a6641..8e366f7 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,5 +1,7 @@ - text: Basic statistics url: ml-statistics.html +- text: Data sources + url: ml-datasource - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/ml-datasource.md -- diff --git 
a/docs/ml-datasource.md b/docs/ml-datasource.md new file mode 100644 index 000..1508332 --- /dev/null +++ b/docs/ml-datasource.md @@ -0,0 +1,108 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via `ImageIO` in Java library. +The loaded DataFrame has one `StructType` column: "image", containing image data stored as image schema. +The schema of the `image` column is: + - origin: `StringType` (represents the file path of the image) + - height: `IntegerType` (height of the image) + - width: `IntegerType` (width of the image) + - nChannels: `IntegerType` (number of image channels) + - mode: `IntegerType` (OpenCV-compatible type) + - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. 
+ +{% highlight scala %} +scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens") +df: org.apache.spark.sql.DataFrame = [image: struct] + +scala> df.select("image.origin", "image.width", "image.height").show(truncate=false) ++---+-+--+ +|origin |width|height| ++---+-+--+ +|file:///spark/data/mllib/images/origin/kittens/54893.jpg |300 |311 | +|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg|199 |313 | +|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 | +|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg|300 |296 | ++---+-+--+ +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. + +{% highlight java %}
spark git commit: [SPARK-24794][CORE] Driver launched through rest should use all masters
Repository: spark Updated Branches: refs/heads/master 65c653fb4 -> 002f9c169 [SPARK-24794][CORE] Driver launched through rest should use all masters ## What changes were proposed in this pull request? In standalone cluster mode, one can launch a driver with supervise mode enabled. The StandaloneRestServer class uses the host and port of the current master as the spark.master property when launching the driver (even when running in HA mode), and it ignores the spark.master property passed as part of the request. Because of this, if the Spark masters switch for some reason and your driver is killed unexpectedly and relaunched, it will try to connect to the master specified in the driver command as -Dspark.master. That master will be in STANDBY mode, and after several attempts the SparkContext will kill itself (even though the secondary master was alive and healthy). This change picks up the spark.master property from the request and uses it to launch the driver process; as a result, the driver process has both masters in its -Dspark.master property. Even if the masters switch, the SparkContext can still connect to the ALIVE master and work correctly. ## How was this patch tested? This patch was manually tested on a standalone cluster running 2.2.1. It was rebased on current master and all tests were executed. I have added a unit test for this change (but since I am new I hope I have covered all). Closes #21816 from bsikander/rest_driver_fix. 
Authored-by: Behroz Sikander Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/002f9c16 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/002f9c16 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/002f9c16 Branch: refs/heads/master Commit: 002f9c169eb20b0d71b6d0296595f343c7f5bab2 Parents: 65c653f Author: Behroz Sikander Authored: Thu Oct 25 08:36:44 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 08:36:44 2018 -0500 -- .../deploy/rest/StandaloneRestServer.scala | 12 +++- .../deploy/rest/StandaloneRestSubmitSuite.scala | 20 2 files changed, 31 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/002f9c16/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala index 22b65ab..afa1a5f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala @@ -138,6 +138,16 @@ private[rest] class StandaloneSubmitRequestServlet( val driverExtraClassPath = sparkProperties.get("spark.driver.extraClassPath") val driverExtraLibraryPath = sparkProperties.get("spark.driver.extraLibraryPath") val superviseDriver = sparkProperties.get("spark.driver.supervise") +// The semantics of "spark.master" and the masterUrl are different. While the +// property "spark.master" could contain all registered masters, masterUrl +// contains only the active master. To make sure a Spark driver can recover +// in a multi-master setup, we use the "spark.master" property while submitting +// the driver. 
+val masters = sparkProperties.get("spark.master") +val (_, masterPort) = Utils.extractHostPortFromSparkUrl(masterUrl) +val masterRestPort = this.conf.getInt("spark.master.rest.port", 6066) +val updatedMasters = masters.map( + _.replace(s":$masterRestPort", s":$masterPort")).getOrElse(masterUrl) val appArgs = request.appArgs // Filter SPARK_LOCAL_(IP|HOSTNAME) environment variables from being set on the remote system. val environmentVariables = @@ -146,7 +156,7 @@ private[rest] class StandaloneSubmitRequestServlet( // Construct driver description val conf = new SparkConf(false) .setAll(sparkProperties) - .set("spark.master", masterUrl) + .set("spark.master", updatedMasters) val extraClassPath = driverExtraClassPath.toSeq.flatMap(_.split(File.pathSeparator)) val extraLibraryPath = driverExtraLibraryPath.toSeq.flatMap(_.split(File.pathSeparator)) val extraJavaOpts = driverExtraJavaOptions.map(Utils.splitCommandString).getOrElse(Seq.empty) http://git-wip-us.apache.org/repos/asf/spark/blob/002f9c16/core/src/test/scala/org/apache/spark/deploy/rest/StandaloneRestSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/rest/StandaloneRestSubmitSuite.scala
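The URL rewrite in the diff above can be illustrated in isolation. This is a hedged Scala sketch with invented example values (hosts and the multi-master list are made up for the demo); the ports used are Spark's defaults — 6066 for REST submission and 7077 for the master RPC endpoint:

```scala
// "spark.master" as submitted over REST may list every registered master
// using the REST port, but the driver must dial the RPC port that the
// active master (masterUrl) was contacted on. So each REST port in the
// list is rewritten to the RPC port, preserving ALL masters for failover.
val masterRestPort = 6066                      // spark.master.rest.port default
val masterPort     = 7077                      // extracted from the active masterUrl
val masterUrl      = s"spark://master1:$masterPort"
val masters: Option[String] = Some("spark://master1:6066,master2:6066")

val updatedMasters = masters
  .map(_.replace(s":$masterRestPort", s":$masterPort"))
  .getOrElse(masterUrl)
// updatedMasters == "spark://master1:7077,master2:7077"
```

Because both masters survive the rewrite, a supervised driver that is relaunched can still reach whichever master is currently ALIVE instead of retrying a STANDBY one.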
spark git commit: [BUILD] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 3123c7f48 -> 65c653fb4 [BUILD] Close stale PRs Closes #22567 Closes #18457 Closes #21517 Closes #21858 Closes #22383 Closes #19219 Closes #22401 Closes #22811 Closes #20405 Closes #21933 Closes #22819 from srowen/ClosePRs. Authored-by: Sean Owen Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65c653fb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65c653fb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65c653fb Branch: refs/heads/master Commit: 65c653fb455336948c5af2e0f381d1d8f5640874 Parents: 3123c7f Author: Sean Owen Authored: Thu Oct 25 08:35:27 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 08:35:27 2018 -0500 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25808][BUILD] Upgrade jsr305 version from 1.3.9 to 3.0.0
Repository: spark Updated Branches: refs/heads/master cb5ea201d -> 3123c7f48

[SPARK-25808][BUILD] Upgrade jsr305 version from 1.3.9 to 3.0.0

## What changes were proposed in this pull request?

We found the warnings below when building the Spark project:

```
[warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9
[warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0)
[warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
[warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
[warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on 1.3.9)
```

So we upgrade the jsr305 dependency from 1.3.9 to 3.0.0 to resolve the conflict and fix this warning.

## How was this patch tested?

sbt "core/testOnly"
sbt "sql/testOnly"

Closes #22803 from daviddingly/master.

Authored-by: xiaoding Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3123c7f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3123c7f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3123c7f4 Branch: refs/heads/master Commit: 3123c7f4881225e912842e9897b0e0e6bd0f4b20 Parents: cb5ea20 Author: xiaoding Authored: Thu Oct 25 07:06:17 2018 -0500 Committer: Sean Owen Committed: Thu Oct 25 07:06:17 2018 -0500

--
 dev/deps/spark-deps-hadoop-2.7 | 2 +-
 dev/deps/spark-deps-hadoop-3.1 | 2 +-
 pom.xml                        | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 06173f7..537831e 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -127,7 +127,7 @@ json4s-core_2.11-3.5.3.jar
 json4s-jackson_2.11-3.5.3.jar
 json4s-scalap_2.11-3.5.3.jar
 jsp-api-2.1.jar
-jsr305-1.3.9.jar
+jsr305-3.0.0.jar
 jta-1.1.jar
 jtransforms-2.4.0.jar
 jul-to-slf4j-1.7.16.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/dev/deps/spark-deps-hadoop-3.1
--
diff --git a/dev/deps/spark-deps-hadoop-3.1 b/dev/deps/spark-deps-hadoop-3.1
index 62fddf0..bc4ef31 100644
--- a/dev/deps/spark-deps-hadoop-3.1
+++ b/dev/deps/spark-deps-hadoop-3.1
@@ -128,7 +128,7 @@ json4s-core_2.11-3.5.3.jar
 json4s-jackson_2.11-3.5.3.jar
 json4s-scalap_2.11-3.5.3.jar
 jsp-api-2.1.jar
-jsr305-1.3.9.jar
+jsr305-3.0.0.jar
 jta-1.1.jar
 jtransforms-2.4.0.jar
 jul-to-slf4j-1.7.16.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/3123c7f4/pom.xml
--
diff --git a/pom.xml b/pom.xml
index b1f0a53..92934c1 100644
--- a/pom.xml
+++ b/pom.xml
@@ -172,7 +172,7 @@
     2.22.2
     2.9.3
     3.5.2
-    1.3.9
+    3.0.0
     0.9.3
     4.7
     1.1
spark git commit: [SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag
Repository: spark Updated Branches: refs/heads/master ddd1b1e8a -> cb5ea201d

[SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag

## What changes were proposed in this pull request?

This was inspired while implementing #21732. Currently, `ScalaReflection` needs to consider how `ExpressionEncoder` uses generated serializers and deserializers, and `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems better to refactor `ExpressionEncoder`. It should also make SPARK-24762 easier to do.

To summarize the proposed changes:

1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from its Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row. In other words, they return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value. A serializer returned by `serializerForType` will serialize an object at `row(0)` to a corresponding Spark SQL representation, e.g. primitive type, array, map, struct. A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object of the given type.
3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type T and uses them to create the serializer and deserializer for T <-> row serialization. `ExpressionEncoder` no longer needs to remember whether a serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer.

## How was this patch tested?

Existing tests.

Closes #22749 from viirya/SPARK-24762-refactor.
Authored-by: Liang-Chi Hsieh Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb5ea201 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb5ea201 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb5ea201 Branch: refs/heads/master Commit: cb5ea201df5fae8aacb653ffb4147b9288bca1e9 Parents: ddd1b1e Author: Liang-Chi Hsieh Authored: Thu Oct 25 19:27:45 2018 +0800 Committer: Wenchen Fan Committed: Thu Oct 25 19:27:45 2018 +0800

--
 .../scala/org/apache/spark/sql/Encoders.scala   |   8 +-
 .../spark/sql/catalyst/JavaTypeInference.scala  |  78 ---
 .../spark/sql/catalyst/ScalaReflection.scala    | 182 -
 .../catalyst/encoders/ExpressionEncoder.scala   | 201 +++
 .../sql/catalyst/encoders/RowEncoder.scala      |  16 +-
 .../sql/catalyst/ScalaReflectionSuite.scala     |  70 ---
 .../encoders/ExpressionEncoderSuite.scala       |   6 +-
 .../sql/catalyst/encoders/RowEncoderSuite.scala |   2 +-
 .../scala/org/apache/spark/sql/Dataset.scala    |  10 +-
 .../spark/sql/KeyValueGroupedDataset.scala      |   2 +-
 .../aggregate/TypedAggregateExpression.scala    |  12 +-
 .../org/apache/spark/sql/DatasetSuite.scala     |   2 +-
 12 files changed, 304 insertions(+), 285 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cb5ea201/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
index b47ec0b..8a30c81 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
@@ -203,12 +203,10 @@ object Encoders {
     validatePublicClass[T]()

     ExpressionEncoder[T](
-      schema = new StructType().add("value", BinaryType),
-      flat = true,
-      serializer = Seq(
+      objSerializer =
         EncodeUsingSerializer(
-          BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo)),
-      deserializer =
+          BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo),
+      objDeserializer =
         DecodeUsingSerializer[T](
           Cast(GetColumnByOrdinal(0, BinaryType), BinaryType),
           classTag[T],

http://git-wip-us.apache.org/repos/asf/spark/blob/cb5ea201/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
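Stepping back from the diff, the shape of the refactoring can be pictured language-agnostically: an encoder is constructed from a single object-level serializer/deserializer pair that assumes the value sits at ordinal 0, and the row-level forms are derived from that pair instead of being tracked with a `flat` flag. A toy Python sketch of that design — the class and method names are illustrative, not Spark's actual API:

```python
class ToyEncoder:
    """Toy model of the refactored design: the encoder is built from one
    object-level (de)serializer pair; row-level forms are derived from it."""

    def __init__(self, obj_serializer, obj_deserializer):
        self.obj_serializer = obj_serializer      # object -> SQL value
        self.obj_deserializer = obj_deserializer  # SQL value -> object

    def to_row(self, obj):
        # Row-level serializer derived from the atomic one: the serialized
        # value is placed at ordinal 0 of the row.
        return [self.obj_serializer(obj)]

    def from_row(self, row):
        # Row-level deserializer likewise reads the value at ordinal 0.
        return self.obj_deserializer(row[0])

# A string <-> int "encoder": serialize parses to int, deserialize formats back.
enc = ToyEncoder(obj_serializer=int, obj_deserializer=str)
row = enc.to_row("42")       # [42]
value = enc.from_row(row)    # "42"
```

The point of the design is visible even in the toy: building a new encoder on top of an existing one only requires changing where in the row the atomic serializer/deserializer reads and writes, with no per-encoder flag to keep consistent.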
svn commit: r30404 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_25_00_03-ddd1b1e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Oct 25 07:17:50 2018 New Revision: 30404 Log: Apache Spark 3.0.0-SNAPSHOT-2018_10_25_00_03-ddd1b1e docs [This commit notification would consist of 1473 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
spark git commit: [SPARK-24572][SPARKR] "eager execution" for R shell, IDE
Repository: spark Updated Branches: refs/heads/master 19ada15d1 -> ddd1b1e8a

[SPARK-24572][SPARKR] "eager execution" for R shell, IDE

## What changes were proposed in this pull request?

Check the `spark.sql.repl.eagerEval.enabled` configuration property in the SparkDataFrame `show()` method. If the `SparkSession` has eager execution enabled, the data will be returned to the R client when the data frame is created. So instead of seeing this

```
> df <- createDataFrame(faithful)
> df
SparkDataFrame[eruptions:double, waiting:double]
```

you will see

```
> df <- createDataFrame(faithful)
> df
+---------+-------+
|eruptions|waiting|
+---------+-------+
|      3.6|   79.0|
|      1.8|   54.0|
|    3.333|   74.0|
|    2.283|   62.0|
|    4.533|   85.0|
|    2.883|   55.0|
|      4.7|   88.0|
|      3.6|   85.0|
|     1.95|   51.0|
|     4.35|   85.0|
|    1.833|   54.0|
|    3.917|   84.0|
|      4.2|   78.0|
|     1.75|   47.0|
|      4.7|   83.0|
|    2.167|   52.0|
|     1.75|   62.0|
|      4.8|   84.0|
|      1.6|   52.0|
|     4.25|   79.0|
+---------+-------+
only showing top 20 rows
```

## How was this patch tested?

Manual tests as well as unit tests (one new test case is added).

Author: adrian555

Closes #22455 from adrian555/eager_execution.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ddd1b1e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ddd1b1e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ddd1b1e8 Branch: refs/heads/master Commit: ddd1b1e8aec023e61b186c494ccbc182db2eb3ca Parents: 19ada15 Author: adrian555 Authored: Wed Oct 24 23:42:06 2018 -0700 Committer: Felix Cheung Committed: Wed Oct 24 23:42:06 2018 -0700 -- R/pkg/R/DataFrame.R | 36 -- R/pkg/tests/fulltests/test_sparkSQL_eager.R | 72 docs/sparkr.md | 42 .../org/apache/spark/sql/internal/SQLConf.scala | 7 +- 4 files changed, 148 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ddd1b1e8/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 3469188..bf82d0c 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -226,7 +226,9 @@ setMethod("showDF", #' show #' -#' Print class and type information of a Spark object. +#' If eager evaluation is enabled and the Spark object is a SparkDataFrame, evaluate the +#' SparkDataFrame and print top rows of the SparkDataFrame, otherwise, print the class +#' and type information of the Spark object. #' #' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, WindowSpec. 
 #'
@@ -244,11 +246,33 @@ setMethod("showDF",
 #' @note show(SparkDataFrame) since 1.4.0
 setMethod("show", "SparkDataFrame",
           function(object) {
-            cols <- lapply(dtypes(object), function(l) {
-              paste(l, collapse = ":")
-            })
-            s <- paste(cols, collapse = ", ")
-            cat(paste(class(object), "[", s, "]\n", sep = ""))
+            allConf <- sparkR.conf()
+            prop <- allConf[["spark.sql.repl.eagerEval.enabled"]]
+            if (!is.null(prop) && identical(prop, "true")) {
+              argsList <- list()
+              argsList$x <- object
+              prop <- allConf[["spark.sql.repl.eagerEval.maxNumRows"]]
+              if (!is.null(prop)) {
+                numRows <- as.integer(prop)
+                if (numRows > 0) {
+                  argsList$numRows <- numRows
+                }
+              }
+              prop <- allConf[["spark.sql.repl.eagerEval.truncate"]]
+              if (!is.null(prop)) {
+                truncate <- as.integer(prop)
+                if (truncate > 0) {
+                  argsList$truncate <- truncate
+                }
+              }
+              do.call(showDF, argsList)
+            } else {
+              cols <- lapply(dtypes(object), function(l) {
+                paste(l, collapse = ":")
+              })
+              s <- paste(cols, collapse = ", ")
+              cat(paste(class(object), "[", s, "]\n", sep = ""))
+            }
           })

 #' DataTypes

http://git-wip-us.apache.org/repos/asf/spark/blob/ddd1b1e8/R/pkg/tests/fulltests/test_sparkSQL_eager.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL_eager.R b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
new file mode 100644
index 000..df7354f
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
@@ -0,0 +1,72 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+#
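The `show()` dispatch in the R diff boils down to: read the eager-eval properties from the session config, and only pass `numRows`/`truncate` along when they are set and positive. A hedged Python sketch of that argument-building logic — the config keys are the real Spark properties, but the function itself is illustrative:

```python
def build_show_args(conf):
    """Return showDF-style keyword arguments when eager evaluation is
    enabled, or None when the plain schema string should be printed."""
    if conf.get("spark.sql.repl.eagerEval.enabled") != "true":
        return None
    args = {}
    # Only forward maxNumRows / truncate when present and positive,
    # mirroring the if (!is.null(prop)) / if (x > 0) checks in the R code.
    max_rows = int(conf.get("spark.sql.repl.eagerEval.maxNumRows", 0))
    if max_rows > 0:
        args["numRows"] = max_rows
    truncate = int(conf.get("spark.sql.repl.eagerEval.truncate", 0))
    if truncate > 0:
        args["truncate"] = truncate
    return args

conf = {"spark.sql.repl.eagerEval.enabled": "true",
        "spark.sql.repl.eagerEval.maxNumRows": "20"}
print(build_show_args(conf))  # {'numRows': 20}
```

Note the design choice this mirrors: unset or non-positive values are simply omitted, so `showDF` falls back to its own defaults rather than receiving an explicit zero.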
spark git commit: [SPARK-24516][K8S] Change Python default to Python3
Repository: spark Updated Branches: refs/heads/master b2e325625 -> 19ada15d1 [SPARK-24516][K8S] Change Python default to Python3 ## What changes were proposed in this pull request? As this is targeted for 3.0.0 and Python2 will be deprecated by Jan 1st, 2020, I feel it is appropriate to change the default to Python3. Especially as these projects [found here](https://python3statement.org/) are deprecating their support. ## How was this patch tested? Unit and Integration tests Author: Ilan Filonenko Closes #22810 from ifilonenko/SPARK-24516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/19ada15d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/19ada15d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/19ada15d Branch: refs/heads/master Commit: 19ada15d1b15256de4e3bf2f4b17d87ea0d65cc3 Parents: b2e3256 Author: Ilan Filonenko Authored: Wed Oct 24 23:29:47 2018 -0700 Committer: Felix Cheung Committed: Wed Oct 24 23:29:47 2018 -0700 -- docs/running-on-kubernetes.md | 2 +- .../core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/19ada15d/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index d629ed3..60c9279 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -816,7 +816,7 @@ specific to Spark on Kubernetes. spark.kubernetes.pyspark.pythonVersion - "2" + "3" This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3. 
http://git-wip-us.apache.org/repos/asf/spark/blob/19ada15d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
--
diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
index c2ad80c..fff8fa4 100644
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
@@ -223,7 +223,7 @@ private[spark] object Config extends Logging {
     .stringConf
     .checkValue(pv => List("2", "3").contains(pv),
       "Ensure that major Python version is either Python2 or Python3")
-    .createWithDefault("2")
+    .createWithDefault("3")

   val KUBERNETES_KERBEROS_KRB5_FILE =
     ConfigBuilder("spark.kubernetes.kerberos.krb5.path")
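The `ConfigBuilder` change above pairs a validated domain ({"2", "3"}) with a new default of "3". A small self-contained sketch of that check-then-default pattern — pure Python, not Spark's `ConfigBuilder` API; the function name is illustrative:

```python
ALLOWED_PYTHON_VERSIONS = ("2", "3")

def resolve_python_version(conf):
    """Return spark.kubernetes.pyspark.pythonVersion, defaulting to "3"
    (the new default) and rejecting unsupported major versions."""
    value = conf.get("spark.kubernetes.pyspark.pythonVersion", "3")
    if value not in ALLOWED_PYTHON_VERSIONS:
        # Same error message as the checkValue call in the diff above.
        raise ValueError(
            "Ensure that major Python version is either Python2 or Python3")
    return value

default = resolve_python_version({})  # "3" under the new default
explicit = resolve_python_version(
    {"spark.kubernetes.pyspark.pythonVersion": "2"})  # "2" still allowed
```

Validation runs on the explicit value as well as the default, so a typo like "4" fails fast at configuration time instead of surfacing later inside the driver container.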