spark git commit: [SQL] Fix typo in DataframeWriter doc

2017-07-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6550086bb -> 51f99fb25


[SQL] Fix typo in DataframeWriter doc

## What changes were proposed in this pull request?

The format of `none` should be consistent with the other compression 
codecs (`snappy`, `lz4`), i.e. written as `none`.

## How was this patch tested?

This is a typo.

Author: GuoChenzhao 

Closes #18758 from gczsjdy/typo.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51f99fb2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51f99fb2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51f99fb2

Branch: refs/heads/master
Commit: 51f99fb25b0d524164e7bf15e63d99abb6c22431
Parents: 6550086
Author: GuoChenzhao 
Authored: Sun Jul 30 22:18:38 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Jul 30 22:18:38 2017 +0900

--
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/51f99fb2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 255c406..0fcda46 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -499,7 +499,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) 
{
* 
* `compression` (default is the value specified in 
`spark.sql.parquet.compression.codec`):
* compression codec to use when saving to file. This can be one of the 
known case-insensitive
-   * shorten names(none, `snappy`, `gzip`, and `lzo`). This will override
+   * shorten names(`none`, `snappy`, `gzip`, and `lzo`). This will override
* `spark.sql.parquet.compression.codec`.
* 
*





spark git commit: [SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle registry

2017-08-02 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 467ee8dff -> 690f491f6


[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle registry

## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, 
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race 
condition can occur when broadcast variables pickled from one thread are added 
to the shared `_pickled_broadcast_vars` and end up in the Python command built 
by another thread. This PR introduces a thread-safe pickle registry backed by 
thread-local storage, so that when a Python command is pickled (causing its 
broadcast variables to be pickled and added to the registry), each thread has 
its own view of the registry from which to retrieve and clear the broadcast 
variables it used.
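
For illustration only (not part of this change), here is a minimal sketch of 
the multi-threaded usage pattern the fix targets. It assumes an existing 
`SparkContext` bound to `sc`; the helper name `run_concurrent_broadcast_jobs` 
and the job bodies are made up for the example.

```python
# Two driver threads each pickle a job closure that references a different
# broadcast variable. With a plain shared set as the registry, the command
# pickled by one thread could also pick up (and later clear) the broadcast
# variable registered by the other thread; the thread-local registry keeps
# each thread's pickled broadcast variables separate.
import threading


def run_concurrent_broadcast_jobs(sc):
    b1 = sc.broadcast(list(range(3)))
    b2 = sc.broadcast(list(range(3)))
    results = {}

    def job(name, bvar):
        # Pickling this closure adds `bvar` to the broadcast pickle registry.
        results[name] = (sc.parallelize(range(4), 2)
                           .map(lambda x: x + bvar.value[0])
                           .collect())

    threads = [threading.Thread(target=job, args=("t1", b1)),
               threading.Thread(target=job, args=("t2", b2))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```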

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler 

Closes #18823 from BryanCutler/branch-2.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/690f491f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/690f491f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/690f491f

Branch: refs/heads/branch-2.2
Commit: 690f491f6e979bc960baa05de1a66306b06dc85a
Parents: 467ee8d
Author: Bryan Cutler 
Authored: Thu Aug 3 10:28:19 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 3 10:28:19 2017 +0900

--
 python/pyspark/broadcast.py | 19 +
 python/pyspark/context.py   |  4 ++--
 python/pyspark/tests.py | 44 
 3 files changed, 65 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/690f491f/python/pyspark/broadcast.py
--
diff --git a/python/pyspark/broadcast.py b/python/pyspark/broadcast.py
index b1b59f7..02fc515 100644
--- a/python/pyspark/broadcast.py
+++ b/python/pyspark/broadcast.py
@@ -19,6 +19,7 @@ import os
 import sys
 import gc
 from tempfile import NamedTemporaryFile
+import threading
 
 from pyspark.cloudpickle import print_exec
 from pyspark.util import _exception_message
@@ -139,6 +140,24 @@ class Broadcast(object):
 return _from_id, (self._jbroadcast.id(),)
 
 
+class BroadcastPickleRegistry(threading.local):
+""" Thread-local registry for broadcast variables that have been pickled
+"""
+
+def __init__(self):
+self.__dict__.setdefault("_registry", set())
+
+def __iter__(self):
+for bcast in self._registry:
+yield bcast
+
+def add(self, bcast):
+self._registry.add(bcast)
+
+def clear(self):
+self._registry.clear()
+
+
 if __name__ == "__main__":
 import doctest
 (failure_count, test_count) = doctest.testmod()

http://git-wip-us.apache.org/repos/asf/spark/blob/690f491f/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 3be0732..49be76e 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -30,7 +30,7 @@ from py4j.protocol import Py4JError
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
-from pyspark.broadcast import Broadcast
+from pyspark.broadcast import Broadcast, BroadcastPickleRegistry
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
@@ -198,7 +198,7 @@ class SparkContext(object):
 # This allows other code to determine which Broadcast instances have
 # been pickled, so it can determine which Java broadcast objects to
 # send.
-self._pickled_broadcast_vars = set()
+self._pickled_broadcast_vars = BroadcastPickleRegistry()
 
 SparkFiles._sc = self
 root_dir = SparkFiles.getRootDirectory()

http://git-wip-us.apache.org/repos/asf/spark/blob/690f491f/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index bb13de5..20a933e 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -858,6 +858,50 @@ class RDDTests(ReusedPySparkTestCase):
 self.assertEqual(N, size)
 self.assertEqual(checksum, csum)
 
+def test_multithread_broadcast_pickle(self):
+import threading
+
+b1 = self.sc.broadcast(list(range(3)))
+b2 = self.sc.broadcast(list(range(3)))
+
+def f1():
+return b1.value
+
+def f2():
+return b2.value
+
+funcs_num_pickled = {f1: None, f2: None}
+
+def 

spark git commit: [SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle registry

2017-08-02 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 b31b30209 -> d93e45b8b


[SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle registry

## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, 
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race 
condition can occur when broadcast variables pickled from one thread are added 
to the shared `_pickled_broadcast_vars` and end up in the Python command built 
by another thread. This PR introduces a thread-safe pickle registry backed by 
thread-local storage, so that when a Python command is pickled (causing its 
broadcast variables to be pickled and added to the registry), each thread has 
its own view of the registry from which to retrieve and clear the broadcast 
variables it used.

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler 

Closes #18825 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717-2_1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d93e45b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d93e45b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d93e45b8

Branch: refs/heads/branch-2.1
Commit: d93e45b8bad6efd34ed7c03b2602df35788961a4
Parents: b31b302
Author: Bryan Cutler 
Authored: Thu Aug 3 10:35:56 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 3 10:35:56 2017 +0900

--
 python/pyspark/broadcast.py | 19 +
 python/pyspark/context.py   |  4 ++--
 python/pyspark/tests.py | 44 
 3 files changed, 65 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d93e45b8/python/pyspark/broadcast.py
--
diff --git a/python/pyspark/broadcast.py b/python/pyspark/broadcast.py
index 74dee14..8f9b42e 100644
--- a/python/pyspark/broadcast.py
+++ b/python/pyspark/broadcast.py
@@ -19,6 +19,7 @@ import os
 import sys
 import gc
 from tempfile import NamedTemporaryFile
+import threading
 
 from pyspark.cloudpickle import print_exec
 
@@ -137,6 +138,24 @@ class Broadcast(object):
 return _from_id, (self._jbroadcast.id(),)
 
 
+class BroadcastPickleRegistry(threading.local):
+""" Thread-local registry for broadcast variables that have been pickled
+"""
+
+def __init__(self):
+self.__dict__.setdefault("_registry", set())
+
+def __iter__(self):
+for bcast in self._registry:
+yield bcast
+
+def add(self, bcast):
+self._registry.add(bcast)
+
+def clear(self):
+self._registry.clear()
+
+
 if __name__ == "__main__":
 import doctest
 (failure_count, test_count) = doctest.testmod()

http://git-wip-us.apache.org/repos/asf/spark/blob/d93e45b8/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index ac4b2b0..5a4c2fa 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -30,7 +30,7 @@ from py4j.protocol import Py4JError
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
-from pyspark.broadcast import Broadcast
+from pyspark.broadcast import Broadcast, BroadcastPickleRegistry
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
@@ -200,7 +200,7 @@ class SparkContext(object):
 # This allows other code to determine which Broadcast instances have
 # been pickled, so it can determine which Java broadcast objects to
 # send.
-self._pickled_broadcast_vars = set()
+self._pickled_broadcast_vars = BroadcastPickleRegistry()
 
 SparkFiles._sc = self
 root_dir = SparkFiles.getRootDirectory()

http://git-wip-us.apache.org/repos/asf/spark/blob/d93e45b8/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 8d227ea..25ed127 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -793,6 +793,50 @@ class RDDTests(ReusedPySparkTestCase):
 self.assertEqual(N, size)
 self.assertEqual(checksum, csum)
 
+def test_multithread_broadcast_pickle(self):
+import threading
+
+b1 = self.sc.broadcast(list(range(3)))
+b2 = self.sc.broadcast(list(range(3)))
+
+def f1():
+return b1.value
+
+def f2():
+return b2.value
+
+funcs_num_pickled = {f1: None, f2: None}
+
+def do_pickle(f, sc):
+  

spark git commit: [SPARK-21602][R] Add map_keys and map_values functions to R

2017-08-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master e7c59b417 -> 97ba49183


[SPARK-21602][R] Add map_keys and map_values functions to R

## What changes were proposed in this pull request?

This PR adds `map_values` and `map_keys` to the R API.

```r
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v = create_map(df$model, df$cyl))
> head(select(tmp, map_keys(tmp$v)))
```
```
        map_keys(v)
1         Mazda RX4
2     Mazda RX4 Wag
3        Datsun 710
4    Hornet 4 Drive
5 Hornet Sportabout
6           Valiant
```
```r
> head(select(tmp, map_values(tmp$v)))
```
```
  map_values(v)
1             6
2             6
3             4
4             6
5             8
6             6
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon 

Closes #18809 from HyukjinKwon/map-keys-values-r.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/97ba4918
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/97ba4918
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/97ba4918

Branch: refs/heads/master
Commit: 97ba4918368ba15334427bdd91230829ece606f6
Parents: e7c59b4
Author: hyukjinkwon 
Authored: Thu Aug 3 23:00:00 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 3 23:00:00 2017 +0900

--
 R/pkg/NAMESPACE   |  2 ++
 R/pkg/R/functions.R   | 33 +-
 R/pkg/R/generics.R| 10 +
 R/pkg/tests/fulltests/test_sparkSQL.R |  8 
 4 files changed, 52 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/97ba4918/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 232f5cf..a1dd1af 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -286,6 +286,8 @@ exportMethods("%<=>%",
   "lower",
   "lpad",
   "ltrim",
+  "map_keys",
+  "map_values",
   "max",
   "md5",
   "mean",

http://git-wip-us.apache.org/repos/asf/spark/blob/97ba4918/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 86507f1..5a46d73 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -195,7 +195,10 @@ NULL
 #' head(tmp2)
 #' head(select(tmp, posexplode(tmp$v1)))
 #' head(select(tmp, sort_array(tmp$v1)))
-#' head(select(tmp, sort_array(tmp$v1, asc = FALSE)))}
+#' head(select(tmp, sort_array(tmp$v1, asc = FALSE)))
+#' tmp3 <- mutate(df, v3 = create_map(df$model, df$cyl))
+#' head(select(tmp3, map_keys(tmp3$v3)))
+#' head(select(tmp3, map_values(tmp3$v3)))}
 NULL
 
 #' Window functions for Column operations
@@ -3056,6 +3059,34 @@ setMethod("array_contains",
   })
 
 #' @details
+#' \code{map_keys}: Returns an unordered array containing the keys of the map.
+#'
+#' @rdname column_collection_functions
+#' @aliases map_keys map_keys,Column-method
+#' @export
+#' @note map_keys since 2.3.0
+setMethod("map_keys",
+  signature(x = "Column"),
+  function(x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "map_keys", 
x@jc)
+column(jc)
+ })
+
+#' @details
+#' \code{map_values}: Returns an unordered array containing the values of the 
map.
+#'
+#' @rdname column_collection_functions
+#' @aliases map_values map_values,Column-method
+#' @export
+#' @note map_values since 2.3.0
+setMethod("map_values",
+  signature(x = "Column"),
+  function(x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "map_values", 
x@jc)
+column(jc)
+  })
+
+#' @details
 #' \code{explode}: Creates a new row for each element in the given array or 
map column.
 #'
 #' @rdname column_collection_functions

http://git-wip-us.apache.org/repos/asf/spark/blob/97ba4918/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 9209874..df91c35 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1213,6 +1213,16 @@ setGeneric("lpad", function(x, len, pad) { 
standardGeneric("lpad") })
 #' @name NULL
 setGeneric("ltrim", function(x) { standardGeneric("ltrim") })
 
+#' @rdname column_collection_functions
+#' @export
+#' @name NULL
+setGeneric("map_keys", function(x) { standardGeneric("map_keys") })
+
+#' @rdname column_collection_functions
+#' @export
+#' @name NULL
+setGeneric("map_values", function(x) { standardGeneric("map_values") })
+
 #' @rdname column_misc_functions
 #' @export
 #' @name NULL


spark git commit: [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry

2017-08-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 58da1a245 -> 77cc0d67d


[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry

## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, 
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race 
condition can occur when broadcast variables pickled from one thread are added 
to the shared `_pickled_broadcast_vars` and end up in the Python command built 
by another thread. This PR introduces a thread-safe pickle registry backed by 
thread-local storage, so that when a Python command is pickled (causing its 
broadcast variables to be pickled and added to the registry), each thread has 
its own view of the registry from which to retrieve and clear the broadcast 
variables it used.

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler 

Closes #18695 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77cc0d67
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77cc0d67
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77cc0d67

Branch: refs/heads/master
Commit: 77cc0d67d5a7ea526f8efd37b2590923953cb8e0
Parents: 58da1a2
Author: Bryan Cutler 
Authored: Wed Aug 2 07:12:23 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Aug 2 07:12:23 2017 +0900

--
 python/pyspark/broadcast.py | 19 +
 python/pyspark/context.py   |  4 ++--
 python/pyspark/tests.py | 44 
 3 files changed, 65 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/77cc0d67/python/pyspark/broadcast.py
--
diff --git a/python/pyspark/broadcast.py b/python/pyspark/broadcast.py
index b1b59f7..02fc515 100644
--- a/python/pyspark/broadcast.py
+++ b/python/pyspark/broadcast.py
@@ -19,6 +19,7 @@ import os
 import sys
 import gc
 from tempfile import NamedTemporaryFile
+import threading
 
 from pyspark.cloudpickle import print_exec
 from pyspark.util import _exception_message
@@ -139,6 +140,24 @@ class Broadcast(object):
 return _from_id, (self._jbroadcast.id(),)
 
 
+class BroadcastPickleRegistry(threading.local):
+""" Thread-local registry for broadcast variables that have been pickled
+"""
+
+def __init__(self):
+self.__dict__.setdefault("_registry", set())
+
+def __iter__(self):
+for bcast in self._registry:
+yield bcast
+
+def add(self, bcast):
+self._registry.add(bcast)
+
+def clear(self):
+self._registry.clear()
+
+
 if __name__ == "__main__":
 import doctest
 (failure_count, test_count) = doctest.testmod()

http://git-wip-us.apache.org/repos/asf/spark/blob/77cc0d67/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 80cb48f..a704604 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -30,7 +30,7 @@ from py4j.protocol import Py4JError
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
-from pyspark.broadcast import Broadcast
+from pyspark.broadcast import Broadcast, BroadcastPickleRegistry
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
@@ -195,7 +195,7 @@ class SparkContext(object):
 # This allows other code to determine which Broadcast instances have
 # been pickled, so it can determine which Java broadcast objects to
 # send.
-self._pickled_broadcast_vars = set()
+self._pickled_broadcast_vars = BroadcastPickleRegistry()
 
 SparkFiles._sc = self
 root_dir = SparkFiles.getRootDirectory()

http://git-wip-us.apache.org/repos/asf/spark/blob/77cc0d67/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 73ab442..000dd1e 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -858,6 +858,50 @@ class RDDTests(ReusedPySparkTestCase):
 self.assertEqual(N, size)
 self.assertEqual(checksum, csum)
 
+def test_multithread_broadcast_pickle(self):
+import threading
+
+b1 = self.sc.broadcast(list(range(3)))
+b2 = self.sc.broadcast(list(range(3)))
+
+def f1():
+return b1.value
+
+def f2():
+return b2.value
+
+funcs_num_pickled = {f1: None, f2: None}
+
+def 

spark git commit: [SPARK-21712][PYSPARK] Clarify type error for Column.substr()

2017-08-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 42b9eda80 -> 966083105


[SPARK-21712][PYSPARK] Clarify type error for Column.substr()

Proposed changes:
* Clarify the type error that `Column.substr()` gives.

Test plan:
* Tested this manually.
* Test code:
```python
from pyspark.sql.functions import col, lit
spark.createDataFrame([['nick']], 
schema=['name']).select(col('name').substr(0, lit(1)))
```
* Before:
```
TypeError: Can not mix the type
```
* After:
```
TypeError: startPos and length must be the same type. Got <class 'int'> and
<class 'pyspark.sql.column.Column'>, respectively.
```
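
For reference, a small hedged sketch (assuming an active `SparkSession` bound 
to `spark`) of the calls that `substr()` does accept, i.e. both arguments as 
plain integers or both as Columns:

```python
# Column.substr() requires startPos and length to be the same type:
# either both Python ints or both Columns. Mixing them raises the
# clarified TypeError shown above.
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([['nick']], schema=['name'])

df.select(col('name').substr(1, 3)).show()            # both ints
df.select(col('name').substr(lit(1), lit(3))).show()  # both Columns
```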

Author: Nicholas Chammas 

Closes #18926 from nchammas/SPARK-21712-substr-type-error.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96608310
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96608310
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96608310

Branch: refs/heads/master
Commit: 96608310501a43fa4ab9f2697f202d655dba98c5
Parents: 42b9eda
Author: Nicholas Chammas 
Authored: Wed Aug 16 11:19:15 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Aug 16 11:19:15 2017 +0900

--
 python/pyspark/sql/column.py | 10 --
 python/pyspark/sql/tests.py  | 12 
 2 files changed, 20 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/96608310/python/pyspark/sql/column.py
--
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index e753ed4..b172f38 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -406,8 +406,14 @@ class Column(object):
 [Row(col=u'Ali'), Row(col=u'Bob')]
 """
 if type(startPos) != type(length):
-raise TypeError("Can not mix the type")
-if isinstance(startPos, (int, long)):
+raise TypeError(
+"startPos and length must be the same type. "
+"Got {startPos_t} and {length_t}, respectively."
+.format(
+startPos_t=type(startPos),
+length_t=type(length),
+))
+if isinstance(startPos, int):
 jc = self._jc.substr(startPos, length)
 elif isinstance(startPos, Column):
 jc = self._jc.substr(startPos._jc, length._jc)

http://git-wip-us.apache.org/repos/asf/spark/blob/96608310/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index cf2c473..45a3f9e 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1220,6 +1220,18 @@ class SQLTests(ReusedPySparkTestCase):
 rndn2 = df.select('key', functions.randn(0)).collect()
 self.assertEqual(sorted(rndn1), sorted(rndn2))
 
+def test_string_functions(self):
+from pyspark.sql.functions import col, lit
+df = self.spark.createDataFrame([['nick']], schema=['name'])
+self.assertRaisesRegexp(
+TypeError,
+"must be the same type",
+lambda: df.select(col('name').substr(0, lit(1
+if sys.version_info.major == 2:
+self.assertRaises(
+TypeError,
+lambda: df.select(col('name').substr(long(0), long(1
+
 def test_array_contains_function(self):
 from pyspark.sql.functions import array_contains
 





spark git commit: [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12

2017-08-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master da8c59bde -> b0bdfce9c


[MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12

## What changes were proposed in this pull request?

This is trivial, but bugged me. We should download software over HTTPS.
And we can use RAT 0.12 while at it to pick up bug fixes.

## How was this patch tested?

N/A

Author: Sean Owen 

Closes #18927 from srowen/Rat012.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0bdfce9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0bdfce9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0bdfce9

Branch: refs/heads/master
Commit: b0bdfce9cae986096f327e2c7a5bdaa900dedc32
Parents: da8c59b
Author: Sean Owen 
Authored: Sat Aug 12 14:31:05 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Aug 12 14:31:05 2017 +0900

--
 dev/appveyor-install-dependencies.ps1 | 2 +-
 dev/check-license | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b0bdfce9/dev/appveyor-install-dependencies.ps1
--
diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index a357fbf..e6afb18 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -26,7 +26,7 @@ Function InstallR {
   }
 
   $urlPath = ""
-  $latestVer = $(ConvertFrom-JSON $(Invoke-WebRequest 
http://rversions.r-pkg.org/r-release-win).Content).version
+  $latestVer = $(ConvertFrom-JSON $(Invoke-WebRequest 
https://rversions.r-pkg.org/r-release-win).Content).version
   If ($rVer -ne $latestVer) {
 $urlPath = ("old/" + $rVer + "/")
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/b0bdfce9/dev/check-license
--
diff --git a/dev/check-license b/dev/check-license
index 678e73f..8cee09a 100755
--- a/dev/check-license
+++ b/dev/check-license
@@ -20,7 +20,7 @@
 
 acquire_rat_jar () {
 
-  
URL="http://repo1.maven.org/maven2/org/apache/rat/apache-rat/${RAT_VERSION}/apache-rat-${RAT_VERSION}.jar;
+  
URL="https://repo1.maven.org/maven2/org/apache/rat/apache-rat/${RAT_VERSION}/apache-rat-${RAT_VERSION}.jar;
 
   JAR="$rat_jar"
 
@@ -58,7 +58,7 @@ else
 declare java_cmd=java
 fi
 
-export RAT_VERSION=0.11
+export RAT_VERSION=0.12
 export rat_jar="$FWDIR"/lib/apache-rat-${RAT_VERSION}.jar
 mkdir -p "$FWDIR"/lib
 





spark git commit: [SPARK-21658][SQL][PYSPARK] Add default None for value in na.replace in PySpark

2017-08-14 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6847e93cf -> 0fcde87aa


[SPARK-21658][SQL][PYSPARK] Add default None for value in na.replace in PySpark

## What changes were proposed in this pull request?
JIRA issue: https://issues.apache.org/jira/browse/SPARK-21658

Add a default of `None` for `value` in `na.replace`, since `DataFrame.replace` 
and `DataFrameNaFunctions.replace` are aliases.

The default values are the same now.
```
>>> df = sqlContext.createDataFrame([('Alice', 10, 80.0)])
>>> df.replace({"Alice": "a"}).first()
Row(_1=u'a', _2=10, _3=80.0)
>>> df.na.replace({"Alice": "a"}).first()
Row(_1=u'a', _2=10, _3=80.0)
```

## How was this patch tested?
Existing tests.

cc viirya

Author: byakuinss 

Closes #18895 from byakuinss/SPARK-21658.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fcde87a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fcde87a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fcde87a

Branch: refs/heads/master
Commit: 0fcde87aadc9a92e138f11583119465ca4b5c518
Parents: 6847e93
Author: byakuinss 
Authored: Tue Aug 15 00:41:01 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Aug 15 00:41:01 2017 +0900

--
 python/pyspark/sql/dataframe.py | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0fcde87a/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index edc7ca6..5cd208b 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1403,6 +1403,16 @@ class DataFrame(object):
 |null|  null|null|
 +----+------+----+
 
+>>> df4.na.replace('Alice').show()
++----+------+----+
+| age|height|name|
++----+------+----+
+|  10|    80|null|
+|   5|  null| Bob|
+|null|  null| Tom|
+|null|  null|null|
++----+------+----+
+
 >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
 +----+------+----+
 | age|height|name|
@@ -1837,7 +1847,7 @@ class DataFrameNaFunctions(object):
 
 fill.__doc__ = DataFrame.fillna.__doc__
 
-def replace(self, to_replace, value, subset=None):
+def replace(self, to_replace, value=None, subset=None):
 return self.df.replace(to_replace, value, subset)
 
 replace.__doc__ = DataFrame.replace.__doc__





spark-website git commit: Update committer page

2017-07-28 Thread gurwls223
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 6ff5039f3 -> 0e09b2f58


Update committer page


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/0e09b2f5
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/0e09b2f5
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/0e09b2f5

Branch: refs/heads/asf-site
Commit: 0e09b2f580b32e16a6eef81e520e909174ebdb4d
Parents: 6ff5039
Author: hyukjinkwon 
Authored: Fri Jul 28 10:22:44 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Jul 28 10:30:43 2017 +0900

--
 committers.md| 1 +
 site/committers.html | 4 
 2 files changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/0e09b2f5/committers.md
--
diff --git a/committers.md b/committers.md
index e850f8b..a4965cb 100644
--- a/committers.md
+++ b/committers.md
@@ -30,6 +30,7 @@ navigation:
 |Shane Huang|Intel|
 |Holden Karau|IBM|
 |Andy Konwinski|Databricks|
+|Hyukjin Kwon|Mobigen|
 |Ryan LeCompte|Quantifind|
 |Haoyuan Li|Alluxio, UC Berkeley|
 |Xiao Li|Databricks|

http://git-wip-us.apache.org/repos/asf/spark-website/blob/0e09b2f5/site/committers.html
--
diff --git a/site/committers.html b/site/committers.html
index b3137ca..f69529d 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -285,6 +285,10 @@
   Databricks
 
 
+  Hyukjin Kwon
+  Mobigen
+
+
   Ryan LeCompte
   Quantifind
 





spark git commit: [SPARKR][BUILD] AppVeyor change to latest R version

2017-08-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1ba967b25 -> d4e7f20f5


[SPARKR][BUILD] AppVeyor change to latest R version

## What changes were proposed in this pull request?

R version update

## How was this patch tested?

AppVeyor

Author: Felix Cheung 

Closes #18856 from felixcheung/rappveyorver.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4e7f20f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4e7f20f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4e7f20f

Branch: refs/heads/master
Commit: d4e7f20f5416a5c1b726337a5b2f104bd02495e3
Parents: 1ba967b
Author: Felix Cheung 
Authored: Sun Aug 6 19:51:35 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Aug 6 19:51:35 2017 +0900

--
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d4e7f20f/dev/appveyor-install-dependencies.ps1
--
diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index 1c34f1b..cf82389 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -114,7 +114,7 @@ $env:Path += ";$env:HADOOP_HOME\bin"
 Pop-Location
 
 # == R
-$rVer = "3.3.1"
+$rVer = "3.4.1"
 $rToolsVer = "3.4.0"
 
 InstallR





spark git commit: [MINOR] Minor comment fixes in merge_spark_pr.py script

2017-07-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6830e90de -> f1a798b57


[MINOR] Minor comment fixes in merge_spark_pr.py script

## What changes were proposed in this pull request?

This PR proposes to fix a few typos in `merge_spark_pr.py`.

- `#   usage: ./apache-pr-merge.py(see config env vars below)`
  -> `#   usage: ./merge_spark_pr.py(see config env vars below)`

- `... have local a Spark ...` -> `... have a local Spark ...`

- `... to Apache.` -> `... to Apache Spark.`

I skimmed this file and these look like all I could find.

## How was this patch tested?

pep8 check (`./dev/lint-python`).

Author: hyukjinkwon 

Closes #18776 from HyukjinKwon/minor-merge-script.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f1a798b5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f1a798b5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f1a798b5

Branch: refs/heads/master
Commit: f1a798b5763abb5fca3aed592c3114dab5aefda2
Parents: 6830e90
Author: hyukjinkwon 
Authored: Mon Jul 31 10:07:33 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Jul 31 10:07:33 2017 +0900

--
 dev/merge_spark_pr.py | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f1a798b5/dev/merge_spark_pr.py
--
diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 4bacb38..28971b8 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -17,10 +17,11 @@
 # limitations under the License.
 #
 
-# Utility for creating well-formed pull request merges and pushing them to 
Apache.
-#   usage: ./apache-pr-merge.py(see config env vars below)
+# Utility for creating well-formed pull request merges and pushing them to 
Apache
+# Spark.
+#   usage: ./merge_spark_pr.py(see config env vars below)
 #
-# This utility assumes you already have local a Spark git folder and that you
+# This utility assumes you already have a local Spark git folder and that you
 # have added remotes corresponding to both (i) the github apache Spark
 # mirror and (ii) the apache git repo.
 





spark git commit: [MINOR][R][BUILD] More reliable detection of R version for Windows in AppVeyor

2017-08-08 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ee1304199 -> 08ef7d718


[MINOR][R][BUILD] More reliable detection of R version for Windows in AppVeyor

## What changes were proposed in this pull request?

This PR proposes to use https://rversions.r-pkg.org/r-release-win instead of 
https://rversions.r-pkg.org/r-release to check R's version for Windows 
correctly.

We met a syncing problem with a Windows release before (see #15709). To cut this 
short, this is what happened:

- The 3.3.2 release came out, but not for Windows for a few hours.
- `https://rversions.r-pkg.org/r-release` returned 3.3.2 as the latest, so our 
script built the 3.3.1 download link under `windows/base/old`.
- 3.3.2 was not released for Windows yet.
- 3.3.1 was therefore still under `windows/base` as the latest, not `windows/base/old`.
- Downloading via the `windows/base/old` link failed and the builds were broken.

I believe we are not the only ones to have met this problem; see 
https://github.com/krlmlr/r-appveyor/commit/01ce943929993bbf27facd2cdc20ae2e03808eb4.
Also, the `r-release-win` API came out between 3.3.1 and 3.3.2 (presumably to 
deal with this issue); see `https://github.com/metacran/rversions.app/issues/2`.

Using this API will prevent the problem, although it looks quite rare judging 
from the commit logs in 
https://github.com/metacran/rversions.app/commits/master. Since 3.3.2, both 
`r-release-win` and `r-release` have been updated together.
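
As an aside, here is a hedged Python sketch of the version check the AppVeyor 
script performs, based only on the PowerShell shown in the diff below; the 
function name `r_download_path_prefix` is made up, and the only assumption 
about the endpoint is what the script itself relies on (JSON with a `version` 
field).

```python
# Query the r-pkg.org API for the latest R release built for Windows and
# decide whether the pinned version must be fetched from the "old/" path,
# mirroring the $urlPath logic in dev/appveyor-install-dependencies.ps1.
import json
from urllib.request import urlopen


def r_download_path_prefix(pinned_version):
    with urlopen("https://rversions.r-pkg.org/r-release-win") as resp:
        latest = json.load(resp)["version"]
    # Only the latest release sits directly under windows/base; older
    # releases are moved to windows/base/old/<version>/.
    return "" if pinned_version == latest else "old/" + pinned_version + "/"
```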

## How was this patch tested?

AppVeyor tests.

Author: hyukjinkwon 

Closes #18859 from HyukjinKwon/use-reliable-link.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/08ef7d71
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/08ef7d71
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/08ef7d71

Branch: refs/heads/master
Commit: 08ef7d71875378e324dd23c6d2739e606799c818
Parents: ee13041
Author: hyukjinkwon 
Authored: Tue Aug 8 23:18:59 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Aug 8 23:18:59 2017 +0900

--
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/08ef7d71/dev/appveyor-install-dependencies.ps1
--
diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index cf82389..a357fbf 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -26,7 +26,7 @@ Function InstallR {
   }
 
   $urlPath = ""
-  $latestVer = $(ConvertFrom-JSON $(Invoke-WebRequest 
http://rversions.r-pkg.org/r-release).Content).version
+  $latestVer = $(ConvertFrom-JSON $(Invoke-WebRequest 
http://rversions.r-pkg.org/r-release-win).Content).version
   If ($rVer -ne $latestVer) {
 $urlPath = ("old/" + $rVer + "/")
   }





spark git commit: [INFRA] Close stale PRs

2017-08-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 894d5a453 -> 3a45c7fee


[INFRA] Close stale PRs

## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances as in #18017.

Closes #14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a 
directory …
Closes #14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to 
accelerate shuffle stage.
Closes #14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import 
reorganisation
Closes #14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should 
return Python context managers
Closes #14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS 
key…
Closes #14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on 
Pyspark examples
Closes #14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in 
lint-python
Closes #15227 - [SPARK-17655][SQL]Remove unused variables declarations and 
definations in a WholeStageCodeGened stage
Closes #15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for 
broadcast joins
Closes #15405 - [SPARK-15917][CORE] Added support for number of executors in 
Standalone [WIP]
Closes #16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user 
cancel job
Closes #16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes #16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes #16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes #16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive 
partition columns
Closes #17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch 
DataFrame which has an aggregation may not work
Closes #17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes #17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` 
metadata of time column
Closes #17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes #17519 - [SPARK-15352][Doc] follow-up: add configuration docs for 
topology-aware block replication
Closes #17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes #17854 - [SPARK-20564][Deploy] Reduce massive executor failures when 
executor count is large (>2000)
Closes #17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on 
number of gpus required in each spark executor when running on mesos
Closes #18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when 
executing sql statement 'insert into' on hbase table
Closes #18236 - [SPARK-21015] Check field name is not null and empty in 
GenericRowWit…
Closes #18269 - [SPARK-21056][SQL] Use at most one spark job to list files in 
InMemoryFileIndex
Closes #18328 - [SPARK-21121][SQL] Support changing storage level via the 
spark.sql.inMemoryColumnarStorage.level variable
Closes #18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: 
Constant Pool Limit - Class Splitting
Closes #18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes #18414 - [SPARK-21169] [core] Make sure to update application status to 
RUNNING if executors are accepted and RUNNING after recovery
Closes #18432 - resolve com.esotericsoftware.kryo.KryoException
Closes #18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable 
maxReqSizeShuffleToMem and KryoSerializer
Closes #18585 - SPARK-21359
Closes #18609 - Spark SQL merge small files to big files Update 
InsertIntoHiveTable.scala

Added:
Closes #18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect 
Executor I…
Closes #18599 - [SPARK-21372] spark writes one log file even I set the number 
of spark_rotate_log to 0
Closes #18619 - [SPARK-21397][BUILD]Maven shade plugin adding 
dependency-reduced-pom.xml to …
Closes #18667 - Fix the simpleString used in error messages
Closes #18782 - Branch 2.1

Added:
Closes #17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark 
broadcasts when using multiple threads

Added:
Closes #16456 - [SPARK-18994] clean up the local directories for application in 
future by annother thread
Closes #18683 - [SPARK-21474][CORE] Make number of parallel fetches from a 
reducer configurable
Closes #18690 - [SPARK-21334][CORE] Add metrics reporting service to External 
Shuffle Server

Added:
Closes #18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #18780 from HyukjinKwon/close-prs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a45c7fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a45c7fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a45c7fe

Branch: refs/heads/master
Commit: 3a45c7fee6190270505d32409184b6ed1ed7b52b
Parents: 894d5a4
Author: hyukjinkwon 
Authored: Sat Aug 5 21:58:38 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Aug 5 21:58:38 2017 +0900


spark git commit: [SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

2017-08-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 310454be3 -> 07a2b8738


[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

## What changes were proposed in this pull request?
Dataset.sample requires a boolean flag withReplacement as the first argument. 
However, most of the time users simply want to sample some records without 
replacement. This ticket introduces a new sample function that simply takes in 
the fraction and seed.

## How was this patch tested?
Tested manually. Not sure yet if we should add a test case for just this 
wrapper ...

Author: Reynold Xin 

Closes #18988 from rxin/SPARK-21778.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07a2b873
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07a2b873
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07a2b873

Branch: refs/heads/master
Commit: 07a2b8738ed8e6c136545d03f91a865de05e41a0
Parents: 310454b
Author: Reynold Xin 
Authored: Fri Aug 18 23:58:20 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Aug 18 23:58:20 2017 +0900

--
 .../scala/org/apache/spark/sql/Dataset.scala| 36 ++--
 1 file changed, 34 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07a2b873/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index a9887eb..615686c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1849,10 +1849,42 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without 
replacement),
+   * using a user-supplied seed.
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   * @param seed Seed for sampling.
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double, seed: Long): Dataset[T] = {
+sample(withReplacement = false, fraction = fraction, seed = seed)
+  }
+
+  /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without 
replacement).
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double): Dataset[T] = {
+sample(withReplacement = false, fraction = fraction)
+  }
+
+  /**
* Returns a new [[Dataset]] by sampling a fraction of rows, using a 
user-supplied seed.
*
* @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
* @param seed Seed for sampling.
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
@@ -1871,7 +1903,7 @@ class Dataset[T] private[sql](
* Returns a new [[Dataset]] by sampling a fraction of rows, using a random 
seed.
*
* @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
*
* @note This is NOT guaranteed to provide exactly the fraction of the total 
count
* of the given [[Dataset]].





spark git commit: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR

2017-09-14 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 054ddb2f5 -> a28728a9a


[SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to 
json for PySpark and SparkR

## What changes were proposed in this pull request?
In the previous work for SPARK-21513, we allowed `MapType` and `ArrayType` of 
`MapType`s to be converted to a JSON string, but only for the Scala API. In this 
follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also 
fix some small bugs and comments from the previous work.

### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice"}')]
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.

cc viirya HyukjinKwon

Author: goldmedal 

Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a28728a9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a28728a9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a28728a9

Branch: refs/heads/master
Commit: a28728a9afcff94194147573e07f6f4d0463687e
Parents: 054ddb2
Author: goldmedal 
Authored: Fri Sep 15 11:53:10 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Sep 15 11:53:10 2017 +0900

--
 R/pkg/R/functions.R | 16 +++---
 R/pkg/tests/fulltests/test_sparkSQL.R   |  8 +++
 python/pyspark/sql/functions.py | 22 ++--
 .../catalyst/expressions/jsonExpressions.scala  |  8 +++
 .../sql/catalyst/json/JacksonGenerator.scala|  2 +-
 .../sql-tests/results/json-functions.sql.out|  8 +++
 6 files changed, 46 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a28728a9/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 5a46d73..e92e1fd 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -176,7 +176,8 @@ NULL
 #'
 #' @param x Column to compute on. Note the difference in the following methods:
 #'  \itemize{
-#'  \item \code{to_json}: it is the column containing the struct or 
array of the structs.
+#'  \item \code{to_json}: it is the column containing the struct, 
array of the structs,
+#'  the map or array of maps.
 #'  \item \code{from_json}: it is the column containing the JSON 
string.
 #'  }
 #' @param ... additional argument(s). In \code{to_json} and \code{from_json}, 
this contains
@@ -1700,8 +1701,9 @@ setMethod("to_date",
   })
 
 #' @details
-#' \code{to_json}: Converts a column containing a \code{structType} or array 
of \code{structType}
-#' into a Column of JSON string. Resolving the Column can fail if an 
unsupported type is encountered.
+#' \code{to_json}: Converts a column containing a \code{structType}, array of 
\code{structType},
+#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
+#' Resolving the Column can fail if an unsupported type is encountered.
 #'
 #' @rdname column_collection_functions
 #' @aliases to_json to_json,Column-method
@@ -1715,6 +1717,14 @@ setMethod("to_date",
 #'
 #' # Converts an array of structs into a JSON array
 #' df2 <- sql("SELECT array(named_struct('name', 'Bob'), named_struct('name', 
'Alice')) as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts a map into a JSON object
+#' df2 <- sql("SELECT map('name', 'Bob')) as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts an array of maps into a JSON array
+#' df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as 
people")
 #' df2 <- mutate(df2, people_json = to_json(df2$people))}
 #' @note to_json since 2.2.0
 setMethod("to_json", signature(x = "Column"),

http://git-wip-us.apache.org/repos/asf/spark/blob/a28728a9/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 

spark git commit: [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

2017-09-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f180b6534 -> c11f24a94


[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

## What changes were proposed in this pull request?

Fix the setup of `SPARK_JARS_DIR` on Windows: the script checks for the 
`%SPARK_HOME%\RELEASE` file when it should check for `%SPARK_HOME%\jars`, 
since the RELEASE file is not included in the `pip` build of PySpark.

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki 

Closes #19310 from jsnowacki/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c11f24a9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c11f24a9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c11f24a9

Branch: refs/heads/master
Commit: c11f24a94007bbaad0835645843e776507094071
Parents: f180b65
Author: Jakub Nowacki 
Authored: Sat Sep 23 21:04:10 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Sep 23 21:04:10 2017 +0900

--
 bin/spark-class2.cmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c11f24a9/bin/spark-class2.cmd
--
diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index f6157f4..a93fd2f 100644
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -29,7 +29,7 @@ if "x%1"=="x" (
 )
 
 rem Find Spark jars.
-if exist "%SPARK_HOME%\RELEASE" (
+if exist "%SPARK_HOME%\jars" (
   set SPARK_JARS_DIR="%SPARK_HOME%\jars"
 ) else (
   set 
SPARK_JARS_DIR="%SPARK_HOME%\assembly\target\scala-%SPARK_SCALA_VERSION%\jars"





spark git commit: [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

2017-09-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 de6274a58 -> c0a34a9ff


[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

## What changes were proposed in this pull request?

Fix the setup of `SPARK_JARS_DIR` on Windows: the script checks for the 
`%SPARK_HOME%\RELEASE` file when it should check for `%SPARK_HOME%\jars`, 
since the RELEASE file is not included in the `pip` build of PySpark.

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki 

Closes #19310 from jsnowacki/master.

(cherry picked from commit c11f24a94007bbaad0835645843e776507094071)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c0a34a9f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c0a34a9f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c0a34a9f

Branch: refs/heads/branch-2.2
Commit: c0a34a9fff0912b3f1ae508e43f1fae53a45afae
Parents: de6274a
Author: Jakub Nowacki 
Authored: Sat Sep 23 21:04:10 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Sep 23 21:04:26 2017 +0900

--
 bin/spark-class2.cmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c0a34a9f/bin/spark-class2.cmd
--
diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index f6157f4..a93fd2f 100644
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -29,7 +29,7 @@ if "x%1"=="x" (
 )
 
 rem Find Spark jars.
-if exist "%SPARK_HOME%\RELEASE" (
+if exist "%SPARK_HOME%\jars" (
   set SPARK_JARS_DIR="%SPARK_HOME%\jars"
 ) else (
   set 
SPARK_JARS_DIR="%SPARK_HOME%\assembly\target\scala-%SPARK_SCALA_VERSION%\jars"





spark git commit: [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

2017-09-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 03db72149 -> 0b3e7cc6a


[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

## What changes were proposed in this pull request?

Fix the setup of `SPARK_JARS_DIR` on Windows: the script checks for the 
`%SPARK_HOME%\RELEASE` file when it should check for `%SPARK_HOME%\jars`, 
since the RELEASE file is not included in the `pip` build of PySpark.

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki 

Closes #19310 from jsnowacki/master.

(cherry picked from commit c11f24a94007bbaad0835645843e776507094071)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b3e7cc6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b3e7cc6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b3e7cc6

Branch: refs/heads/branch-2.1
Commit: 0b3e7cc6ac29c38b04dbfd6d6bf81fe9e2ebd7db
Parents: 03db721
Author: Jakub Nowacki 
Authored: Sat Sep 23 21:04:10 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Sep 23 21:05:04 2017 +0900

--
 bin/spark-class2.cmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0b3e7cc6/bin/spark-class2.cmd
--
diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index f6157f4..a93fd2f 100644
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -29,7 +29,7 @@ if "x%1"=="x" (
 )
 
 rem Find Spark jars.
-if exist "%SPARK_HOME%\RELEASE" (
+if exist "%SPARK_HOME%\jars" (
   set SPARK_JARS_DIR="%SPARK_HOME%\jars"
 ) else (
   set 
SPARK_JARS_DIR="%SPARK_HOME%\assembly\target\scala-%SPARK_SCALA_VERSION%\jars"





spark git commit: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-21 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1da5822e6 -> a8d9ec8a6


[SPARK-21780][R] Simpler Dataset.sample API in R

## What changes were proposed in this pull request?

This PR makes `sample(...)` able to omit `withReplacement`, defaulting it to `FALSE`.

In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```

In addition, this PR also adds some type checking logics as below:

```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
  fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
  seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
  withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction 
(-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon 

Closes #19243 from HyukjinKwon/SPARK-21780.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a8d9ec8a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a8d9ec8a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a8d9ec8a

Branch: refs/heads/master
Commit: a8d9ec8a60f21abb520b9109b238f914d2449022
Parents: 1da5822
Author: hyukjinkwon 
Authored: Thu Sep 21 20:16:25 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Sep 21 20:16:25 2017 +0900

--
 R/pkg/R/DataFrame.R   | 40 --
 R/pkg/R/generics.R|  4 +--
 R/pkg/tests/fulltests/test_sparkSQL.R | 14 +++
 3 files changed, 43 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a8d9ec8a/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 1b46c1e..0728141 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -986,10 +986,10 @@ setMethod("unique",
 #' @param x A SparkDataFrame
 #' @param withReplacement Sampling with replacement or not
 #' @param fraction The (rough) sample target fraction
-#' @param seed Randomness seed value
+#' @param seed Randomness seed value. Default is a random seed.
 #'
 #' @family SparkDataFrame functions
-#' @aliases sample,SparkDataFrame,logical,numeric-method
+#' @aliases sample,SparkDataFrame-method
 #' @rdname sample
 #' @name sample
 #' @export
@@ -998,33 +998,47 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction = 0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed = 3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-  signature(x = "SparkDataFrame", withReplacement = "logical",
-fraction = "numeric"),
-  function(x, withReplacement, fraction, seed) {
-if (fraction < 0.0) stop(cat("Negative fraction value:", fraction))
+  signature(x = "SparkDataFrame"),
+  function(x, withReplacement = FALSE, fraction, seed) {
+if (!is.numeric(fraction)) {
+  stop(paste("fraction must be numeric; however, got", 
class(fraction)))
+}
+if (!is.logical(withReplacement)) {
+  stop(paste("withReplacement must be logical; however, got", 
class(withReplacement)))
+}
+
 if (!missing(seed)) {
+  if (is.null(seed)) {
+stop("seed must not be NULL or NA; however, got NULL")
+  }
+  if (is.na(seed)) {
+stop("seed must not be NULL or NA; however, got NA")
+  }
+
   # TODO : Figure out how to send integer as java.lang.Long to JVM 
so
   # we can send seed as an argument through callJMethod
-  sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, 
as.integer(seed))
+  sdf <- handledCallJMethod(x@sdf, "sample", 
as.logical(withReplacement),
+as.numeric(fraction), as.integer(seed))
 } else {
-  sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction)
+  sdf <- handledCallJMethod(x@sdf, "sample",
+as.logical(withReplacement), 
as.numeric(fraction))
 }
 dataFrame(sdf)
   })
 
 #' @rdname sample
-#' @aliases 

spark git commit: [SPARK-22086][DOCS] Add expression description for CASE WHEN

2017-09-21 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1d1a09be9 -> 1270e7175


[SPARK-22086][DOCS] Add expression description for CASE WHEN

## What changes were proposed in this pull request?

Among the SQL conditional expressions, only CASE WHEN lacks an expression
description. This patch fills the gap.

## How was this patch tested?

Only documentation change.

Author: Liang-Chi Hsieh 

Closes #19304 from viirya/casewhen-doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1270e717
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1270e717
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1270e717

Branch: refs/heads/master
Commit: 1270e71753f40c353fb726a0a3d373d181aedb40
Parents: 1d1a09b
Author: Liang-Chi Hsieh 
Authored: Thu Sep 21 22:45:06 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Sep 21 22:45:06 2017 +0900

--
 .../expressions/conditionalExpressions.scala   | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1270e717/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
index b59b6de..d95b59d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
@@ -223,7 +223,22 @@ abstract class CaseWhenBase(
  */
 // scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] 
END - When `expr1` = true, returns `expr2`; when `expr3` = true, return 
`expr4`; else return `expr5`.")
+  usage = "CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] 
END - When `expr1` = true, returns `expr2`; else when `expr3` = true, returns 
`expr4`; else returns `expr5`.",
+  arguments = """
+Arguments:
+  * expr1, expr3 - the branch condition expressions should all be boolean 
type.
+  * expr2, expr4, expr5 - the branch value expressions and else value 
expression should all be
+  same type or coercible to a common type.
+  """,
+  examples = """
+Examples:
+  > SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
+   1
+  > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
+   2
+  > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;
+   NULL
+  """)
 // scalastyle:on line.size.limit
 case class CaseWhen(
 val branches: Seq[(Expression, Expression)],


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22032][PYSPARK] Speed up StructType conversion

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 73d906722 -> f4073020a


[SPARK-22032][PYSPARK] Speed up StructType conversion

## What changes were proposed in this pull request?

StructType.fromInternal calls f.fromInternal(v) for every field.
We can use precalculated type information to limit the number of function calls
(it is calculated once per StructType and reused in per-record conversions).
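For illustration, a minimal standalone sketch of the precalculation idea (an assumed simplification, not the actual pyspark internals):

```python
# Decide once per schema which fields need conversion, then reuse that list for
# every record instead of calling needConversion() once per field per row.
class Field(object):
    def __init__(self, name, converter=None):
        self.name = name
        self.converter = converter  # None means "no conversion needed"

    def needConversion(self):
        return self.converter is not None

    def fromInternal(self, value):
        return self.converter(value) if self.converter else value


class Schema(object):
    def __init__(self, fields):
        self.fields = fields
        # Computed once per Schema, not once per record.
        self._needConversion = [f.needConversion() for f in fields]
        self._needSerializeAnyField = any(self._needConversion)

    def fromInternal(self, row):
        if not self._needSerializeAnyField:
            return row  # fast path: no per-field calls at all
        return tuple(f.fromInternal(v) if needed else v
                     for f, v, needed in zip(self.fields, row, self._needConversion))


schema = Schema([Field("id"), Field("ts", converter=str)])
print(schema.fromInternal((1, 1505692484)))  # (1, '1505692484')
```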

Benchmarks (Python profiler)
```
df = spark.range(1000).selectExpr("id as id0", "id as id1", "id as id2", 
"id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", 
"id as id9", "struct(id) as s").cache()
df.count()
df.rdd.map(lambda x: x).count()
```

Before
```
310274584 function calls (300272456 primitive calls) in 1320.684 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1000  253.417    0.000  486.991    0.000 types.py:619()
 3000  192.272    0.000 1009.986    0.000 types.py:612(fromInternal)
1  176.140    0.000  176.140    0.000 types.py:88(fromInternal)
 2000  156.832    0.000  328.093    0.000 types.py:1471(_create_row)
14000  107.206    0.008 1237.917    0.088 {built-in method loads}
 2000   80.176    0.000 1090.162    0.000 types.py:1468()
```

After
```
210274584 function calls (200272456 primitive calls) in 1035.974 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 3000  215.845    0.000  698.748    0.000 types.py:612(fromInternal)
 2000  165.042    0.000  351.572    0.000 types.py:1471(_create_row)
14000  116.834    0.008  946.791    0.068 {built-in method loads}
 2000   87.326    0.000  786.073    0.000 types.py:1468()
 2000   85.477    0.000  134.607    0.000 types.py:1519(__new__)
 1000   65.777    0.000  126.712    0.000 types.py:619()
```

The main difference is in types.py:619() and types.py:88(fromInternal)
(the latter is removed in the After profile).
The number of function calls is 100 million lower, and performance is 20% better.

Benchmark (worst-case scenario)

Test
```
df = spark.range(100).selectExpr("current_timestamp as id0", 
"current_timestamp as id1", "current_timestamp as id2", "current_timestamp as 
id3", "current_timestamp as id4", "current_timestamp as id5", 
"current_timestamp as id6", "current_timestamp as id7", "current_timestamp as 
id8", "current_timestamp as id9").cache()
df.count()
df.rdd.map(lambda x: x).count()
```

Before
```
31166064 function calls (31163984 primitive calls) in 150.882 seconds
```

After
```
31166064 function calls (31163984 primitive calls) in 153.220 seconds
```

IMPORTANT:
The benchmark was done on top of https://github.com/apache/spark/pull/19246.
Without https://github.com/apache/spark/pull/19246 the performance improvement 
will be even greater.

## How was this patch tested?

Existing tests.
Performance benchmark.

Author: Maciej Bryński 

Closes #19249 from maver1ck/spark_22032.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4073020
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4073020
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4073020

Branch: refs/heads/master
Commit: f4073020adf9752c7d7b39631ec3fa36d6345902
Parents: 73d9067
Author: Maciej Bryński 
Authored: Mon Sep 18 02:34:44 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 02:34:44 2017 +0900

--
 python/pyspark/sql/types.py | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f4073020/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 920cf00..aaf520f 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -483,7 +483,9 @@ class StructType(DataType):
 self.names = [f.name for f in fields]
 assert all(isinstance(f, StructField) for f in fields),\
 "fields should be a list of StructField"
-self._needSerializeAnyField = any(f.needConversion() for f in self)
+# Precalculated list of fields that need conversion with 
fromInternal/toInternal functions
+self._needConversion = [f.needConversion() for f in self]
+self._needSerializeAnyField = any(self._needConversion)
 
 def add(self, field, data_type=None, nullable=True, metadata=None):
 """
@@ -528,7 +530,9 @@ class StructType(DataType):
 data_type_f = data_type
 self.fields.append(StructField(field, data_type_f, nullable, 
metadata))
 self.names.append(field)

spark git commit: [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 51e5a821d -> 42852bb17


[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?
(edited)
Fixes a bug introduced in #16121

In PairDeserializer, convert each batch of keys and values to lists (if they do
not have `__len__` already) so that we can check that they are the same size.
Normally they are already lists, so this should not have a performance impact,
but it is needed when repeated `zip`s are done.
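As a standalone illustration of the guard (a hypothetical helper, not the actual PySpark code): under Python 3 a batch that is itself an iterator, e.g. produced by a nested `zip`, has no `__len__`, so it is materialized before the size check.

```python
def check_same_size(key_batch, val_batch):
    # Iterators (e.g. from a nested zip) have no __len__; materialize them first.
    key_batch = key_batch if hasattr(key_batch, '__len__') else list(key_batch)
    val_batch = val_batch if hasattr(val_batch, '__len__') else list(val_batch)
    if len(key_batch) != len(val_batch):
        raise ValueError("Can not deserialize PairRDD with different number of items"
                         " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
    return key_batch, val_batch


print(check_same_size(zip([1, 2, 3], "xyz"), ['a', 'b', 'c']))
# ([(1, 'x'), (2, 'y'), (3, 'z')], ['a', 'b', 'c'])
```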

## How was this patch tested?

Additional unit test

Author: Andrew Ray 

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/42852bb1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/42852bb1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/42852bb1

Branch: refs/heads/branch-2.2
Commit: 42852bb17121fb8067a4aea3e56d56f76a2e0d1d
Parents: 51e5a82
Author: Andrew Ray 
Authored: Mon Sep 18 02:46:27 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 02:46:47 2017 +0900

--
 python/pyspark/serializers.py |  6 +-
 python/pyspark/tests.py   | 12 
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/42852bb1/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index ea5e00e..9bd4e55 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -97,7 +97,7 @@ class Serializer(object):
 
 def _load_stream_without_unbatching(self, stream):
 """
-Return an iterator of deserialized batches (lists) of objects from the 
input stream.
+Return an iterator of deserialized batches (iterable) of objects from 
the input stream.
 if the serializer does not operate on batches the default 
implementation returns an
 iterator of single element lists.
 """
@@ -326,6 +326,10 @@ class PairDeserializer(Serializer):
 key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
 val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
 for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+# For double-zipped RDDs, the batches can be iterators from other 
PairDeserializer,
+# instead of lists. We need to convert them to lists if needed.
+key_batch = key_batch if hasattr(key_batch, '__len__') else 
list(key_batch)
+val_batch = val_batch if hasattr(val_batch, '__len__') else 
list(val_batch)
 if len(key_batch) != len(val_batch):
 raise ValueError("Can not deserialize PairRDD with different 
number of items"
  " in batches: (%d, %d)" % (len(key_batch), 
len(val_batch)))

http://git-wip-us.apache.org/repos/asf/spark/blob/42852bb1/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 20a933e..9f47798 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -644,6 +644,18 @@ class RDDTests(ReusedPySparkTestCase):
 set([(x, (y, y)) for x in range(10) for y in range(10)])
 )
 
+def test_zip_chaining(self):
+# Tests for SPARK-21985
+rdd = self.sc.parallelize('abc', 2)
+self.assertSetEqual(
+set(rdd.zip(rdd).zip(rdd).collect()),
+set([((x, x), x) for x in 'abc'])
+)
+self.assertSetEqual(
+set(rdd.zip(rdd.zip(rdd)).collect()),
+set([(x, (x, x)) for x in 'abc'])
+)
+
 def test_deleting_input_files(self):
 # Regression test for SPARK-1025
 tempFile = tempfile.NamedTemporaryFile(delete=False)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e49c997fe -> 3ae7ab8e8


[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?
(edited)
Fixes a bug introduced in #16121

In PairDeserializer, convert each batch of keys and values to lists (if they do
not have `__len__` already) so that we can check that they are the same size.
Normally they are already lists, so this should not have a performance impact,
but it is needed when repeated `zip`s are done.

## How was this patch tested?

Additional unit test

Author: Andrew Ray 

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ae7ab8e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ae7ab8e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ae7ab8e

Branch: refs/heads/branch-2.1
Commit: 3ae7ab8e82446e6d299a3e344beebb76ebf9dc4c
Parents: e49c997
Author: Andrew Ray 
Authored: Mon Sep 18 02:46:27 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 02:47:06 2017 +0900

--
 python/pyspark/serializers.py |  6 +-
 python/pyspark/tests.py   | 12 
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3ae7ab8e/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index ea5e00e..9bd4e55 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -97,7 +97,7 @@ class Serializer(object):
 
 def _load_stream_without_unbatching(self, stream):
 """
-Return an iterator of deserialized batches (lists) of objects from the 
input stream.
+Return an iterator of deserialized batches (iterable) of objects from 
the input stream.
 if the serializer does not operate on batches the default 
implementation returns an
 iterator of single element lists.
 """
@@ -326,6 +326,10 @@ class PairDeserializer(Serializer):
 key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
 val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
 for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+# For double-zipped RDDs, the batches can be iterators from other 
PairDeserializer,
+# instead of lists. We need to convert them to lists if needed.
+key_batch = key_batch if hasattr(key_batch, '__len__') else 
list(key_batch)
+val_batch = val_batch if hasattr(val_batch, '__len__') else 
list(val_batch)
 if len(key_batch) != len(val_batch):
 raise ValueError("Can not deserialize PairRDD with different 
number of items"
  " in batches: (%d, %d)" % (len(key_batch), 
len(val_batch)))

http://git-wip-us.apache.org/repos/asf/spark/blob/3ae7ab8e/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 25ed127..bd21029 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -579,6 +579,18 @@ class RDDTests(ReusedPySparkTestCase):
 set([(x, (y, y)) for x in range(10) for y in range(10)])
 )
 
+def test_zip_chaining(self):
+# Tests for SPARK-21985
+rdd = self.sc.parallelize('abc', 2)
+self.assertSetEqual(
+set(rdd.zip(rdd).zip(rdd).collect()),
+set([((x, x), x) for x in 'abc'])
+)
+self.assertSetEqual(
+set(rdd.zip(rdd.zip(rdd)).collect()),
+set([(x, (x, x)) for x in 'abc'])
+)
+
 def test_deleting_input_files(self):
 # Regression test for SPARK-1025
 tempFile = tempfile.NamedTemporaryFile(delete=False)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f4073020a -> 6adf67dd1


[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?
(edited)
Fixes a bug introduced in #16121

In PairDeserializer, convert each batch of keys and values to lists (if they do
not have `__len__` already) so that we can check that they are the same size.
Normally they are already lists, so this should not have a performance impact,
but it is needed when repeated `zip`s are done.

## How was this patch tested?

Additional unit test

Author: Andrew Ray 

Closes #19226 from aray/SPARK-21985.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6adf67dd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6adf67dd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6adf67dd

Branch: refs/heads/master
Commit: 6adf67dd14b0ece342bb91adf800df0a7101e038
Parents: f407302
Author: Andrew Ray 
Authored: Mon Sep 18 02:46:27 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 02:46:27 2017 +0900

--
 python/pyspark/serializers.py |  6 +-
 python/pyspark/tests.py   | 12 
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6adf67dd/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index d5c2a75..660b19a 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -97,7 +97,7 @@ class Serializer(object):
 
 def _load_stream_without_unbatching(self, stream):
 """
-Return an iterator of deserialized batches (lists) of objects from the 
input stream.
+Return an iterator of deserialized batches (iterable) of objects from 
the input stream.
 if the serializer does not operate on batches the default 
implementation returns an
 iterator of single element lists.
 """
@@ -343,6 +343,10 @@ class PairDeserializer(Serializer):
 key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
 val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
 for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+# For double-zipped RDDs, the batches can be iterators from other 
PairDeserializer,
+# instead of lists. We need to convert them to lists if needed.
+key_batch = key_batch if hasattr(key_batch, '__len__') else 
list(key_batch)
+val_batch = val_batch if hasattr(val_batch, '__len__') else 
list(val_batch)
 if len(key_batch) != len(val_batch):
 raise ValueError("Can not deserialize PairRDD with different 
number of items"
  " in batches: (%d, %d)" % (len(key_batch), 
len(val_batch)))

http://git-wip-us.apache.org/repos/asf/spark/blob/6adf67dd/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 000dd1e..3c108ec 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -644,6 +644,18 @@ class RDDTests(ReusedPySparkTestCase):
 set([(x, (y, y)) for x in range(10) for y in range(10)])
 )
 
+def test_zip_chaining(self):
+# Tests for SPARK-21985
+rdd = self.sc.parallelize('abc', 2)
+self.assertSetEqual(
+set(rdd.zip(rdd).zip(rdd).collect()),
+set([((x, x), x) for x in 'abc'])
+)
+self.assertSetEqual(
+set(rdd.zip(rdd.zip(rdd)).collect()),
+set([(x, (x, x)) for x in 'abc'])
+)
+
 def test_deleting_input_files(self):
 # Regression test for SPARK-1025
 tempFile = tempfile.NamedTemporaryFile(delete=False)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 309c401a5 -> a86831d61


[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

## What changes were proposed in this pull request?

This PR proposes to improve the error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
```
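For reference, enabling the profiler up front avoids the error entirely (a usage sketch, not part of this patch):

```python
from pyspark import SparkConf, SparkContext

# The profiler collector only exists when spark.python.profile is set to true.
sc = SparkContext(conf=SparkConf().set("spark.python.profile", "true"))
sc.parallelize(range(100)).map(lambda x: x * 2).count()
sc.show_profiles()                 # prints the collected profile stats
sc.dump_profiles("/tmp/profiles")  # writes them into the given directory
sc.stop()
```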

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon 

Closes #19260 from HyukjinKwon/profile-errors.

(cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a86831d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a86831d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a86831d6

Branch: refs/heads/branch-2.2
Commit: a86831d618b05c789c2cea0afe5488c3234a14bc
Parents: 309c401
Author: hyukjinkwon 
Authored: Mon Sep 18 13:20:11 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 13:20:29 2017 +0900

--
 python/pyspark/context.py | 12 ++--
 python/pyspark/tests.py   | 16 
 2 files changed, 26 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a86831d6/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 49be76e..ea58b3a 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -994,12 +994,20 @@ class SparkContext(object):
 
 def show_profiles(self):
 """ Print the profile stats to stdout """
-self.profiler_collector.show_profiles()
+if self.profiler_collector is not None:
+self.profiler_collector.show_profiles()
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def dump_profiles(self, path):
 """ Dump the profile stats into directory `path`
 """
-self.profiler_collector.dump_profiles(path)
+if self.profiler_collector is not None:
+self.profiler_collector.dump_profiles(path)
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def getConf(self):
 conf = SparkConf()

http://git-wip-us.apache.org/repos/asf/spark/blob/a86831d6/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 9f47798..6a96aaf 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1288,6 +1288,22 @@ class ProfilerTests(PySparkTestCase):
 rdd.foreach(heavy_foo)
 
 
+class ProfilerTests2(unittest.TestCase):
+def test_profiler_disabled(self):
+sc = SparkContext(conf=SparkConf().set("spark.python.profile", 
"false"))
+try:
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.show_profiles())
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.dump_profiles("/tmp/abc"))
+finally:
+sc.stop()
+
+
 class InputFormatTests(ReusedPySparkTestCase):
 
 @classmethod



spark git commit: [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 99de4b8f5 -> b35136a9e


[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

## What changes were proposed in this pull request?

This PR proposes to improve the error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
```

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon 

Closes #19260 from HyukjinKwon/profile-errors.

(cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b35136a9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b35136a9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b35136a9

Branch: refs/heads/branch-2.1
Commit: b35136a9e4b72b403434c991e111e667cfe9177d
Parents: 99de4b8
Author: hyukjinkwon 
Authored: Mon Sep 18 13:20:11 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 13:20:48 2017 +0900

--
 python/pyspark/context.py | 12 ++--
 python/pyspark/tests.py   | 16 
 2 files changed, 26 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b35136a9/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 5a4c2fa..c091882 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -970,12 +970,20 @@ class SparkContext(object):
 
 def show_profiles(self):
 """ Print the profile stats to stdout """
-self.profiler_collector.show_profiles()
+if self.profiler_collector is not None:
+self.profiler_collector.show_profiles()
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def dump_profiles(self, path):
 """ Dump the profile stats into directory `path`
 """
-self.profiler_collector.dump_profiles(path)
+if self.profiler_collector is not None:
+self.profiler_collector.dump_profiles(path)
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def getConf(self):
 conf = SparkConf()

http://git-wip-us.apache.org/repos/asf/spark/blob/b35136a9/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index bd21029..61272fe 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1223,6 +1223,22 @@ class ProfilerTests(PySparkTestCase):
 rdd.foreach(heavy_foo)
 
 
+class ProfilerTests2(unittest.TestCase):
+def test_profiler_disabled(self):
+sc = SparkContext(conf=SparkConf().set("spark.python.profile", 
"false"))
+try:
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.show_profiles())
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.dump_profiles("/tmp/abc"))
+finally:
+sc.stop()
+
+
 class InputFormatTests(ReusedPySparkTestCase):
 
 @classmethod



spark git commit: [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

2017-09-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6308c65f0 -> 7c7266208


[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

## What changes were proposed in this pull request?

This PR proposes to improve the error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to 
enable Python profile.
```

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon 

Closes #19260 from HyukjinKwon/profile-errors.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c726620
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c726620
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c726620

Branch: refs/heads/master
Commit: 7c7266208a3be984ac1ce53747dc0c3640f4ecac
Parents: 6308c65
Author: hyukjinkwon 
Authored: Mon Sep 18 13:20:11 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 18 13:20:11 2017 +0900

--
 python/pyspark/context.py | 12 ++--
 python/pyspark/tests.py   | 16 
 2 files changed, 26 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7c726620/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index a704604..a33f6dc 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -997,12 +997,20 @@ class SparkContext(object):
 
 def show_profiles(self):
 """ Print the profile stats to stdout """
-self.profiler_collector.show_profiles()
+if self.profiler_collector is not None:
+self.profiler_collector.show_profiles()
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def dump_profiles(self, path):
 """ Dump the profile stats into directory `path`
 """
-self.profiler_collector.dump_profiles(path)
+if self.profiler_collector is not None:
+self.profiler_collector.dump_profiles(path)
+else:
+raise RuntimeError("'spark.python.profile' configuration must be 
set "
+   "to 'true' to enable Python profile.")
 
 def getConf(self):
 conf = SparkConf()

http://git-wip-us.apache.org/repos/asf/spark/blob/7c726620/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 3c108ec..da99872 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1296,6 +1296,22 @@ class ProfilerTests(PySparkTestCase):
 rdd.foreach(heavy_foo)
 
 
+class ProfilerTests2(unittest.TestCase):
+def test_profiler_disabled(self):
+sc = SparkContext(conf=SparkConf().set("spark.python.profile", 
"false"))
+try:
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.show_profiles())
+self.assertRaisesRegexp(
+RuntimeError,
+"'spark.python.profile' configuration must be set",
+lambda: sc.dump_profiles("/tmp/abc"))
+finally:
+sc.stop()
+
+
 class InputFormatTests(ReusedPySparkTestCase):
 
 @classmethod


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, 

spark git commit: [SPARK-21766][PYSPARK][SQL] DataFrame toPandas() raises ValueError with nullable int columns

2017-09-22 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d2b2932d8 -> 3e6a714c9


[SPARK-21766][PYSPARK][SQL] DataFrame toPandas() raises ValueError with 
nullable int columns

## What changes were proposed in this pull request?

When calling `DataFrame.toPandas()` (without Arrow enabled), if there is an
`IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null
values, the following exception is thrown:

ValueError: Cannot convert non-finite values (NA or inf) to integer

This is because the null values first get converted to float NaN during the
construction of the Pandas DataFrame in `from_records`, and then the attempt to
convert the column back to an integer fails.

The fix is to check whether the Pandas DataFrame would hit such a failure during
the conversion; if so, we skip the conversion and use the type inferred by
Pandas.
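The underlying pandas behaviour can be reproduced without Spark (a small sketch for illustration only):

```python
import numpy as np
import pandas as pd

# Nulls in an integer column are stored as NaN, so pandas infers float64 ...
pdf = pd.DataFrame.from_records([(1, "foo"), (None, "bar")], columns=["a", "b"])
print(pdf.dtypes["a"])  # float64

# ... and forcing the column back to an integer dtype fails on the NaN.
try:
    pdf["a"].astype(np.int32)
except ValueError as e:
    print(e)  # Cannot convert non-finite values (NA or inf) to integer
```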

Closes #18945

## How was this patch tested?

Added pyspark test.

Author: Liang-Chi Hsieh 

Closes #19319 from viirya/SPARK-21766.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3e6a714c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3e6a714c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3e6a714c

Branch: refs/heads/master
Commit: 3e6a714c9ee97ef13b3f2010babded3b63fd9d74
Parents: d2b2932
Author: Liang-Chi Hsieh 
Authored: Fri Sep 22 22:39:47 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Sep 22 22:39:47 2017 +0900

--
 python/pyspark/sql/dataframe.py | 13 ++---
 python/pyspark/sql/tests.py | 12 
 2 files changed, 22 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3e6a714c/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 88ac413..7b81a0b 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -37,6 +37,7 @@ from pyspark.sql.types import _parse_datatype_json_string
 from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
 from pyspark.sql.readwriter import DataFrameWriter
 from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import IntegralType
 from pyspark.sql.types import *
 
 __all__ = ["DataFrame", "DataFrameNaFunctions", "DataFrameStatFunctions"]
@@ -1891,14 +1892,20 @@ class DataFrame(object):
   "if using spark.sql.execution.arrow.enable=true"
 raise ImportError("%s\n%s" % (e.message, msg))
 else:
+pdf = pd.DataFrame.from_records(self.collect(), 
columns=self.columns)
+
 dtype = {}
 for field in self.schema:
 pandas_type = _to_corrected_pandas_type(field.dataType)
-if pandas_type is not None:
+# SPARK-21766: if an integer field is nullable and has null 
values, it can be
+# inferred by pandas as float column. Once we convert the 
column with NaN back
+# to integer type e.g., np.int16, we will hit exception. So we 
use the inferred
+# float type, not the corrected type from the schema in this 
case.
+if pandas_type is not None and \
+not(isinstance(field.dataType, IntegralType) and 
field.nullable and
+pdf[field.name].isnull().any()):
 dtype[field.name] = pandas_type
 
-pdf = pd.DataFrame.from_records(self.collect(), 
columns=self.columns)
-
 for f, t in dtype.items():
 pdf[f] = pdf[f].astype(t, copy=False)
 return pdf

http://git-wip-us.apache.org/repos/asf/spark/blob/3e6a714c/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index ab76c48..3db8bee 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -2564,6 +2564,18 @@ class SQLTests(ReusedPySparkTestCase):
 self.assertEquals(types[2], np.bool)
 self.assertEquals(types[3], np.float32)
 
+@unittest.skipIf(not _have_pandas, "Pandas not installed")
+def test_to_pandas_avoid_astype(self):
+import numpy as np
+schema = StructType().add("a", IntegerType()).add("b", StringType())\
+ .add("c", IntegerType())
+data = [(1, "foo", 16777220), (None, "bar", None)]
+df = self.spark.createDataFrame(data, schema)
+types = df.toPandas().dtypes
+self.assertEquals(types[0], np.float64)  # doesn't convert to np.int32 
due to NaN value.
+self.assertEquals(types[1], np.object)
+

spark git commit: [SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and to_utc_timestamp

2017-09-20 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2b6ff0cef -> e17901d6d


[SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and 
to_utc_timestamp

## What changes were proposed in this pull request?

Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example
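A quick PySpark sketch of the clarified behaviour (assuming a local session; the expected values follow the new docstrings):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_utc_timestamp, to_utc_timestamp

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("2017-07-14 02:40:00",)], ["t"])

# Interpret t as UTC and render it in GMT+1: 2017-07-14 03:40:00
df.select(from_utc_timestamp(df.t, "GMT+1").alias("local_time")).show()
# Interpret t as GMT+1 and render it in UTC: 2017-07-14 01:40:00
df.select(to_utc_timestamp(df.t, "GMT+1").alias("utc_time")).show()
```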

## How was this patch tested?

Doc only change / existing tests

Author: Sean Owen 

Closes #19276 from srowen/SPARK-22049.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e17901d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e17901d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e17901d6

Branch: refs/heads/master
Commit: e17901d6df42edf2c7a3460995a0e954ad9a159f
Parents: 2b6ff0c
Author: Sean Owen 
Authored: Wed Sep 20 20:47:17 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 20 20:47:17 2017 +0900

--
 R/pkg/R/functions.R   | 10 ++
 python/pyspark/sql/functions.py   | 10 ++
 .../catalyst/expressions/datetimeExpressions.scala| 14 --
 .../main/scala/org/apache/spark/sql/functions.scala   | 10 ++
 4 files changed, 26 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e17901d6/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index e92e1fd..9f28626 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2226,8 +2226,9 @@ setMethod("from_json", signature(x = "Column", schema = 
"characterOrstructType")
   })
 
 #' @details
-#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a 
certain time of day in UTC,
-#' returns another timestamp that corresponds to the same time of day in the 
given timezone.
+#' \code{from_utc_timestamp}: Given a timestamp like '2017-07-14 02:40:00.0', 
interprets it as a
+#' time in UTC, and renders that time as a timestamp in the given time zone. 
For example, 'GMT+1'
+#' would yield '2017-07-14 03:40:00.0'.
 #'
 #' @rdname column_datetime_diff_functions
 #'
@@ -2286,8 +2287,9 @@ setMethod("next_day", signature(y = "Column", x = 
"character"),
   })
 
 #' @details
-#' \code{to_utc_timestamp}: Given a timestamp, which corresponds to a certain 
time of day
-#' in the given timezone, returns another timestamp that corresponds to the 
same time of day in UTC.
+#' \code{to_utc_timestamp}: Given a timestamp like '2017-07-14 02:40:00.0', 
interprets it as a
+#' time in the given time zone, and renders that time as a timestamp in UTC. 
For example, 'GMT+1'
+#' would yield '2017-07-14 01:40:00.0'.
 #'
 #' @rdname column_datetime_diff_functions
 #' @aliases to_utc_timestamp to_utc_timestamp,Column,character-method

http://git-wip-us.apache.org/repos/asf/spark/blob/e17901d6/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 399bef0..57068fb 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1150,8 +1150,9 @@ def unix_timestamp(timestamp=None, format='-MM-dd 
HH:mm:ss'):
 @since(1.5)
 def from_utc_timestamp(timestamp, tz):
 """
-Given a timestamp, which corresponds to a certain time of day in UTC, 
returns another timestamp
-that corresponds to the same time of day in the given timezone.
+Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in 
UTC, and renders
+that time as a timestamp in the given time zone. For example, 'GMT+1' 
would yield
+'2017-07-14 03:40:00.0'.
 
 >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
 >>> df.select(from_utc_timestamp(df.t, 
"PST").alias('local_time')).collect()
@@ -1164,8 +1165,9 @@ def from_utc_timestamp(timestamp, tz):
 @since(1.5)
 def to_utc_timestamp(timestamp, tz):
 """
-Given a timestamp, which corresponds to a certain time of day in the given 
timezone, returns
-another timestamp that corresponds to the same time of day in UTC.
+Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in 
the given time
+zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' 
would yield
+'2017-07-14 01:40:00.0'.
 
 >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['ts'])
 >>> df.select(to_utc_timestamp(df.ts, "PST").alias('utc_time')).collect()

http://git-wip-us.apache.org/repos/asf/spark/blob/e17901d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
--
diff --git 

spark git commit: [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows command scripts

2017-10-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 0c03297bf -> c7b46d4d8


[SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows command scripts

## What changes were proposed in this pull request?

None of the Windows command scripts can handle quotes in parameters.

Running a Windows command shell with a parameter that contains quotes reproduces
the bug:

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell 
--driver-java-options " -Dfile.encoding=utf-8 "
'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" 
--driver-java-options "' is not recognized as an internal or external command,
operable program or batch file.
```

Windows recognizes "--driver-java-options" as part of the command.
All the Windows command scripts that contain the following code have the bug.

```
cmd /V /E /C "" %*
```

We should quote the command and its parameters like

```
cmd /V /E /C """ %*"
```

## How was this patch tested?

Test manually on Windows 10 and Windows 7

We can verify it by the following demo:

```
C:\Users\meng\program\demo>cat a.cmd
echo off
cmd /V /E /C "b.cmd" %*

C:\Users\meng\program\demo>cat b.cmd
echo off
echo %*

C:\Users\meng\program\demo>cat c.cmd
echo off
cmd /V /E /C ""b.cmd" %*"

C:\Users\meng\program\demo>a.cmd "123"
'b.cmd" "123' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\meng\program\demo>c.cmd "123"
"123"
```

Taking spark-shell.cmd as an example, changing it to the following code makes the
command execute successfully.

```
cmd /V /E /C ""%~dp0spark-shell2.cmd" %*"
```

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell  
--driver-java-options " -Dfile.encoding=utf-8 "
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
...

```

Author: minixalpha 

Closes #19090 from minixalpha/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7b46d4d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7b46d4d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7b46d4d

Branch: refs/heads/master
Commit: c7b46d4d8aa8da24131d79d2bfa36e8db19662e4
Parents: 0c03297
Author: minixalpha 
Authored: Fri Oct 6 23:38:47 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Oct 6 23:38:47 2017 +0900

--
 bin/beeline.cmd  | 4 +++-
 bin/pyspark.cmd  | 4 +++-
 bin/run-example.cmd  | 5 -
 bin/spark-class.cmd  | 4 +++-
 bin/spark-shell.cmd  | 4 +++-
 bin/spark-submit.cmd | 4 +++-
 bin/sparkR.cmd   | 4 +++-
 7 files changed, 22 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c7b46d4d/bin/beeline.cmd
--
diff --git a/bin/beeline.cmd b/bin/beeline.cmd
index 02464bd..288059a 100644
--- a/bin/beeline.cmd
+++ b/bin/beeline.cmd
@@ -17,4 +17,6 @@ rem See the License for the specific language governing 
permissions and
 rem limitations under the License.
 rem
 
-cmd /V /E /C "%~dp0spark-class.cmd" org.apache.hive.beeline.BeeLine %*
+rem The outermost quotes are used to prevent Windows command line parse error
+rem when there are some quotes in parameters, see SPARK-21877.
+cmd /V /E /C ""%~dp0spark-class.cmd" org.apache.hive.beeline.BeeLine %*"

http://git-wip-us.apache.org/repos/asf/spark/blob/c7b46d4d/bin/pyspark.cmd
--
diff --git a/bin/pyspark.cmd b/bin/pyspark.cmd
index 72d046a..3dcf1d4 100644
--- a/bin/pyspark.cmd
+++ b/bin/pyspark.cmd
@@ -20,4 +20,6 @@ rem
 rem This is the entry point for running PySpark. To avoid polluting the
 rem environment, it just launches a new cmd to do the real work.
 
-cmd /V /E /C "%~dp0pyspark2.cmd" %*
+rem The outermost quotes are used to prevent Windows command line parse error
+rem when there are some quotes in parameters, see SPARK-21877.
+cmd /V /E /C ""%~dp0pyspark2.cmd" %*"

http://git-wip-us.apache.org/repos/asf/spark/blob/c7b46d4d/bin/run-example.cmd
--
diff --git a/bin/run-example.cmd b/bin/run-example.cmd
index f9b786e..efa5f81 100644
--- a/bin/run-example.cmd
+++ b/bin/run-example.cmd
@@ -19,4 +19,7 @@ rem
 
 set SPARK_HOME=%~dp0..
 set _SPARK_CMD_USAGE=Usage: ./bin/run-example [options] example-class [example 
args]
-cmd /V /E /C "%~dp0spark-submit.cmd" run-example %*
+
+rem The outermost quotes are used to prevent Windows command line parse error
+rem when there are some quotes in parameters, see SPARK-21877.
+cmd /V /E /C ""%~dp0spark-submit.cmd" run-example %*"


spark git commit: [SPARK-20396][SQL][PYSPARK] groupby().apply() with pandas udf

2017-10-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2028e5a82 -> bfc7e1fe1


[SPARK-20396][SQL][PYSPARK] groupby().apply() with pandas udf

## What changes were proposed in this pull request?

This PR adds an apply() function on df.groupby(). apply() takes a pandas udf 
that is a transformation on `pandas.DataFrame` -> `pandas.DataFrame`.

Static schema
---
```
schema = df.schema

pandas_udf(schema)
def normalize(df):
df = df.assign(v1 = (df.v1 - df.v1.mean()) / df.v1.std())
return df

df.groupBy('id').apply(normalize)
```
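The semantics can be sketched with pandas alone (an assumed simplification of what `groupBy('id').apply(...)` does: hand each group to the UDF as a `pandas.DataFrame` and concatenate the results):

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2, 2], "v1": [1.0, 2.0, 3.0, 5.0]})

def normalize(g):
    # Same transformation as above, applied to one group at a time.
    return g.assign(v1=(g.v1 - g.v1.mean()) / g.v1.std())

result = pd.concat(normalize(g) for _, g in pdf.groupby("id"))
print(result)
```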
Dynamic schema
---
**This use case is removed from the PR and we will discuss this as a follow up. 
See discussion 
https://github.com/apache/spark/pull/18732#pullrequestreview-66583248**

Another example to use pd.DataFrame dtypes as output schema of the udf:

```
sample_df = df.filter(df.id == 1).toPandas()

def foo(df):
  ret = # Some transformation on the input pd.DataFrame
  return ret

foo_udf = pandas_udf(foo, foo(sample_df).dtypes)

df.groupBy('id').apply(foo_udf)
```
In the interactive use case, users usually have a sample pd.DataFrame to test
function `foo` in their notebook. Being able to use `foo(sample_df).dtypes` frees
users from specifying the output schema of `foo`.

Design doc: 
https://github.com/icexelloss/spark/blob/pandas-udf-doc/docs/pyspark-pandas-udf.md

## How was this patch tested?
* Added GroupbyApplyTest

Author: Li Jin 
Author: Takuya UESHIN 
Author: Bryan Cutler 

Closes #18732 from icexelloss/groupby-apply-SPARK-20396.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfc7e1fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfc7e1fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfc7e1fe

Branch: refs/heads/master
Commit: bfc7e1fe1ad5f9777126f2941e29bbe51ea5da7c
Parents: 2028e5a
Author: Li Jin 
Authored: Wed Oct 11 07:32:01 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Oct 11 07:32:01 2017 +0900

--
 python/pyspark/sql/dataframe.py |   6 +-
 python/pyspark/sql/functions.py |  98 +---
 python/pyspark/sql/group.py |  88 ++-
 python/pyspark/sql/tests.py | 157 ++-
 python/pyspark/sql/types.py |   2 +-
 python/pyspark/worker.py|  35 +++--
 .../sql/catalyst/optimizer/Optimizer.scala  |   2 +
 .../plans/logical/pythonLogicalOperators.scala  |  39 +
 .../spark/sql/RelationalGroupedDataset.scala|  36 -
 .../spark/sql/execution/SparkStrategies.scala   |   2 +
 .../execution/python/ArrowEvalPythonExec.scala  |  39 -
 .../execution/python/ArrowPythonRunner.scala|  15 +-
 .../execution/python/ExtractPythonUDFs.scala|   8 +-
 .../python/FlatMapGroupsInPandasExec.scala  | 103 
 14 files changed, 561 insertions(+), 69 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bfc7e1fe/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index fe69e58..2d59622 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1227,7 +1227,7 @@ class DataFrame(object):
 """
 jgd = self._jdf.groupBy(self._jcols(*cols))
 from pyspark.sql.group import GroupedData
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self)
 
 @since(1.4)
 def rollup(self, *cols):
@@ -1248,7 +1248,7 @@ class DataFrame(object):
 """
 jgd = self._jdf.rollup(self._jcols(*cols))
 from pyspark.sql.group import GroupedData
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self)
 
 @since(1.4)
 def cube(self, *cols):
@@ -1271,7 +1271,7 @@ class DataFrame(object):
 """
 jgd = self._jdf.cube(self._jcols(*cols))
 from pyspark.sql.group import GroupedData
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self)
 
 @since(1.3)
 def agg(self, *exprs):

http://git-wip-us.apache.org/repos/asf/spark/blob/bfc7e1fe/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index b45a59d..9bc12c3 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2058,7 +2058,7 @@ class UserDefinedFunction(object):
 self._name = name or (
 func.__name__ if hasattr(func, '__name__')
 else func.__class__.__name__)
-

spark git commit: [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

2017-10-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master c8affec21 -> ae61f187a


[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

## What changes were proposed in this pull request?

It looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider
empty grouping attributes. This becomes a problem when running
`EnsureRequirements`, so `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh 

Closes #19436 from viirya/fix-flatmapinr-distribution.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae61f187
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae61f187
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae61f187

Branch: refs/heads/master
Commit: ae61f187aa0471242c046fdeac6ed55b9b98a3f6
Parents: c8affec
Author: Liang-Chi Hsieh 
Authored: Thu Oct 5 23:36:18 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 5 23:36:18 2017 +0900

--
 R/pkg/tests/fulltests/test_sparkSQL.R  | 5 +
 .../main/scala/org/apache/spark/sql/execution/objects.scala| 6 +-
 2 files changed, 10 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ae61f187/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 7f781f2..bbea25b 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", 
{
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  df1 <- gapply(df, c(), function(key, x) { x }, schema(df))
+  actual <- collect(df1)
+  expect_identical(actual, expected)
+
   # Computes the sum of second column by grouping on the first and third 
columns
   # and checks if the sum is larger than 2
   schemas <- list(structType(structField("a", "integer"), structField("e", 
"boolean")),

http://git-wip-us.apache.org/repos/asf/spark/blob/ae61f187/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
index 5a3fcad..c68975b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
@@ -394,7 +394,11 @@ case class FlatMapGroupsInRExec(
   override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr)
 
   override def requiredChildDistribution: Seq[Distribution] =
-ClusteredDistribution(groupingAttributes) :: Nil
+if (groupingAttributes.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingAttributes) :: Nil
+}
 
   override def requiredChildOrdering: Seq[Seq[SortOrder]] =
 Seq(groupingAttributes.map(SortOrder(_, Ascending)))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

2017-10-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 5bb4e931b -> 920372a19


[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

## What changes were proposed in this pull request?

It looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider
empty grouping attributes. This becomes a problem when running
`EnsureRequirements`, so `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh 

Closes #19436 from viirya/fix-flatmapinr-distribution.

(cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/920372a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/920372a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/920372a1

Branch: refs/heads/branch-2.1
Commit: 920372a19994b460e1de05ca6d7f3e3acd80dd37
Parents: 5bb4e93
Author: Liang-Chi Hsieh 
Authored: Thu Oct 5 23:36:18 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 5 23:37:22 2017 +0900

--
 R/pkg/tests/fulltests/test_sparkSQL.R  | 5 +
 .../main/scala/org/apache/spark/sql/execution/objects.scala| 6 +-
 2 files changed, 10 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/920372a1/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 07de45b..fedca67 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2556,6 +2556,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", 
{
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  df1 <- gapply(df, c(), function(key, x) { x }, schema(df))
+  actual <- collect(df1)
+  expect_identical(actual, expected)
+
   # Computes the sum of second column by grouping on the first and third 
columns
   # and checks if the sum is larger than 2
   schema <- structType(structField("a", "integer"), structField("e", 
"boolean"))

http://git-wip-us.apache.org/repos/asf/spark/blob/920372a1/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
index fde3b2a..b063436 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
@@ -369,7 +369,11 @@ case class FlatMapGroupsInRExec(
   override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr)
 
   override def requiredChildDistribution: Seq[Distribution] =
-ClusteredDistribution(groupingAttributes) :: Nil
+if (groupingAttributes.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingAttributes) :: Nil
+}
 
   override def requiredChildOrdering: Seq[Seq[SortOrder]] =
 Seq(groupingAttributes.map(SortOrder(_, Ascending)))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

2017-10-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 81232ce03 -> 8a4e7dd89


[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider 
empty grouping attributes. This becomes a problem when running 
`EnsureRequirements`, and `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh 

Closes #19436 from viirya/fix-flatmapinr-distribution.

(cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a4e7dd8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a4e7dd8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a4e7dd8

Branch: refs/heads/branch-2.2
Commit: 8a4e7dd896be20c560097c88ffd79c0d6c30d017
Parents: 81232ce
Author: Liang-Chi Hsieh 
Authored: Thu Oct 5 23:36:18 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 5 23:36:56 2017 +0900

--
 R/pkg/tests/fulltests/test_sparkSQL.R  | 5 +
 .../main/scala/org/apache/spark/sql/execution/objects.scala| 6 +-
 2 files changed, 10 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8a4e7dd8/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index fc69b4d..12d8fef 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2740,6 +2740,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", 
{
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  df1 <- gapply(df, c(), function(key, x) { x }, schema(df))
+  actual <- collect(df1)
+  expect_identical(actual, expected)
+
   # Computes the sum of second column by grouping on the first and third 
columns
   # and checks if the sum is larger than 2
   schema <- structType(structField("a", "integer"), structField("e", 
"boolean"))

http://git-wip-us.apache.org/repos/asf/spark/blob/8a4e7dd8/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
index 3439181..3643ef3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
@@ -397,7 +397,11 @@ case class FlatMapGroupsInRExec(
   override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr)
 
   override def requiredChildDistribution: Seq[Distribution] =
-ClusteredDistribution(groupingAttributes) :: Nil
+if (groupingAttributes.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingAttributes) :: Nil
+}
 
   override def requiredChildOrdering: Seq[Seq[SortOrder]] =
 Seq(groupingAttributes.map(SortOrder(_, Ascending)))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22233][CORE] Allow user to filter out empty split in HadoopRDD

2017-10-14 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master e0503a722 -> 014dc8471


[SPARK-22233][CORE] Allow user to filter out empty split in HadoopRDD

## What changes were proposed in this pull request?
Add a flag spark.files.ignoreEmptySplits. When true, methods that use 
HadoopRDD and NewHadoopRDD, such as SparkContext.textFiles, will not create a 
partition for input splits that are empty.
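
A minimal Scala sketch of how the flag is meant to be used; the input path is 
hypothetical, and the flag must be set before the SparkContext is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable the new flag up front; "/tmp/input-with-empty-part-files" is a made-up path.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("ignore-empty-splits")
  .set("spark.files.ignoreEmptySplits", "true")
val sc = new SparkContext(conf)

// Directories that contain empty part files no longer yield empty partitions.
val rdd = sc.textFile("/tmp/input-with-empty-part-files")
println(rdd.getNumPartitions)
```
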

Author: liulijia 

Closes #19464 from liutang123/SPARK-22233.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/014dc847
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/014dc847
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/014dc847

Branch: refs/heads/master
Commit: 014dc8471200518d63005eed531777d30d8a6639
Parents: e0503a7
Author: liulijia 
Authored: Sat Oct 14 17:37:33 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Oct 14 17:37:33 2017 +0900

--
 .../apache/spark/internal/config/package.scala  |  6 ++
 .../scala/org/apache/spark/rdd/HadoopRDD.scala  | 12 ++-
 .../org/apache/spark/rdd/NewHadoopRDD.scala | 13 ++-
 .../test/scala/org/apache/spark/FileSuite.scala | 95 ++--
 4 files changed, 112 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/014dc847/core/src/main/scala/org/apache/spark/internal/config/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index 19336f8..ce013d6 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -270,6 +270,12 @@ package object config {
 .longConf
 .createWithDefault(4 * 1024 * 1024)
 
+  private[spark] val IGNORE_EMPTY_SPLITS = 
ConfigBuilder("spark.files.ignoreEmptySplits")
+.doc("If true, methods that use HadoopRDD and NewHadoopRDD such as " +
+  "SparkContext.textFiles will not create a partition for input splits 
that are empty.")
+.booleanConf
+.createWithDefault(false)
+
   private[spark] val SECRET_REDACTION_PATTERN =
 ConfigBuilder("spark.redaction.regex")
   .doc("Regex to decide which Spark configuration properties and 
environment variables in " +

http://git-wip-us.apache.org/repos/asf/spark/blob/014dc847/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index 23b3442..1f33c0a 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -35,7 +35,7 @@ import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.deploy.SparkHadoopUtil
 import org.apache.spark.internal.Logging
-import org.apache.spark.internal.config.IGNORE_CORRUPT_FILES
+import org.apache.spark.internal.config.{IGNORE_CORRUPT_FILES, 
IGNORE_EMPTY_SPLITS}
 import org.apache.spark.rdd.HadoopRDD.HadoopMapPartitionsWithSplitRDD
 import org.apache.spark.scheduler.{HDFSCacheTaskLocation, HostTaskLocation}
 import org.apache.spark.storage.StorageLevel
@@ -134,6 +134,8 @@ class HadoopRDD[K, V](
 
   private val ignoreCorruptFiles = sparkContext.conf.get(IGNORE_CORRUPT_FILES)
 
+  private val ignoreEmptySplits = sparkContext.getConf.get(IGNORE_EMPTY_SPLITS)
+
   // Returns a JobConf that will be used on slaves to obtain input splits for 
Hadoop reads.
   protected def getJobConf(): JobConf = {
 val conf: Configuration = broadcastedConf.value.value
@@ -195,8 +197,12 @@ class HadoopRDD[K, V](
 val jobConf = getJobConf()
 // add the credentials here as this can be called before SparkContext 
initialized
 SparkHadoopUtil.get.addCredentials(jobConf)
-val inputFormat = getInputFormat(jobConf)
-val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, 
minPartitions)
+val inputSplits = if (ignoreEmptySplits) {
+  allInputSplits.filter(_.getLength > 0)
+} else {
+  allInputSplits
+}
 val array = new Array[Partition](inputSplits.size)
 for (i <- 0 until inputSplits.size) {
   array(i) = new HadoopPartition(id, i, inputSplits(i))

http://git-wip-us.apache.org/repos/asf/spark/blob/014dc847/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala 

spark git commit: [SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in Optimizer in test mode

2017-09-08 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master dbb824125 -> 0dfc1ec59


[SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in 
Optimizer in test mode

## What changes were proposed in this pull request?

The condition in `Optimizer.isPlanIntegral` is wrong: `Utils.isTesting && 
plan.resolved` evaluates to `false` outside test mode, so every plan would be 
flagged as structurally broken in production. The check should always return 
`true` when not in test mode.

## How was this patch tested?

Manually test.

Author: Liang-Chi Hsieh 

Closes #19161 from viirya/SPARK-21726-followup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0dfc1ec5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0dfc1ec5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0dfc1ec5

Branch: refs/heads/master
Commit: 0dfc1ec59e45c836cb968bc9b77c69bf0e917b06
Parents: dbb8241
Author: Liang-Chi Hsieh 
Authored: Fri Sep 8 20:21:37 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Sep 8 20:21:37 2017 +0900

--
 .../scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0dfc1ec5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 2426a8b..a602894 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -41,7 +41,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
   // Check for structural integrity of the plan in test mode. Currently we 
only check if a plan is
   // still resolved after the execution of each rule.
   override protected def isPlanIntegral(plan: LogicalPlan): Boolean = {
-Utils.isTesting && plan.resolved
+!Utils.isTesting || plan.resolved
   }
 
   protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21875][BUILD] Fix Java style bugs

2017-08-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d8f454086 -> 313c6ca43


[SPARK-21875][BUILD] Fix Java style bugs

## What changes were proposed in this pull request?

Fix Java code style so `./dev/lint-java` succeeds

## How was this patch tested?

Run `./dev/lint-java`

Author: Andrew Ash 

Closes #19088 from ash211/spark-21875-lint-java.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/313c6ca4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/313c6ca4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/313c6ca4

Branch: refs/heads/master
Commit: 313c6ca43593e247ab8cedac15c77d13e2830d6b
Parents: d8f4540
Author: Andrew Ash 
Authored: Thu Aug 31 09:26:11 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 31 09:26:11 2017 +0900

--
 core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java | 3 ++-
 .../src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/313c6ca4/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
--
diff --git a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java 
b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
index 0f1e902..44b60c1 100644
--- a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
+++ b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
@@ -74,7 +74,8 @@ public class TaskMemoryManager {
* Maximum supported data page size (in bytes). In principle, the maximum 
addressable page size is
* (1L  OFFSET_BITS) bytes, which is 2+ petabytes. However, the 
on-heap allocator's
* maximum page size is limited by the maximum amount of data that can be 
stored in a long[]
-   * array, which is (2^31 - 1) * 8 bytes (or about 17 gigabytes). Therefore, 
we cap this at 17 gigabytes.
+   * array, which is (2^31 - 1) * 8 bytes (or about 17 gigabytes). Therefore, 
we cap this at 17
+   * gigabytes.
*/
   public static final long MAXIMUM_PAGE_SIZE_BYTES = ((1L << 31) - 1) * 8L;
 

http://git-wip-us.apache.org/repos/asf/spark/blob/313c6ca4/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
index 3e57403..13b006f 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
@@ -1337,7 +1337,8 @@ public class JavaDatasetSuite implements Serializable {
 public boolean equals(Object other) {
   if (other instanceof BeanWithEnum) {
 BeanWithEnum beanWithEnum = (BeanWithEnum) other;
-return beanWithEnum.regularField.equals(regularField) && 
beanWithEnum.enumField.equals(enumField);
+return beanWithEnum.regularField.equals(regularField)
+  && beanWithEnum.enumField.equals(enumField);
   }
   return false;
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21839][SQL] Support SQL config for ORC compression

2017-08-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6949a9c5c -> d8f454086


[SPARK-21839][SQL] Support SQL config for ORC compression

## What changes were proposed in this pull request?

This PR aims to support `spark.sql.orc.compression.codec` like Parquet's 
`spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC 
compression, too.
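
A hedged Scala sketch of how a user could pick the codec after this change. 
The output paths are hypothetical, and Hive support is assumed since ORC 
support still lives in the Hive module at this point:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orc-compression")
  .enableHiveSupport()   // ORC writer is provided by the Hive module here
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Session-wide default via the new SQLConf entry.
spark.conf.set("spark.sql.orc.compression.codec", "zlib")
df.write.mode("overwrite").orc("/tmp/orc_session_codec")

// The per-write `compression` option still overrides the session default.
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_option_codec")
```
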

## How was this patch tested?

Pass the Jenkins with new and updated test cases.

Author: Dongjoon Hyun 

Closes #19055 from dongjoon-hyun/SPARK-21839.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8f45408
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8f45408
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8f45408

Branch: refs/heads/master
Commit: d8f45408635d4fccac557cb1e877dfe9267fb326
Parents: 6949a9c
Author: Dongjoon Hyun 
Authored: Thu Aug 31 08:16:58 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 31 08:16:58 2017 +0900

--
 python/pyspark/sql/readwriter.py|  5 ++--
 .../org/apache/spark/sql/internal/SQLConf.scala | 10 +++
 .../org/apache/spark/sql/DataFrameWriter.scala  |  8 --
 .../spark/sql/hive/orc/OrcFileFormat.scala  |  2 +-
 .../apache/spark/sql/hive/orc/OrcOptions.scala  | 18 +++-
 .../spark/sql/hive/orc/OrcSourceSuite.scala | 29 ++--
 6 files changed, 57 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d8f45408/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 01da0dc..cb847a0 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -851,8 +851,9 @@ class DataFrameWriter(OptionUtils):
 :param partitionBy: names of partitioning columns
 :param compression: compression codec to use when saving to file. This 
can be one of the
 known case-insensitive shorten names (none, 
snappy, zlib, and lzo).
-This will override ``orc.compress``. If None is 
set, it uses the
-default value, ``snappy``.
+This will override ``orc.compress`` and
+``spark.sql.orc.compression.codec``. If None is 
set, it uses the value
+specified in ``spark.sql.orc.compression.codec``.
 
 >>> orc_df = spark.read.orc('python/test_support/sql/orc_partitioned')
 >>> orc_df.write.orc(os.path.join(tempfile.mkdtemp(), 'data'))

http://git-wip-us.apache.org/repos/asf/spark/blob/d8f45408/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index a685099..c407874 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -322,6 +322,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
+.doc("Sets the compression codec use when writing ORC files. Acceptable 
values include: " +
+  "none, uncompressed, snappy, zlib, lzo.")
+.stringConf
+.transform(_.toLowerCase(Locale.ROOT))
+.checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
+.createWithDefault("snappy")
+
   val ORC_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.orc.filterPushdown")
 .doc("When true, enable filter pushdown for ORC files.")
 .booleanConf
@@ -998,6 +1006,8 @@ class SQLConf extends Serializable with Logging {
 
   def useCompression: Boolean = getConf(COMPRESS_CACHED)
 
+  def orcCompressionCodec: String = getConf(ORC_COMPRESSION)
+
   def parquetCompressionCodec: String = getConf(PARQUET_COMPRESSION)
 
   def parquetCacheMetadata: Boolean = getConf(PARQUET_CACHE_METADATA)

http://git-wip-us.apache.org/repos/asf/spark/blob/d8f45408/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index cca9352..07347d2 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -517,9 +517,11 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
*
 

spark git commit: [SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle as well in POM and SparkBuild.scala

2017-09-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 16c4c03c7 -> 64936c14a


[SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle 
as well in POM and SparkBuild.scala

## What changes were proposed in this pull request?

This PR proposes to match the scalastyle version in POM and SparkBuild.scala

## How was this patch tested?

Manual builds.

Author: hyukjinkwon 

Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/64936c14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/64936c14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/64936c14

Branch: refs/heads/master
Commit: 64936c14a7ef30b9eacb129bafe6a1665887bf21
Parents: 16c4c03
Author: hyukjinkwon 
Authored: Wed Sep 6 23:28:12 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 6 23:28:12 2017 +0900

--
 pom.xml  | 2 +-
 project/SparkBuild.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/64936c14/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 09794c1..a051fea 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2463,7 +2463,7 @@
   
 org.scalastyle
 scalastyle-maven-plugin
-0.9.0
+1.0.0
 
   false
   true

http://git-wip-us.apache.org/repos/asf/spark/blob/64936c14/project/SparkBuild.scala
--
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 20848f0..748b1c4 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -123,7 +123,7 @@ object SparkBuild extends PomBuild {
 
   lazy val scalaStyleRules = Project("scalaStyleRules", file("scalastyle"))
 .settings(
-  libraryDependencies += "org.scalastyle" %% "scalastyle" % "0.9.0"
+  libraryDependencies += "org.scalastyle" %% "scalastyle" % "1.0.0"
 )
 
   lazy val scalaStyleOnCompile = taskKey[Unit]("scalaStyleOnCompile")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Fixed pandoc dependency issue in python/setup.py

2017-09-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master fa0092bdd -> aad212547


Fixed pandoc dependency issue in python/setup.py

## Problem Description

When pyspark is listed as a dependency of another package, installing
the other package will cause an install failure in pyspark. When the
other package is being installed, pyspark's setup_requires requirements
are installed including pypandoc. Thus, the exception handling on
setup.py:152 does not work because the pypandoc module is indeed
available. However, the pypandoc.convert() function fails if pandoc
itself is not installed (in our use cases it is not). This raises an
OSError that is not handled, and setup fails.

The following is a sample failure:
```
$ which pandoc
$ pip freeze | grep pypandoc
pypandoc==1.4
$ pip install pyspark
Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
100% 
|████████████████████████████████|
 188.3MB 16.8MB/s
Complete output from command python setup.py egg_info:
Maybe try:

sudo apt-get install pandoc
See http://johnmacfarlane.net/pandoc/installing.html
for installation options
---

Traceback (most recent call last):
  File "", line 1, in 
  File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in 
long_description = pypandoc.convert('README.md', 'rst')
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 69, in convert
outputfile=outputfile, filters=filters)
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 260, in _convert_input
_ensure_pandoc_path()
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 544, in _ensure_pandoc_path
raise OSError("No pandoc was found: either install pandoc and add it\n"
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.


Command "python setup.py egg_info" failed with error code 1 in 
/tmp/pip-build-mfnizcwa/pyspark/
```

## What changes were proposed in this pull request?

This change simply adds an additional exception handler for the OSError
that is raised. This allows pyspark to be installed client-side without 
requiring pandoc to be installed.

## How was this patch tested?

I tested this by building a wheel package of pyspark with the change applied. 
Then, in a clean virtual environment with pypandoc installed but pandoc not 
available on the system, I installed pyspark from the wheel.

Here is the output

```
$ pip freeze | grep pypandoc
pypandoc==1.4
$ which pandoc
$ pip install --no-cache-dir 
../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Processing 
/home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.6 in 
/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from 
pyspark==2.3.0.dev0)
Installing collected packages: pyspark
Successfully installed pyspark-2.3.0.dev0
```

Author: Tucker Beck 

Closes #18981 from 
dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aad21254
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aad21254
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aad21254

Branch: refs/heads/master
Commit: aad2125475dcdeb4a0410392b6706511db17bac4
Parents: fa0092b
Author: Tucker Beck 
Authored: Thu Sep 7 09:38:00 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Sep 7 09:38:00 2017 +0900

--
 python/setup.py | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aad21254/python/setup.py
--
diff --git a/python/setup.py b/python/setup.py
index cfc83c6..02612ff 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -151,6 +151,8 @@ try:
 long_description = pypandoc.convert('README.md', 'rst')
 except ImportError:
 print("Could not import pypandoc - required to package PySpark", 
file=sys.stderr)
+except OSError:
+print("Could not convert - pandoc is not installed", file=sys.stderr)
 
 setup(
 name='pyspark',


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Fixed pandoc dependency issue in python/setup.py

2017-09-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 342cc2a4c -> 49968de52


Fixed pandoc dependency issue in python/setup.py

## Problem Description

When pyspark is listed as a dependency of another package, installing
the other package will cause an install failure in pyspark. When the
other package is being installed, pyspark's setup_requires requirements
are installed including pypandoc. Thus, the exception handling on
setup.py:152 does not work because the pypandoc module is indeed
available. However, the pypandoc.convert() function fails if pandoc
itself is not installed (in our use cases it is not). This raises an
OSError that is not handled, and setup fails.

The following is a sample failure:
```
$ which pandoc
$ pip freeze | grep pypandoc
pypandoc==1.4
$ pip install pyspark
Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
100% 
|████████████████████████████████|
 188.3MB 16.8MB/s
Complete output from command python setup.py egg_info:
Maybe try:

sudo apt-get install pandoc
See http://johnmacfarlane.net/pandoc/installing.html
for installation options
---

Traceback (most recent call last):
  File "", line 1, in 
  File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in 
long_description = pypandoc.convert('README.md', 'rst')
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 69, in convert
outputfile=outputfile, filters=filters)
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 260, in _convert_input
_ensure_pandoc_path()
  File 
"/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
 line 544, in _ensure_pandoc_path
raise OSError("No pandoc was found: either install pandoc and add it\n"
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.


Command "python setup.py egg_info" failed with error code 1 in 
/tmp/pip-build-mfnizcwa/pyspark/
```

## What changes were proposed in this pull request?

This change simply adds an additional exception handler for the OSError
that is raised. This allows pyspark to be installed client-side without 
requiring pandoc to be installed.

## How was this patch tested?

I tested this by building a wheel package of pyspark with the change applied. 
Then, in a clean virtual environment with pypandoc installed but pandoc not 
available on the system, I installed pyspark from the wheel.

Here is the output

```
$ pip freeze | grep pypandoc
pypandoc==1.4
$ which pandoc
$ pip install --no-cache-dir 
../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Processing 
/home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.6 in 
/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from 
pyspark==2.3.0.dev0)
Installing collected packages: pyspark
Successfully installed pyspark-2.3.0.dev0
```

Author: Tucker Beck 

Closes #18981 from 
dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.

(cherry picked from commit aad2125475dcdeb4a0410392b6706511db17bac4)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/49968de5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/49968de5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/49968de5

Branch: refs/heads/branch-2.2
Commit: 49968de526e76a75abafb636cbd5ed84f9a496e9
Parents: 342cc2a
Author: Tucker Beck 
Authored: Thu Sep 7 09:38:00 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Sep 7 09:38:21 2017 +0900

--
 python/setup.py | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/49968de5/python/setup.py
--
diff --git a/python/setup.py b/python/setup.py
index f500354..7e63461 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -151,6 +151,8 @@ try:
 long_description = pypandoc.convert('README.md', 'rst')
 except ImportError:
 print("Could not import pypandoc - required to package PySpark", 
file=sys.stderr)
+except OSError:
+print("Could not convert - pandoc is not installed", file=sys.stderr)
 
 setup(
 name='pyspark',


-
To unsubscribe, e-mail: 

spark git commit: [SPARK-21513][SQL] Allow UDF to_json support converting MapType to json

2017-09-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1a9857476 -> 371e4e205


[SPARK-21513][SQL] Allow UDF to_json support converting MapType to json

## What changes were proposed in this pull request?
UDF to_json currently only supports converting a `StructType` or an `ArrayType` of 
`StructType`s to a JSON output string.
Following the discussion on JIRA SPARK-21513, this change lets `to_json` also 
convert `MapType` and `ArrayType` of `MapType`s to a JSON output string.
This PR is for SQL and Scala API only.
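
A minimal Scala sketch of the new behaviour; the column names and values are 
made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_json

val spark = SparkSession.builder().master("local[*]").appName("to-json-map").getOrCreate()
import spark.implicits._

// A map column can now be serialized directly...
val mapDf = Seq(Tuple1(Map("a" -> 1, "b" -> 2))).toDF("m")
mapDf.select(to_json($"m")).show(false)      // {"a":1,"b":2}

// ...and so can an array of maps.
val arrDf = Seq(Tuple1(Seq(Map("a" -> 1)))).toDF("arr")
arrDf.select(to_json($"arr")).show(false)    // [{"a":1}]
```
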

## How was this patch tested?
Adding unit test case.

cc viirya HyukjinKwon

Author: goldmedal 
Author: Jia-Xuan Liu 

Closes #18875 from goldmedal/SPARK-21513.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/371e4e20
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/371e4e20
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/371e4e20

Branch: refs/heads/master
Commit: 371e4e2053eb7535a27dd71756a3a479aae22306
Parents: 1a98574
Author: goldmedal 
Authored: Wed Sep 13 09:43:00 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 13 09:43:00 2017 +0900

--
 .../catalyst/expressions/jsonExpressions.scala  |  38 -
 .../sql/catalyst/json/JacksonGenerator.scala|  65 +++--
 .../expressions/JsonExpressionsSuite.scala  |  49 ++-
 .../catalyst/json/JacksonGeneratorSuite.scala   | 113 +++
 .../scala/org/apache/spark/sql/functions.scala  |  17 +--
 .../sql-tests/inputs/json-functions.sql |   5 +
 .../sql-tests/results/json-functions.sql.out| 144 ---
 .../apache/spark/sql/JsonFunctionsSuite.scala   |  16 +++
 8 files changed, 378 insertions(+), 69 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/371e4e20/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index ee5da1a..1341631 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
 import org.apache.spark.sql.catalyst.json._
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
-import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, 
BadRecordException, FailFastMode, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, 
BadRecordException, FailFastMode, GenericArrayData, MapData}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 import org.apache.spark.util.Utils
@@ -604,7 +604,8 @@ case class JsonToStructs(
 }
 
 /**
- * Converts a [[StructType]] or [[ArrayType]] of [[StructType]]s to a json 
output string.
+ * Converts a [[StructType]], [[ArrayType]] of [[StructType]]s, [[MapType]]
+ * or [[ArrayType]] of [[MapType]]s to a json output string.
  */
 // scalastyle:off line.size.limit
 @ExpressionDescription(
@@ -617,6 +618,14 @@ case class JsonToStructs(
{"time":"26/08/2015"}
   > SELECT _FUNC_(array(named_struct('a', 1, 'b', 2));
[{"a":1,"b":2}]
+  > SELECT _FUNC_(map('a',named_struct('b',1)));
+   {"a":{"b":1}}
+  > SELECT _FUNC_(map(named_struct('a',1),named_struct('b',2)));
+   {"[1]":{"b":2}}
+  > SELECT _FUNC_(map('a',1));
+   {"a":1}
+  > SELECT _FUNC_(array((map('a',1;
+   [{"a":1}]
   """,
   since = "2.2.0")
 // scalastyle:on line.size.limit
@@ -648,6 +657,8 @@ case class StructsToJson(
   lazy val rowSchema = child.dataType match {
 case st: StructType => st
 case ArrayType(st: StructType, _) => st
+case mt: MapType => mt
+case ArrayType(mt: MapType, _) => mt
   }
 
   // This converts rows to the JSON output according to the given schema.
@@ -669,6 +680,14 @@ case class StructsToJson(
 (arr: Any) =>
   gen.write(arr.asInstanceOf[ArrayData])
   getAndReset()
+  case _: MapType =>
+(map: Any) =>
+  gen.write(map.asInstanceOf[MapData])
+  getAndReset()
+  case ArrayType(_: MapType, _) =>
+(arr: Any) =>
+  gen.write(arr.asInstanceOf[ArrayData])
+  getAndReset()
 }
   }
 
@@ -677,14 +696,25 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
 case _: StructType | 

spark git commit: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 182478e03 -> b1b5a7fdc


[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?
`typeName`  classmethod has been fixed by using type -> typeName map.

## How was this patch tested?
local build

Author: Peter Szalai 

Closes #17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a191c3148498087d751aeeddd683055622)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1b5a7fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1b5a7fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1b5a7fd

Branch: refs/heads/branch-2.2
Commit: b1b5a7fdc0f8fabfb235f0b31bde0f1bfb71591a
Parents: 182478e
Author: Peter Szalai 
Authored: Sun Sep 10 17:47:45 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 10 17:48:00 2017 +0900

--
 python/pyspark/sql/tests.py | 4 
 python/pyspark/sql/types.py | 5 +
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b1b5a7fd/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index a100dc0..39655a5 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -188,6 +188,10 @@ class DataTypeTests(unittest.TestCase):
 row = Row()
 self.assertEqual(len(row), 0)
 
+def test_struct_field_type_name(self):
+struct_field = StructField("a", IntegerType())
+self.assertRaises(TypeError, struct_field.typeName)
+
 
 class SQLTests(ReusedPySparkTestCase):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b1b5a7fd/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 26b54a7..d9206dd 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -438,6 +438,11 @@ class StructField(DataType):
 def fromInternal(self, obj):
 return self.dataType.fromInternal(obj)
 
+def typeName(self):
+raise TypeError(
+"StructField does not have typeName. "
+"Use typeName on its type explicitly instead.")
+
 
 class StructType(DataType):
 """Struct type, consisting of a list of :class:`StructField`.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f76790557 -> 520d92a19


[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?
`typeName`  classmethod has been fixed by using type -> typeName map.

## How was this patch tested?
local build

Author: Peter Szalai 

Closes #17435 from szalai1/datatype-gettype-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/520d92a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/520d92a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/520d92a1

Branch: refs/heads/master
Commit: 520d92a191c3148498087d751aeeddd683055622
Parents: f767905
Author: Peter Szalai 
Authored: Sun Sep 10 17:47:45 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 10 17:47:45 2017 +0900

--
 python/pyspark/sql/tests.py | 4 
 python/pyspark/sql/types.py | 5 +
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/520d92a1/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 4d65abc..6e7ddf9 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -209,6 +209,10 @@ class DataTypeTests(unittest.TestCase):
 row = Row()
 self.assertEqual(len(row), 0)
 
+def test_struct_field_type_name(self):
+struct_field = StructField("a", IntegerType())
+self.assertRaises(TypeError, struct_field.typeName)
+
 
 class SQLTests(ReusedPySparkTestCase):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/520d92a1/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 51bf7be..920cf00 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -440,6 +440,11 @@ class StructField(DataType):
 def fromInternal(self, obj):
 return self.dataType.fromInternal(obj)
 
+def typeName(self):
+raise TypeError(
+"StructField does not have typeName. "
+"Use typeName on its type explicitly instead.")
+
 
 class StructType(DataType):
 """Struct type, consisting of a list of :class:`StructField`.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type

2017-09-09 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8a5eb5068 -> 6b45d7e94


[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of 
key type

## What changes were proposed in this pull request?

`JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. 
For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, 
when converting a map to JSON, we only care about its values and create a 
writer for the values. The keys in a map are treated as strings by calling 
`toString` on the keys.

Thus, we should change `JacksonUtils.verifySchema` to verify the value type of 
`MapType`.
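
A small Scala sketch of the user-visible behaviour this verification is meant 
to match; the column name and values are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("map-value-type").getOrCreate()
import spark.implicits._

// The integer keys are rendered via toString when generating JSON, so only the
// value type has to be JSON-convertible; verifySchema now checks the value type.
val df = Seq(Tuple1(Map(1 -> "x", 2 -> "y"))).toDF("a")
df.select(to_json(struct($"a"))).show(false)   // {"a":{"1":"x","2":"y"}}
```
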

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh 

Closes #19167 from viirya/test-jacksonutils.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b45d7e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b45d7e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b45d7e9

Branch: refs/heads/master
Commit: 6b45d7e941eba8a36be26116787322d9e3ae25d0
Parents: 8a5eb50
Author: Liang-Chi Hsieh 
Authored: Sat Sep 9 19:10:52 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Sep 9 19:10:52 2017 +0900

--
 .../spark/sql/catalyst/json/JacksonUtils.scala  |  4 +++-
 .../expressions/JsonExpressionsSuite.scala  | 23 +++
 .../apache/spark/sql/JsonFunctionsSuite.scala   | 24 +---
 3 files changed, 47 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b45d7e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
index 3b23c6c..134d16e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
@@ -44,7 +44,9 @@ object JacksonUtils {
 
   case at: ArrayType => verifyType(name, at.elementType)
 
-  case mt: MapType => verifyType(name, mt.keyType)
+  // For MapType, its keys are treated as a string (i.e. calling 
`toString`) basically when
+  // generating JSON, so we only care if the values are valid for JSON.
+  case mt: MapType => verifyType(name, mt.valueType)
 
   case udt: UserDefinedType[_] => verifyType(name, udt.sqlType)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6b45d7e9/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
index 9991bda..5de1143 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
@@ -21,6 +21,7 @@ import java.util.Calendar
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.errors.TreeNodeException
 import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils, 
GenericArrayData, PermissiveMode}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
@@ -610,4 +611,26 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   """{"t":"2015-12-31T16:00:00"}"""
 )
   }
+
+  test("to_json: verify MapType's value type instead of key type") {
+// Keys in map are treated as strings when converting to JSON. The type 
doesn't matter at all.
+val mapType1 = MapType(CalendarIntervalType, IntegerType)
+val schema1 = StructType(StructField("a", mapType1) :: Nil)
+val struct1 = Literal.create(null, schema1)
+checkEvaluation(
+  StructsToJson(Map.empty, struct1, gmtId),
+  null
+)
+
+// The value type must be valid for converting to JSON.
+val mapType2 = MapType(IntegerType, CalendarIntervalType)
+val schema2 = StructType(StructField("a", mapType2) :: Nil)
+val struct2 = Literal.create(null, schema2)
+intercept[TreeNodeException[_]] {
+  checkEvaluation(
+StructsToJson(Map.empty, struct2, gmtId),
+null
+  )
+}
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/6b45d7e9/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala

spark git commit: [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type

2017-09-09 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 987682160 -> 182478e03


[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of 
key type

## What changes were proposed in this pull request?

`JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. 
For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, 
when converting a map to JSON, we only care about its values and create a 
writer for the values. The keys in a map are treated as strings by calling 
`toString` on the keys.

Thus, we should change `JacksonUtils.verifySchema` to verify the value type of 
`MapType`.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh 

Closes #19167 from viirya/test-jacksonutils.

(cherry picked from commit 6b45d7e941eba8a36be26116787322d9e3ae25d0)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/182478e0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/182478e0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/182478e0

Branch: refs/heads/branch-2.2
Commit: 182478e030688b602bf95edfd82f700d6f5678d1
Parents: 9876821
Author: Liang-Chi Hsieh 
Authored: Sat Sep 9 19:10:52 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Sep 9 19:11:28 2017 +0900

--
 .../spark/sql/catalyst/json/JacksonUtils.scala  |  4 +++-
 .../expressions/JsonExpressionsSuite.scala  | 23 +++
 .../apache/spark/sql/JsonFunctionsSuite.scala   | 24 +---
 3 files changed, 47 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/182478e0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
index 3b23c6c..134d16e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala
@@ -44,7 +44,9 @@ object JacksonUtils {
 
   case at: ArrayType => verifyType(name, at.elementType)
 
-  case mt: MapType => verifyType(name, mt.keyType)
+  // For MapType, its keys are treated as a string (i.e. calling 
`toString`) basically when
+  // generating JSON, so we only care if the values are valid for JSON.
+  case mt: MapType => verifyType(name, mt.valueType)
 
   case udt: UserDefinedType[_] => verifyType(name, udt.sqlType)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/182478e0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
index f892e80..53b54de 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
@@ -21,6 +21,7 @@ import java.util.Calendar
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.errors.TreeNodeException
 import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils, 
GenericArrayData, PermissiveMode}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
@@ -590,4 +591,26 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   """{"t":"2015-12-31T16:00:00"}"""
 )
   }
+
+  test("to_json: verify MapType's value type instead of key type") {
+// Keys in map are treated as strings when converting to JSON. The type 
doesn't matter at all.
+val mapType1 = MapType(CalendarIntervalType, IntegerType)
+val schema1 = StructType(StructField("a", mapType1) :: Nil)
+val struct1 = Literal.create(null, schema1)
+checkEvaluation(
+  StructsToJson(Map.empty, struct1, gmtId),
+  null
+)
+
+// The value type must be valid for converting to JSON.
+val mapType2 = MapType(IntegerType, CalendarIntervalType)
+val schema2 = StructType(StructField("a", mapType2) :: Nil)
+val struct2 = Literal.create(null, schema2)
+intercept[TreeNodeException[_]] {
+  checkEvaluation(
+StructsToJson(Map.empty, struct2, gmtId),
+null
+  )
+}
+  }
 }


spark git commit: [BUILD][TEST][SPARKR] add sparksubmitsuite to appveyor tests

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6273a711b -> 828fab035


[BUILD][TEST][SPARKR] add sparksubmitsuite to appveyor tests

## What changes were proposed in this pull request?

more file regex

## How was this patch tested?

Jenkins, AppVeyor

Author: Felix Cheung 

Closes #19177 from felixcheung/rmoduletotest.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/828fab03
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/828fab03
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/828fab03

Branch: refs/heads/master
Commit: 828fab03567ecc245a65c4d295a677ce0ba26c19
Parents: 6273a71
Author: Felix Cheung 
Authored: Mon Sep 11 09:32:25 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 11 09:32:25 2017 +0900

--
 appveyor.yml | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/828fab03/appveyor.yml
--
diff --git a/appveyor.yml b/appveyor.yml
index 43dad9b..dc2d81f 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -32,6 +32,7 @@ only_commits:
 - sql/core/src/main/scala/org/apache/spark/sql/api/r/
 - core/src/main/scala/org/apache/spark/api/r/
 - mllib/src/main/scala/org/apache/spark/ml/r/
+- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
 
 cache:
   - C:\Users\appveyor\.m2


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file

2017-09-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master dd7816758 -> 7d0a3ef4c


[SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when 
creating a dataframe from a file

## What changes were proposed in this pull request?

When the `requiredSchema` only contains `_corrupt_record`, the derived 
`actualSchema` is empty and `_corrupt_record` is null for all rows. 
This PR captures the above situation and raises an exception with a reasonable 
workaround message so that users know what happened and how to fix the query.
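
A hedged Scala sketch of the disallowed query and the suggested workaround; 
the schema and input path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("corrupt-record").getOrCreate()
import spark.implicits._

// Hypothetical schema and path; _corrupt_record is declared explicitly.
val schema = new StructType()
  .add("id", IntegerType)
  .add("_corrupt_record", StringType)
val path = "/tmp/input.csv"

// Querying only the corrupt-record column straight from the raw file now fails
// with the AnalysisException added here:
// spark.read.schema(schema).csv(path).select("_corrupt_record").show()

// Suggested workaround: cache (or save) the parsed result first, then query it.
val df = spark.read.schema(schema).csv(path).cache()
df.filter($"_corrupt_record".isNotNull).count()
```
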

## How was this patch tested?

Added unit test in `CSVSuite`.

Author: Jen-Ming Chung 

Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d0a3ef4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d0a3ef4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d0a3ef4

Branch: refs/heads/master
Commit: 7d0a3ef4ced9684457ad6c5924c58b95249419e1
Parents: dd78167
Author: Jen-Ming Chung 
Authored: Tue Sep 12 22:47:12 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Sep 12 22:47:12 2017 +0900

--
 .../datasources/csv/CSVFileFormat.scala | 14 +++
 .../datasources/json/JsonFileFormat.scala   |  2 +-
 .../execution/datasources/csv/CSVSuite.scala| 42 
 3 files changed, 57 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7d0a3ef4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
index a99bdfe..e20977a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
@@ -109,6 +109,20 @@ class CSVFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
   }
 }
 
+if (requiredSchema.length == 1 &&
+  requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
+  throw new AnalysisException(
+"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed 
when the\n" +
+  "referenced columns only include the internal corrupt record 
column\n" +
+  s"(named _corrupt_record by default). For example:\n" +
+  
"spark.read.schema(schema).csv(file).filter($\"_corrupt_record\".isNotNull).count()\n"
 +
+  "and 
spark.read.schema(schema).csv(file).select(\"_corrupt_record\").show().\n" +
+  "Instead, you can cache or save the parsed results and then send the 
same query.\n" +
+  "For example, val df = spark.read.schema(schema).csv(file).cache() 
and then\n" +
+  "df.filter($\"_corrupt_record\".isNotNull).count()."
+  )
+}
+
 (file: PartitionedFile) => {
   val conf = broadcastedHadoopConf.value.value
   val parser = new UnivocityParser(

http://git-wip-us.apache.org/repos/asf/spark/blob/7d0a3ef4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
index b5ed6e4..0862c74 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
@@ -118,7 +118,7 @@ class JsonFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
   throw new AnalysisException(
 "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed 
when the\n" +
 "referenced columns only include the internal corrupt record column\n" 
+
-s"(named ${parsedOptions.columnNameOfCorruptRecord} by default). For 
example:\n" +
+s"(named _corrupt_record by default). For example:\n" +
 
"spark.read.schema(schema).json(file).filter($\"_corrupt_record\".isNotNull).count()\n"
 +
 "and 
spark.read.schema(schema).json(file).select(\"_corrupt_record\").show().\n" +
 "Instead, you can cache or save the parsed results and then send the 
same query.\n" +

http://git-wip-us.apache.org/repos/asf/spark/blob/7d0a3ef4/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

spark git commit: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 211d81beb -> 8acce00ac


[SPARK-22107] Change as to alias in python quickstart

## What changes were proposed in this pull request?

Updated docs so that a line of python in the quick start guide executes. Closes 
#19283
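
For reference, a Scala sketch of the same word-count flow: the Scala API keeps 
`.as("word")`, while the Python API uses `Column.alias` because `as` is a 
reserved word in Python. The README.md path follows the quick start guide; any 
text file works:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().master("local[*]").appName("quickstart-wordcount").getOrCreate()
import spark.implicits._

val textFile = spark.read.textFile("README.md")
val wordCounts = textFile
  .select(explode(split($"value", "\\s+")).as("word"))
  .groupBy("word")
  .count()
wordCounts.show()
```
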

## How was this patch tested?

Existing tests.

Author: John O'Leary 

Closes #19326 from jgoleary/issues/22107.

(cherry picked from commit 20adf9aa1f42353432d356117e655e799ea1290b)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8acce00a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8acce00a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8acce00a

Branch: refs/heads/branch-2.2
Commit: 8acce00acc343bc04a0f5af4ce4717b42c8938da
Parents: 211d81b
Author: John O'Leary 
Authored: Mon Sep 25 09:16:27 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 25 09:16:46 2017 +0900

--
 docs/quick-start.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8acce00a/docs/quick-start.md
--
diff --git a/docs/quick-start.md b/docs/quick-start.md
index c4c5a5a..aac047f 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -153,7 +153,7 @@ This first maps a line to an integer value and aliases it 
as "numWords", creatin
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can 
implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.select(explode(split(textFile.value, 
"\s+")).as("word")).groupBy("word").count()
+>>> wordCounts = textFile.select(explode(split(textFile.value, 
"\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
 Here, we use the `explode` function in `select`, to transform a Dataset of
lines to a Dataset of words, and then combine `groupBy` and `count` to compute
the per-word counts in the file as a DataFrame of 2 columns: "word" and
"count". To collect the word counts in our shell, we can call `collect`:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 576c43fb4 -> 20adf9aa1


[SPARK-22107] Change as to alias in python quickstart

## What changes were proposed in this pull request?

Updated the docs so that the line of Python in the quick start guide executes. Closes #19283

## How was this patch tested?

Existing tests.

Author: John O'Leary 

Closes #19326 from jgoleary/issues/22107.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/20adf9aa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/20adf9aa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/20adf9aa

Branch: refs/heads/master
Commit: 20adf9aa1f42353432d356117e655e799ea1290b
Parents: 576c43f
Author: John O'Leary 
Authored: Mon Sep 25 09:16:27 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 25 09:16:27 2017 +0900

--
 docs/quick-start.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/20adf9aa/docs/quick-start.md
--
diff --git a/docs/quick-start.md b/docs/quick-start.md
index a85e5b2..200b972 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -153,7 +153,7 @@ This first maps a line to an integer value and aliases it 
as "numWords", creatin
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can 
implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.select(explode(split(textFile.value, 
"\s+")).as("word")).groupBy("word").count()
+>>> wordCounts = textFile.select(explode(split(textFile.value, 
"\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
 Here, we use the `explode` function in `select`, to transform a Dataset of
lines to a Dataset of words, and then combine `groupBy` and `count` to compute
the per-word counts in the file as a DataFrame of 2 columns: "word" and
"count". To collect the word counts in our shell, we can call `collect`:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark

2017-09-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ceaec9383 -> 1fdfe6935


[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in 
PySpark

## What changes were proposed in this pull request?
We added a method to the Scala API for creating a `DataFrame` from
`Dataset[String]` storing CSV in
[SPARK-15463](https://issues.apache.org/jira/browse/SPARK-15463), but PySpark
doesn't have `Dataset` to support this feature. Therefore, this PR adds an API to
create a `DataFrame` from an `RDD[String]` storing CSV, which is also consistent
with PySpark's `spark.read.json`.

For example as below
```
>>> rdd = sc.textFile('python/test_support/sql/ages.csv')
>>> df2 = spark.read.csv(rdd)
>>> df2.dtypes
[('_c0', 'string'), ('_c1', 'string')]
```
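
The same API can also consume CSV rows built in memory; a small sketch (assuming an active SparkSession `spark` and SparkContext `sc`; the example schema here is illustrative):

```python
# RDD[String] holding CSV rows, created in memory instead of read from a file.
rdd = sc.parallelize(["Tom,30", "Alice,25"])
df = spark.read.csv(rdd, schema="name STRING, age INT")  # schema is optional
df.show()
```
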
## How was this patch tested?
Added unit test cases.

Author: goldmedal 

Closes #19339 from goldmedal/SPARK-22112.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1fdfe693
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1fdfe693
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1fdfe693

Branch: refs/heads/master
Commit: 1fdfe69352e4d4714c1f06d61d7ad475ce3a7f1f
Parents: ceaec93
Author: goldmedal 
Authored: Wed Sep 27 11:19:45 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 27 11:19:45 2017 +0900

--
 python/pyspark/sql/readwriter.py | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1fdfe693/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index cb847a0..f309291 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -335,7 +335,8 @@ class DataFrameReader(OptionUtils):
 ``inferSchema`` is enabled. To avoid going through the entire data 
once, disable
 ``inferSchema`` option or specify the schema explicitly using 
``schema``.
 
-:param path: string, or list of strings, for input path(s).
+:param path: string, or list of strings, for input path(s),
+ or RDD of Strings storing CSV rows.
 :param schema: an optional :class:`pyspark.sql.types.StructType` for 
the input schema
or a DDL-formatted string (For example ``col0 INT, col1 
DOUBLE``).
 :param sep: sets the single character as a separator for each field 
and value.
@@ -408,6 +409,10 @@ class DataFrameReader(OptionUtils):
 >>> df = spark.read.csv('python/test_support/sql/ages.csv')
 >>> df.dtypes
 [('_c0', 'string'), ('_c1', 'string')]
+>>> rdd = sc.textFile('python/test_support/sql/ages.csv')
+>>> df2 = spark.read.csv(rdd)
+>>> df2.dtypes
+[('_c0', 'string'), ('_c1', 'string')]
 """
 self._set_opts(
 schema=schema, sep=sep, encoding=encoding, quote=quote, 
escape=escape, comment=comment,
@@ -420,7 +425,29 @@ class DataFrameReader(OptionUtils):
 columnNameOfCorruptRecord=columnNameOfCorruptRecord, 
multiLine=multiLine)
 if isinstance(path, basestring):
 path = [path]
-return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+if type(path) == list:
+return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+elif isinstance(path, RDD):
+def func(iterator):
+for x in iterator:
+if not isinstance(x, basestring):
+x = unicode(x)
+if isinstance(x, unicode):
+x = x.encode("utf-8")
+yield x
+keyed = path.mapPartitions(func)
+keyed._bypass_serializer = True
+jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
+# see SPARK-22112
+# There aren't any jvm api for creating a dataframe from rdd 
storing csv.
+# We can do it through creating a jvm dataset firstly and using 
the jvm api
+# for creating a dataframe from dataset storing csv.
+jdataset = self._spark._ssql_ctx.createDataset(
+jrdd.rdd(),
+self._spark._jvm.Encoders.STRING())
+return self._df(self._jreader.csv(jdataset))
+else:
+raise TypeError("path can be only string, list or RDD")
 
 @since(1.5)
 def orc(self, path):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [BUILD] Close stale PRs

2017-09-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f21f6ce99 -> ceaec9383


[BUILD] Close stale PRs

Closes #13794
Closes #18474
Closes #18897
Closes #18978
Closes #19152
Closes #19238
Closes #19295
Closes #19334
Closes #19335
Closes #19347
Closes #19236
Closes #19244
Closes #19300
Closes #19315
Closes #19356
Closes #15009
Closes #18253

Author: hyukjinkwon 

Closes #19348 from HyukjinKwon/stale-prs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ceaec938
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ceaec938
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ceaec938

Branch: refs/heads/master
Commit: ceaec93839d18a20e0cd78b70f3ea71872dce0a4
Parents: f21f6ce
Author: hyukjinkwon 
Authored: Wed Sep 27 09:30:25 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 27 09:30:25 2017 +0900

--

--



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests

2017-09-25 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ce204780e -> d8e825e3b


[SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests

## What changes were proposed in this pull request?

This change disables the use of 0-parameter pandas_udfs because the API is overly
complex and awkward, and the use case can easily be worked around by passing an index
column as an input argument. Also added doctests for pandas_udfs, which revealed bugs
in handling empty partitions and in using the pandas_udf decorator.
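
A sketch of the suggested workaround (assuming pandas and pyarrow are installed and an active SparkSession named `spark`): instead of a 0-parameter pandas_udf, declare one input column and derive the output from it:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# 1-parameter UDF; the input column is used only to size the output series.
ones = pandas_udf(lambda v: pd.Series([1] * len(v)), LongType())

df = spark.range(3)
df.select(col("id"), ones(col("id")).alias("one")).show()
```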

## How was this patch tested?

Reworked existing 0-parameter test to verify error is raised, added doctest for 
pandas_udf, added new tests for empty partition and decorator usage.

Author: Bryan Cutler 

Closes #19325 from BryanCutler/arrow-pandas_udf-0-param-remove-SPARK-22106.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8e825e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8e825e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8e825e3

Branch: refs/heads/master
Commit: d8e825e3bc5fdb8ba00eba431512fa7f771417f1
Parents: ce20478
Author: Bryan Cutler 
Authored: Tue Sep 26 10:54:00 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Sep 26 10:54:00 2017 +0900

--
 python/pyspark/serializers.py   | 15 +
 python/pyspark/sql/functions.py | 32 ---
 python/pyspark/sql/tests.py | 59 +++-
 python/pyspark/worker.py| 25 -
 .../execution/python/ArrowEvalPythonExec.scala  | 10 ++--
 5 files changed, 89 insertions(+), 52 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d8e825e3/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 887c702..7c1fbad 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -216,9 +216,6 @@ class ArrowPandasSerializer(ArrowSerializer):
 Serializes Pandas.Series as Arrow data.
 """
 
-def __init__(self):
-super(ArrowPandasSerializer, self).__init__()
-
 def dumps(self, series):
 """
 Make an ArrowRecordBatch from a Pandas Series and serialize. Input is 
a single series or
@@ -245,16 +242,10 @@ class ArrowPandasSerializer(ArrowSerializer):
 
 def loads(self, obj):
 """
-Deserialize an ArrowRecordBatch to an Arrow table and return as a list 
of pandas.Series
-followed by a dictionary containing length of the loaded batches.
+Deserialize an ArrowRecordBatch to an Arrow table and return as a list 
of pandas.Series.
 """
-import pyarrow as pa
-reader = pa.RecordBatchFileReader(pa.BufferReader(obj))
-batches = [reader.get_batch(i) for i in 
xrange(reader.num_record_batches)]
-# NOTE: a 0-parameter pandas_udf will produce an empty batch that can 
have num_rows set
-num_rows = sum((batch.num_rows for batch in batches))
-table = pa.Table.from_batches(batches)
-return [c.to_pandas() for c in table.itercolumns()] + [{"length": 
num_rows}]
+table = super(ArrowPandasSerializer, self).loads(obj)
+return [c.to_pandas() for c in table.itercolumns()]
 
 def __repr__(self):
 return "ArrowPandasSerializer"

http://git-wip-us.apache.org/repos/asf/spark/blob/d8e825e3/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 46e3a85..63e9a83 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2127,6 +2127,10 @@ class UserDefinedFunction(object):
 def _create_udf(f, returnType, vectorized):
 
 def _udf(f, returnType=StringType(), vectorized=vectorized):
+if vectorized:
+import inspect
+if len(inspect.getargspec(f).args) == 0:
+raise NotImplementedError("0-parameter pandas_udfs are not 
currently supported")
 udf_obj = UserDefinedFunction(f, returnType, vectorized=vectorized)
 return udf_obj._wrapped()
 
@@ -2183,14 +2187,28 @@ def pandas_udf(f=None, returnType=StringType()):
 :param f: python function if used as a standalone function
 :param returnType: a :class:`pyspark.sql.types.DataType` object
 
-# TODO: doctest
+>>> from pyspark.sql.types import IntegerType, StringType
+>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+>>> @pandas_udf(returnType=StringType())
+... def to_upper(s):
+... return s.str.upper()
+...
+>>> @pandas_udf(returnType="integer")
+... def add_one(x):
+...   

spark git commit: [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r

2017-10-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master c6610a997 -> 02c91e03f


[SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of 
lint-r

## What changes were proposed in this pull request?

Currently, we set lintr to jimhester/lintr@a769c0b (see 
[this](https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026)
 and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).

I first tested and checked lintr-1.0.1, but it looks like many important fixes are
missing (for example, the 100-character line length check). So I instead tried the
latest commit,
https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72,
locally and fixed the check failures.

It looks like it has fixed many bugs and now finds many instances that I have
observed from time to time and thought should be caught; I filed [the
results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).

The downside is that it now takes about 7 minutes locally (it was about 2 minutes
before).

## How was this patch tested?

Manually, `./dev/lint-r` after manually updating the lintr package.

Author: hyukjinkwon 
Author: zuotingbing 

Closes #19290 from HyukjinKwon/upgrade-r-lint.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02c91e03
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02c91e03
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02c91e03

Branch: refs/heads/master
Commit: 02c91e03f975c2a6a05a9d5327057bb6b3c4a66f
Parents: c6610a9
Author: hyukjinkwon 
Authored: Sun Oct 1 18:42:45 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Oct 1 18:42:45 2017 +0900

--
 R/pkg/.lintr |   2 +-
 R/pkg/R/DataFrame.R  |  30 ++---
 R/pkg/R/RDD.R|   6 +-
 R/pkg/R/WindowSpec.R |   2 +-
 R/pkg/R/column.R |   2 +
 R/pkg/R/context.R|   2 +-
 R/pkg/R/deserialize.R|   2 +-
 R/pkg/R/functions.R  |  79 +++--
 R/pkg/R/generics.R   |   4 +-
 R/pkg/R/group.R  |   4 +-
 R/pkg/R/mllib_classification.R   | 137 +-
 R/pkg/R/mllib_clustering.R   |  15 +--
 R/pkg/R/mllib_regression.R   |  62 +-
 R/pkg/R/mllib_tree.R |  36 --
 R/pkg/R/pairRDD.R|   4 +-
 R/pkg/R/schema.R |   2 +-
 R/pkg/R/stats.R  |  14 +--
 R/pkg/R/utils.R  |   4 +-
 R/pkg/inst/worker/worker.R   |   2 +-
 R/pkg/tests/fulltests/test_binary_function.R |   2 +-
 R/pkg/tests/fulltests/test_rdd.R |   6 +-
 R/pkg/tests/fulltests/test_sparkSQL.R|  14 +--
 dev/lint-r.R |   4 +-
 23 files changed, 242 insertions(+), 193 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02c91e03/R/pkg/.lintr
--
diff --git a/R/pkg/.lintr b/R/pkg/.lintr
index ae50b28..c83ad2a 100644
--- a/R/pkg/.lintr
+++ b/R/pkg/.lintr
@@ -1,2 +1,2 @@
-linters: with_defaults(line_length_linter(100), multiple_dots_linter = NULL, 
camel_case_linter = NULL, open_curly_linter(allow_single_line = TRUE), 
closed_curly_linter(allow_single_line = TRUE))
+linters: with_defaults(line_length_linter(100), multiple_dots_linter = NULL, 
object_name_linter = NULL, camel_case_linter = NULL, 
open_curly_linter(allow_single_line = TRUE), 
closed_curly_linter(allow_single_line = TRUE))
 exclusions: list("inst/profile/general.R" = 1, "inst/profile/shell.R")

http://git-wip-us.apache.org/repos/asf/spark/blob/02c91e03/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 0728141..176bb3b 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -1923,13 +1923,15 @@ setMethod("[", signature(x = "SparkDataFrame"),
 #' @param i,subset (Optional) a logical expression to filter on rows.
 #' For extract operator [[ and replacement operator [[<-, the 
indexing parameter for
 #' a single Column.
-#' @param j,select expression for the single Column or a list of columns to 
select from the SparkDataFrame.
+#' @param j,select expression for the single Column or a list of columns to 
select from the
+#' SparkDataFrame.
 #' @param drop if TRUE, a Column will be returned if the resulting dataset has 
only one column.
 #' 

spark git commit: [MINOR] Fixed up pandas_udf related docs and formatting

2017-09-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 9244957b5 -> 7bf4da8a3


[MINOR] Fixed up pandas_udf related docs and formatting

## What changes were proposed in this pull request?

Fixed some minor issues with pandas_udf related docs and formatting.

## How was this patch tested?

NA

Author: Bryan Cutler 

Closes #19375 from BryanCutler/arrow-pandas_udf-cleanup-minor.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7bf4da8a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7bf4da8a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7bf4da8a

Branch: refs/heads/master
Commit: 7bf4da8a33c33b03bbfddc698335fe9b86ce1e0e
Parents: 9244957
Author: Bryan Cutler 
Authored: Thu Sep 28 10:24:51 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Sep 28 10:24:51 2017 +0900

--
 python/pyspark/serializers.py   | 6 +++---
 python/pyspark/sql/functions.py | 6 ++
 2 files changed, 5 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7bf4da8a/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index db77b7e..ad18bd0 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -191,7 +191,7 @@ class FramedSerializer(Serializer):
 
 class ArrowSerializer(FramedSerializer):
 """
-Serializes an Arrow stream.
+Serializes bytes as Arrow data with the Arrow file format.
 """
 
 def dumps(self, batch):
@@ -239,7 +239,7 @@ class ArrowStreamPandasSerializer(Serializer):
 
 def dump_stream(self, iterator, stream):
 """
-Make ArrowRecordBatches from Pandas Serieses and serialize. Input is a 
single series or
+Make ArrowRecordBatches from Pandas Series and serialize. Input is a 
single series or
 a list of series accompanied by an optional pyarrow type to coerce the 
data to.
 """
 import pyarrow as pa
@@ -257,7 +257,7 @@ class ArrowStreamPandasSerializer(Serializer):
 
 def load_stream(self, stream):
 """
-Deserialize ArrowRecordBatchs to an Arrow table and return as a list 
of pandas.Series.
+Deserialize ArrowRecordBatches to an Arrow table and return as a list 
of pandas.Series.
 """
 import pyarrow as pa
 reader = pa.open_stream(stream)

http://git-wip-us.apache.org/repos/asf/spark/blob/7bf4da8a/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 63e9a83..b45a59d 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2199,16 +2199,14 @@ def pandas_udf(f=None, returnType=StringType()):
 ...
 >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", 
"age"))
 >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
-... .show() # doctest: +SKIP
+... .show()  # doctest: +SKIP
 +--+--++
 |slen(name)|to_upper(name)|add_one(age)|
 +--+--++
 | 8|  JOHN DOE|  22|
 +--+--++
 """
-wrapped_udf = _create_udf(f, returnType=returnType, vectorized=True)
-
-return wrapped_udf
+return _create_udf(f, returnType=returnType, vectorized=True)
 
 
 blacklist = ['map', 'since', 'ignore_unicode_prefix']


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22130][CORE] UTF8String.trim() scans " " twice

2017-09-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d2b8b63b9 -> 12e740bba


[SPARK-22130][CORE] UTF8String.trim() scans " " twice

## What changes were proposed in this pull request?

This PR allows us to scan a string consisting only of white space (e.g. `" "`)
once, while the current implementation scans it twice (left to right, and then
right to left).

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki 

Closes #19355 from kiszk/SPARK-22130.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12e740bb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12e740bb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12e740bb

Branch: refs/heads/master
Commit: 12e740bba110c6ab017c73c5ef940cce39dd45b7
Parents: d2b8b63
Author: Kazuaki Ishizaki 
Authored: Wed Sep 27 23:19:10 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 27 23:19:10 2017 +0900

--
 .../java/org/apache/spark/unsafe/types/UTF8String.java   | 11 +--
 .../org/apache/spark/unsafe/types/UTF8StringSuite.java   |  3 +++
 2 files changed, 8 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/12e740bb/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index ce4a06b..b0d0c44 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -498,17 +498,16 @@ public final class UTF8String implements 
Comparable, Externalizable,
 
   public UTF8String trim() {
 int s = 0;
-int e = this.numBytes - 1;
 // skip all of the space (0x20) in the left side
 while (s < this.numBytes && getByte(s) == 0x20) s++;
-// skip all of the space (0x20) in the right side
-while (e >= 0 && getByte(e) == 0x20) e--;
-if (s > e) {
+if (s == this.numBytes) {
   // empty string
   return EMPTY_UTF8;
-} else {
-  return copyUTF8String(s, e);
 }
+// skip all of the space (0x20) in the right side
+int e = this.numBytes - 1;
+while (e > s && getByte(e) == 0x20) e--;
+return copyUTF8String(s, e);
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/12e740bb/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
--
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
index 7b03d2c..9b303fa 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
@@ -222,10 +222,13 @@ public class UTF8StringSuite {
 
   @Test
   public void trims() {
+assertEquals(fromString("1"), fromString("1").trim());
+
 assertEquals(fromString("hello"), fromString("  hello ").trim());
 assertEquals(fromString("hello "), fromString("  hello ").trimLeft());
 assertEquals(fromString("  hello"), fromString("  hello ").trimRight());
 
+assertEquals(EMPTY_UTF8, EMPTY_UTF8.trim());
 assertEquals(EMPTY_UTF8, fromString("  ").trim());
 assertEquals(EMPTY_UTF8, fromString("  ").trimLeft());
 assertEquals(EMPTY_UTF8, fromString("  ").trimRight());


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF.

2017-09-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 12e740bba -> 09cbf3df2


[SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF.

## What changes were proposed in this pull request?

Currently we use the Arrow File format to communicate with the Python worker when
invoking a vectorized UDF, but we can use the Arrow Stream format instead.

This PR replaces the Arrow File format with the Arrow Stream format.
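
For context, a self-contained pyarrow-level sketch of the difference (pyarrow APIs of that era; `pa.open_stream` follows the serializer code in this patch, the rest is illustrative): the file format needs the complete buffer before reading, while the stream format lets the reader consume record batches incrementally:

```python
import pandas as pd
import pyarrow as pa

batch = pa.RecordBatch.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))

# Write one batch in the Arrow *stream* format into an in-memory buffer.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# A stream reader can pull batches as they arrive rather than requiring the whole file.
reader = pa.open_stream(pa.BufferReader(sink.getvalue()))
print(reader.read_all().to_pandas())
```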

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN 

Closes #19349 from ueshin/issues/SPARK-22125.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/09cbf3df
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/09cbf3df
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/09cbf3df

Branch: refs/heads/master
Commit: 09cbf3df20efea09c0941499249b7a3b2bf7e9fd
Parents: 12e740b
Author: Takuya UESHIN 
Authored: Wed Sep 27 23:21:44 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Sep 27 23:21:44 2017 +0900

--
 .../org/apache/spark/api/python/PythonRDD.scala | 325 +-
 .../apache/spark/api/python/PythonRunner.scala  | 441 +++
 python/pyspark/serializers.py   |  70 +--
 python/pyspark/worker.py|   4 +-
 .../sql/execution/vectorized/ColumnarBatch.java |   5 +
 .../execution/python/ArrowEvalPythonExec.scala  |  54 ++-
 .../execution/python/ArrowPythonRunner.scala| 181 
 .../execution/python/BatchEvalPythonExec.scala  |   4 +-
 .../sql/execution/python/PythonUDFRunner.scala  | 113 +
 9 files changed, 825 insertions(+), 372 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/09cbf3df/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index 86d0405..f6293c0 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -48,7 +48,7 @@ private[spark] class PythonRDD(
   extends RDD[Array[Byte]](parent) {
 
   val bufferSize = conf.getInt("spark.buffer.size", 65536)
-  val reuse_worker = conf.getBoolean("spark.python.worker.reuse", true)
+  val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
 
   override def getPartitions: Array[Partition] = firstParent.partitions
 
@@ -59,7 +59,7 @@ private[spark] class PythonRDD(
   val asJavaRDD: JavaRDD[Array[Byte]] = JavaRDD.fromRDD(this)
 
   override def compute(split: Partition, context: TaskContext): 
Iterator[Array[Byte]] = {
-val runner = PythonRunner(func, bufferSize, reuse_worker)
+val runner = PythonRunner(func, bufferSize, reuseWorker)
 runner.compute(firstParent.iterator(split, context), split.index, context)
   }
 }
@@ -83,318 +83,9 @@ private[spark] case class PythonFunction(
  */
 private[spark] case class ChainedPythonFunctions(funcs: Seq[PythonFunction])
 
-/**
- * Enumerate the type of command that will be sent to the Python worker
- */
-private[spark] object PythonEvalType {
-  val NON_UDF = 0
-  val SQL_BATCHED_UDF = 1
-  val SQL_PANDAS_UDF = 2
-}
-
-private[spark] object PythonRunner {
-  def apply(func: PythonFunction, bufferSize: Int, reuse_worker: Boolean): 
PythonRunner = {
-new PythonRunner(
-  Seq(ChainedPythonFunctions(Seq(func))),
-  bufferSize,
-  reuse_worker,
-  PythonEvalType.NON_UDF,
-  Array(Array(0)))
-  }
-}
-
-/**
- * A helper class to run Python mapPartition/UDFs in Spark.
- *
- * funcs is a list of independent Python functions, each one of them is a list 
of chained Python
- * functions (from bottom to top).
- */
-private[spark] class PythonRunner(
-funcs: Seq[ChainedPythonFunctions],
-bufferSize: Int,
-reuse_worker: Boolean,
-evalType: Int,
-argOffsets: Array[Array[Int]])
-  extends Logging {
-
-  require(funcs.length == argOffsets.length, "argOffsets should have the same 
length as funcs")
-
-  // All the Python functions should have the same exec, version and envvars.
-  private val envVars = funcs.head.funcs.head.envVars
-  private val pythonExec = funcs.head.funcs.head.pythonExec
-  private val pythonVer = funcs.head.funcs.head.pythonVer
-
-  // TODO: support accumulator in multiple UDF
-  private val accumulator = funcs.head.funcs.head.accumulator
-
-  def compute(
-  inputIterator: Iterator[_],
-  partitionIndex: Int,
-  context: TaskContext): Iterator[Array[Byte]] = {
-val startTime = System.currentTimeMillis
-val env = SparkEnv.get
-val localdir = env.blockManager.diskBlockManager.localDirs.map(f => 
f.getPath()).mkString(",")
-

spark git commit: [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuite` and `HiveDDLSuite`

2017-09-24 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2274d84ef -> 9d48bd0b3


[SPARK-22093][TESTS] Fixes `assume` in `UtilsSuite` and `HiveDDLSuite`

## What changes were proposed in this pull request?

This PR proposes to remove the `assume` in the `Utils.resolveURIs` test case and
replace `assume` with `assert` in the `Utils.resolveURI` test case in `UtilsSuite`.

It looks like `Utils.resolveURIs` supports not only multiple but also single paths
as input, so it is not meaningful to check whether the input contains `,`.

For the `Utils.resolveURI` test, I replaced `assume` with `assert` because it takes
a single path, and in order to prevent future mistakes when adding more tests here.

For the `assume` in `HiveDDLSuite`, it should be `assert` because it is the final
check of the test.

## How was this patch tested?

Fixed unit tests.

Author: hyukjinkwon 

Closes #19332 from HyukjinKwon/SPARK-22093.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d48bd0b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d48bd0b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d48bd0b

Branch: refs/heads/master
Commit: 9d48bd0b34c4b704e29eefd6409f1cf3ed7935d3
Parents: 2274d84
Author: hyukjinkwon 
Authored: Sun Sep 24 17:11:29 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 24 17:11:29 2017 +0900

--
 core/src/test/scala/org/apache/spark/util/UtilsSuite.scala| 3 +--
 .../scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala  | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9d48bd0b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala 
b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
index 05d58d8..2b16cc4 100644
--- a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
@@ -460,7 +460,7 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
   test("resolveURI") {
 def assertResolves(before: String, after: String): Unit = {
   // This should test only single paths
-  assume(before.split(",").length === 1)
+  assert(before.split(",").length === 1)
   def resolve(uri: String): String = Utils.resolveURI(uri).toString
   assert(resolve(before) === after)
   assert(resolve(after) === after)
@@ -488,7 +488,6 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
 
   test("resolveURIs with multiple paths") {
 def assertResolves(before: String, after: String): Unit = {
-  assume(before.split(",").length > 1)
   def resolve(uri: String): String = Utils.resolveURIs(uri)
   assert(resolve(before) === after)
   assert(resolve(after) === after)

http://git-wip-us.apache.org/repos/asf/spark/blob/9d48bd0b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
index ee64bc9..668da5f 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
@@ -676,7 +676,7 @@ class HiveDDLSuite
   |""".stripMargin)
 val newPart = catalog.getPartition(TableIdentifier("boxes"), Map("width" 
-> "4"))
 assert(newPart.storage.serde == Some(expectedSerde))
-assume(newPart.storage.properties.filterKeys(expectedSerdeProps.contains) 
==
+assert(newPart.storage.properties.filterKeys(expectedSerdeProps.contains) 
==
   expectedSerdeProps)
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one

2017-08-24 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 846bc61cf -> 95713eb4f


[SPARK-21804][SQL] json_tuple returns null values within repeated columns 
except the first one

## What changes were proposed in this pull request?

When json_tuple extracts values from JSON, it returns null values within
repeated columns except the first one, as below:

``` scala
scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
'a')""").show()
+---+---++
| c0| c1|  c2|
+---+---++
|  1|  2|null|
+---+---++
```

I think this should be consistent with Hive's implementation:
```
hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
11
```

In this PR, we locate all the matched indices in `fieldNames` instead of only the
first matched index returned by `indexOf`.
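
For illustration, the same query through PySpark is expected to fill both repeated columns after this change (a sketch mirroring the added test, assuming an active SparkSession named `spark`):

```python
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
# +---+---+---+
# | c0| c1| c2|
# +---+---+---+
# |  1|  2|  1|
# +---+---+---+
```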

## How was this patch tested?

Added test in JsonExpressionsSuite.

Author: Jen-Ming Chung 

Closes #19017 from jmchung/SPARK-21804.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95713eb4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95713eb4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95713eb4

Branch: refs/heads/master
Commit: 95713eb4f22de4e16617a605f74a1d6373ed270b
Parents: 846bc61
Author: Jen-Ming Chung 
Authored: Thu Aug 24 19:24:00 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 24 19:24:00 2017 +0900

--
 .../sql/catalyst/expressions/jsonExpressions.scala  | 12 ++--
 .../sql/catalyst/expressions/JsonExpressionsSuite.scala | 10 ++
 2 files changed, 20 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/95713eb4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c375737..ee5da1a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -436,7 +436,8 @@ case class JsonTuple(children: Seq[Expression])
 while (parser.nextToken() != JsonToken.END_OBJECT) {
   if (parser.getCurrentToken == JsonToken.FIELD_NAME) {
 // check to see if this field is desired in the output
-val idx = fieldNames.indexOf(parser.getCurrentName)
+val jsonField = parser.getCurrentName
+var idx = fieldNames.indexOf(jsonField)
 if (idx >= 0) {
   // it is, copy the child tree to the correct location in the output 
row
   val output = new ByteArrayOutputStream()
@@ -447,7 +448,14 @@ case class JsonTuple(children: Seq[Expression])
   generator => copyCurrentStructure(generator, parser)
 }
 
-row(idx) = UTF8String.fromBytes(output.toByteArray)
+val jsonValue = UTF8String.fromBytes(output.toByteArray)
+
+// SPARK-21804: json_tuple returns null values within repeated 
columns
+// except the first one; so that we need to check the remaining 
fields.
+do {
+  row(idx) = jsonValue
+  idx = fieldNames.indexOf(jsonField, idx + 1)
+} while (idx >= 0)
   }
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/95713eb4/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
index 1cd2b4f..9991bda 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
@@ -373,6 +373,16 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   InternalRow(UTF8String.fromString("1"), null, 
UTF8String.fromString("2")))
   }
 
+  test("SPARK-21804: json_tuple returns null values within repeated columns 
except the first one") {
+checkJsonTuple(
+  JsonTuple(Literal("""{"f1": 1, "f2": 2}""") ::
+NonFoldableLiteral("f1") ::
+NonFoldableLiteral("cast(NULL AS STRING)") ::
+NonFoldableLiteral("f1") ::
+Nil),
+  InternalRow(UTF8String.fromString("1"), null, 
UTF8String.fromString("1")))
+  }
+
   val gmtId = 

spark git commit: [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column

2017-08-24 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 95713eb4f -> dc5d34d8d


[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should 
validate input types for column

## What changes were proposed in this pull request?

While preparing to take over https://github.com/apache/spark/pull/16537, I
realised a (I think) better approach: handle the exception in a single place.

This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most
of the functions in `functions.py` and some other APIs use. `_to_java_column`
basically does not work with types other than `pyspark.sql.column.Column`
or string (`str` and `unicode`).

If the input is not a `Column`, it calls `_create_column_from_name`, which calls
`functions.col` in the JVM:

https://github.com/apache/spark/blob/42b9eda80e975d970c3e8da4047b318b83dd269f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L76

And it looks like `col` only has a `String` overload.

So, these should work:

```python
>>> from pyspark.sql.column import _to_java_column, Column
>>> _to_java_column("a")
JavaObject id=o28
>>> _to_java_column(u"a")
JavaObject id=o29
>>> _to_java_column(spark.range(1).id)
JavaObject id=o33
```

whereas these do not:

```python
>>> _to_java_column(1)
```
```
...
py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
...
```

```python
>>> _to_java_column([])
```
```
...
py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
...
```

```python
>>> class A(): pass
>>> _to_java_column(A())
```
```
...
AttributeError: 'A' object has no attribute '_get_object_id'
```

This means that most of the functions using `_to_java_column`, such as `udf` or
`to_json`, and some other APIs throw an exception as below:

```python
>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
```

```
...
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
...
```

```python
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
```

```
...
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
...
```

**After this PR**:

```python
>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
...
```

```
TypeError: Invalid argument, not a string or column: None of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' 
functions.
```

```python
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
```

```
...
TypeError: Invalid argument, not a string or column: None of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' 
functions.
```
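
A small sketch of the pattern the new message points to (assuming an active SparkSession named `spark`): wrap Python literals with `lit`/`struct` instead of passing raw Python objects to column-based APIs:

```python
from pyspark.sql.functions import lit, struct, to_json

df = spark.range(1)
# to_json(None) now raises TypeError; build a proper Column from literals instead.
df.select(to_json(struct(lit(1).alias("a"))).alias("json")).show()
# prints a single row containing the JSON string {"a":1}
```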

## How was this patch tested?

Unit tests added in `python/pyspark/sql/tests.py` and manual tests.

Author: hyukjinkwon 
Author: zero323 

Closes #19027 from HyukjinKwon/SPARK-19165.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc5d34d8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc5d34d8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc5d34d8

Branch: refs/heads/master
Commit: dc5d34d8dcd6526d1dfdac8606661561c7576a62
Parents: 95713eb
Author: hyukjinkwon 
Authored: Thu Aug 24 20:29:03 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 24 20:29:03 2017 +0900

--
 python/pyspark/sql/column.py |  8 +++-
 python/pyspark/sql/tests.py  | 25 +
 2 files changed, 32 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dc5d34d8/python/pyspark/sql/column.py
--
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index b172f38..43b38a2 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -44,8 +44,14 @@ def _create_column_from_name(name):
 def _to_java_column(col):
 if isinstance(col, Column):
 jcol = col._jc
-else:
+elif isinstance(col, basestring):
 jcol = _create_column_from_name(col)
+else:
+raise TypeError(
+"Invalid argument, not a string or column: "
+"{0} of type {1}. "
+"For column literals, use 'lit', 'array', 'struct' or 'create_map' 
"
+"function.".format(col, type(col)))
 return jcol
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/dc5d34d8/python/pyspark/sql/tests.py

spark git commit: [SPARK-21070][PYSPARK] Attempt to update cloudpickle again

2017-08-21 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master c108a5d30 -> 751f51336


[SPARK-21070][PYSPARK] Attempt to update cloudpickle again

## What changes were proposed in this pull request?

Based on https://github.com/apache/spark/pull/18282 by rgbkrk, this PR attempts
to update to the currently released cloudpickle and minimize the difference
between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually
using the stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, 
cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic 
(cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 
(cloudpipe/cloudpickle#32)
* **Remove non-standard `__transient__` check (cloudpipe/cloudpickle#110)** --
while we don't use this internally and have no tests or documentation for its
use, downstream code may use `__transient__` even though it has never been part
of the API; if we merge this we should include a note about it in the release
notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96); see the sketch after this list
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

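A short sketch exercising two of the newly supported cases above, a recursive function defined inside a closure and a logger instance (function names are illustrative and not taken from the patch; assumes `pyspark` is importable):

```python
import logging
import pickle

from pyspark import cloudpickle

def make_fact():
    def fact(n):
        return 1 if n <= 1 else n * fact(n - 1)  # recursive reference inside a closure
    return fact

payload = cloudpickle.dumps((make_fact(), logging.getLogger("example")))
fact, logger = pickle.loads(payload)  # standard unpickling suffices
print(fact(5), logger.name)  # 120 example
```
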
## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on 
their own.

Author: Holden Karau 
Author: Kyle Kelley 

Closes #18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/751f5133
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/751f5133
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/751f5133

Branch: refs/heads/master
Commit: 751f513367ae776c6d6815e1ce138078924872eb
Parents: c108a5d
Author: Kyle Kelley 
Authored: Tue Aug 22 11:17:53 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Aug 22 11:17:53 2017 +0900

--
 python/pyspark/cloudpickle.py | 599 +++--
 1 file changed, 446 insertions(+), 153 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/751f5133/python/pyspark/cloudpickle.py
--
diff --git a/python/pyspark/cloudpickle.py b/python/pyspark/cloudpickle.py
index 389bee7..40e91a2 100644
--- a/python/pyspark/cloudpickle.py
+++ b/python/pyspark/cloudpickle.py
@@ -9,10 +9,10 @@ The goals of it follow:
 It does not include an unpickler, as standard python unpickling suffices.
 
 This module was extracted from the `cloud` package, developed by `PiCloud, Inc.
-`_.
+`_.
 
 Copyright (c) 2012, Regents of the University of California.
-Copyright (c) 2009 `PiCloud, Inc. `_.
+Copyright (c) 2009 `PiCloud, Inc. 
`_.
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
@@ -42,18 +42,19 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 """
 from __future__ import print_function
 
-import operator
-import opcode
-import os
+import dis
+from functools import partial
+import imp
 import io
+import itertools
+import logging
+import opcode
+import operator
 import pickle
 import struct
 import sys
-import types
-from functools import partial
-import itertools
-import dis
 import traceback
+import types
 import weakref
 
 from pyspark.util import _exception_message
@@ -71,6 +72,92 @@ else:
 from io import BytesIO as StringIO
 PY3 = True
 
+
+def _make_cell_set_template_code():
+"""Get the Python compiler to emit LOAD_FAST(arg); STORE_DEREF
+
+Notes
+-
+In Python 3, we could use an easier function:
+
+.. code-block:: python
+
+   def f():
+   cell = None
+
+   def _stub(value):
+   nonlocal cell
+   cell = value
+
+   return _stub
+
+_cell_set_template_code = f()
+
+This function is _only_ a LOAD_FAST(arg); STORE_DEREF, but that is
+invalid syntax on Python 2. If we use this function we also don't need
+to do the weird freevars/cellvars swap below
+"""
+def inner(value):
+lambda: cell  # make ``cell`` a closure so that we get a STORE_DEREF
+cell 

spark git commit: [MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script

2017-08-25 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 522e1f80d -> 3b66b1c44


[MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL 
doc gen script

## What changes were proposed in this pull request?

This PR proposes both:

- Adds information about Javadoc, SQL docs and a few more details to
`docs/README.md`, and a comment related to Javadoc in
`docs/_plugins/copy_api_dirs.rb`.

- Adds some commands so that the script always runs the SQL docs build under the
`./sql` directory (so that `./sql/create-docs.sh` can be run directly from the
root directory).

## How was this patch tested?

Manual tests with `jekyll build` and `./sql/create-docs.sh` in the root 
directory.

Author: hyukjinkwon 

Closes #19019 from HyukjinKwon/minor-doc-build.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b66b1c4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b66b1c4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b66b1c4

Branch: refs/heads/master
Commit: 3b66b1c44060fb0ebf292830b08f71e990779800
Parents: 522e1f8
Author: hyukjinkwon 
Authored: Sat Aug 26 13:56:24 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Aug 26 13:56:24 2017 +0900

--
 docs/README.md | 70 +
 docs/_plugins/copy_api_dirs.rb |  2 +-
 sql/create-docs.sh |  4 +++
 3 files changed, 45 insertions(+), 31 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3b66b1c4/docs/README.md
--
diff --git a/docs/README.md b/docs/README.md
index 866364f..225bb1b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -9,19 +9,22 @@ documentation yourself. Why build it yourself? So that you 
have the docs that co
 whichever version of Spark you currently have checked out of revision control.
 
 ## Prerequisites
-The Spark documentation build uses a number of tools to build HTML docs and 
API docs in Scala,
-Python and R.
+
+The Spark documentation build uses a number of tools to build HTML docs and 
API docs in Scala, Java,
+Python, R and SQL.
 
 You need to have 
[Ruby](https://www.ruby-lang.org/en/documentation/installation/) and
 
[Python](https://docs.python.org/2/using/unix.html#getting-and-installing-the-latest-version-of-python)
 installed. Also install the following libraries:
+
 ```sh
-$ sudo gem install jekyll jekyll-redirect-from pygments.rb
-$ sudo pip install Pygments
-# Following is needed only for generating API docs
-$ sudo pip install sphinx pypandoc mkdocs
-$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)'
+$ sudo gem install jekyll jekyll-redirect-from pygments.rb
+$ sudo pip install Pygments
+# Following is needed only for generating API docs
+$ sudo pip install sphinx pypandoc mkdocs
+$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)'
 ```
+
 (Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to 
replace gem with gem2.0)
 
 ## Generating the Documentation HTML
@@ -32,42 +35,49 @@ the source code and be captured by revision control 
(currently git). This way th
 includes the version of the documentation that is relevant regardless of which 
version or release
 you have checked out or downloaded.
 
-In this directory you will find textfiles formatted using Markdown, with an 
".md" suffix. You can
-read those text files directly if you want. Start with index.md.
+In this directory you will find text files formatted using Markdown, with an 
".md" suffix. You can
+read those text files directly if you want. Start with `index.md`.
 
 Execute `jekyll build` from the `docs/` directory to compile the site. 
Compiling the site with
-Jekyll will create a directory called `_site` containing index.html as well as 
the rest of the
+Jekyll will create a directory called `_site` containing `index.html` as well 
as the rest of the
 compiled files.
 
-$ cd docs
-$ jekyll build
+```sh
+$ cd docs
+$ jekyll build
+```
 
 You can modify the default Jekyll build as follows:
+
 ```sh
-# Skip generating API docs (which takes a while)
-$ SKIP_API=1 jekyll build
-
-# Serve content locally on port 4000
-$ jekyll serve --watch
-
-# Build the site with extra features used on the live page
-$ PRODUCTION=1 jekyll build
+# Skip generating API docs (which takes a while)
+$ SKIP_API=1 jekyll build
+
+# Serve content locally on port 4000
+$ jekyll serve --watch
+
+# Build the site with extra features used on the live page
+$ PRODUCTION=1 jekyll build
 ```
 
-## API Docs (Scaladoc, Sphinx, 

spark git commit: [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build

2017-08-20 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 73e04ecc4 -> 41e0eb71a


[SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL 
documentation build

## What changes were proposed in this pull request?

This PR proposes to install `mkdocs` via `pip install` if it is missing from the
path, mainly to fix Jenkins's documentation build failure in `spark-master-docs`.
See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.

It also adds `mkdocs` as requirements in `docs/README.md`.

## How was this patch tested?

I manually ran `jekyll build` under `docs` directory after manually removing 
`mkdocs` via `pip uninstall mkdocs`.

Also tested this in the same way, but on CentOS Linux release 7.3.1611 (Core),
where I had built Spark a few times but had never built the documentation before
and `mkdocs` was not installed.

```
...
Moving back into docs dir.
Moving to SQL directory and building docs.
Missing mkdocs in your path, trying to install mkdocs for SQL documentation 
generation.
Collecting mkdocs
  Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
100% 
|████████████████████████████████|
 1.2MB 574kB/s
Requirement already satisfied: PyYAML>=3.10 in 
/usr/lib64/python2.7/site-packages (from mkdocs)
Collecting livereload>=2.5.1 (from mkdocs)
  Downloading livereload-2.5.1-py2-none-any.whl
Collecting tornado>=4.1 (from mkdocs)
  Downloading tornado-4.5.1.tar.gz (483kB)
100% 
|████████████████████████████████|
 491kB 1.4MB/s
Collecting Markdown>=2.3.1 (from mkdocs)
  Downloading Markdown-2.6.9.tar.gz (271kB)
100% 
|████████████████████████████████|
 276kB 2.4MB/s
Collecting click>=3.3 (from mkdocs)
  Downloading click-6.7-py2.py3-none-any.whl (71kB)
100% 
|████████████████████████████████|
 71kB 2.8MB/s
Requirement already satisfied: Jinja2>=2.7.1 in 
/usr/lib/python2.7/site-packages (from mkdocs)
Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from 
livereload>=2.5.1->mkdocs)
Requirement already satisfied: backports.ssl_match_hostname in 
/usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
Collecting singledispatch (from tornado>=4.1->mkdocs)
  Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
Collecting certifi (from tornado>=4.1->mkdocs)
  Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
100% 
|████████████████████████████████|
 358kB 2.1MB/s
Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
  Downloading backports_abc-0.5-py2.py3-none-any.whl
Requirement already satisfied: MarkupSafe>=0.23 in 
/usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
Building wheels for collected packages: tornado, Markdown
  Running setup.py bdist_wheel for tornado ... done
  Stored in directory: 
/root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
  Running setup.py bdist_wheel for Markdown ... done
  Stored in directory: 
/root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
Successfully built tornado Markdown
Installing collected packages: singledispatch, certifi, backports-abc, tornado, 
livereload, Markdown, click, mkdocs
Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 
click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO-  Cleaning site directory
INFO-  Building documentation to directory: .../spark/sql/site
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
Source: .../spark/docs
   Destination: .../spark/docs/_site
  Generating...
done.
 Auto-regeneration: disabled. Use --watch to enable.
 ```

Author: hyukjinkwon 

Closes #18984 from HyukjinKwon/sql-doc-mkdocs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e0eb71
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e0eb71
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e0eb71

Branch: refs/heads/master
Commit: 41e0eb71a63140c9a44a7d2f32821f02abd62367
Parents: 73e04ec
Author: hyukjinkwon 
Authored: Sun Aug 20 19:48:04 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Aug 20 19:48:04 2017 +0900

--
 docs/README.md | 2 +-
 sql/create-docs.sh | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
--



spark git commit: [SPARK-21764][TESTS] Fix tests failures on Windows: resources not being closed and incorrect paths

2017-08-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 734ed7a7b -> b30a11a6a


[SPARK-21764][TESTS] Fix tests failures on Windows: resources not being closed 
and incorrect paths

## What changes were proposed in this pull request?

`org.apache.spark.deploy.RPackageUtilsSuite`

```
 - jars without manifest return false *** FAILED *** (109 milliseconds)
   java.io.IOException: Unable to delete file: 
C:\projects\spark\target\tmp\1500266936418-0\dep1-c.jar
```

`org.apache.spark.deploy.SparkSubmitSuite`

```
 - download one file to local *** FAILED *** (16 milliseconds)
   java.net.URISyntaxException: Illegal character in authority at index 6: 
s3a://C:\projects\spark\target\tmp\test2630198944759847458.jar

 - download list of files to local *** FAILED *** (0 milliseconds)
   java.net.URISyntaxException: Illegal character in authority at index 6: 
s3a://C:\projects\spark\target\tmp\test2783551769392880031.jar
```

`org.apache.spark.scheduler.ReplayListenerSuite`

```
 - Replay compressed inprogress log file succeeding on partial read (156 
milliseconds)
   Exception encountered when attempting to run a suite with class name:
   org.apache.spark.scheduler.ReplayListenerSuite *** ABORTED *** (1 second, 
391 milliseconds)
   java.io.IOException: Failed to delete: 
C:\projects\spark\target\tmp\spark-8f3cacd6-faad-4121-b901-ba1bba8025a0

 - End-to-end replay *** FAILED *** (62 milliseconds)
   java.io.IOException: No FileSystem for scheme: C

 - End-to-end replay with compression *** FAILED *** (110 milliseconds)
   java.io.IOException: No FileSystem for scheme: C
```

`org.apache.spark.sql.hive.StatisticsSuite`

```
 - SPARK-21079 - analyze table with location different than that of individual 
partitions *** FAILED *** (875 milliseconds)
   org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);

 - SPARK-21079 - analyze partitioned table with only a subset of partitions 
visible *** FAILED *** (47 milliseconds)
   org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);
```
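
Several of the failures above stem from raw Windows paths being handed to URI-based APIs. As a hedged illustration only (plain Python, not the Scala fix in this PR), this shows why a drive letter gets misread as a URI scheme and how going through a proper file URI avoids it:

```python
from urllib.parse import urlparse
from pathlib import PureWindowsPath

raw = r"C:\projects\spark\target\tmp\test.jar"  # shape of the failing paths above

# Parsed as a URI, the drive letter becomes the "scheme", which is what
# "No FileSystem for scheme: C" is complaining about.
print(urlparse(raw).scheme)   # 'c'

# Converting to a file URI first keeps the scheme and path well-formed.
uri = PureWindowsPath(raw).as_uri()
print(uri)                    # file:///C:/projects/spark/target/tmp/test.jar
print(urlparse(uri).scheme)   # 'file'
```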

**Note:** this PR does not fix:

`org.apache.spark.deploy.SparkSubmitSuite`

```
 - launch simple application with spark-submit with redaction *** FAILED *** 
(172 milliseconds)
   java.util.NoSuchElementException: next on empty iterator
```

I can't reproduce this on my Windows machine, but it appears to fail consistently 
on AppVeyor. This one is unclear to me yet and hard to debug, so I did 
not include it for now.

**Note:** it looks like there are more instances, but they are hard to identify, 
partly due to flakiness and partly due to the volume of logs and errors. I will 
probably make one more pass if that is fine.

## How was this patch tested?

Manually via AppVeyor:

**Before**

- `org.apache.spark.deploy.RPackageUtilsSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/8t8ra3lrljuir7q4
- `org.apache.spark.deploy.SparkSubmitSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/taquy84yudjjen64
- `org.apache.spark.scheduler.ReplayListenerSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/24omrfn2k0xfa9xq
- `org.apache.spark.sql.hive.StatisticsSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/2079y1plgj76dc9l

**After**

- `org.apache.spark.deploy.RPackageUtilsSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/3803dbfn89ne1164
- `org.apache.spark.deploy.SparkSubmitSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/m5l350dp7u9a4xjr
- `org.apache.spark.scheduler.ReplayListenerSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/565vf74pp6bfdk18
- `org.apache.spark.sql.hive.StatisticsSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/qm78tsk8c37jb6s4

Jenkins tests are required and AppVeyor tests will be triggered.

Author: hyukjinkwon 

Closes #18971 from HyukjinKwon/windows-fixes.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b30a11a6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b30a11a6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b30a11a6

Branch: refs/heads/master
Commit: b30a11a6acf4b1512b5759f21ae58e69662ba455
Parents: 734ed7a
Author: hyukjinkwon 
Authored: Wed Aug 30 21:35:52 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Aug 30 21:35:52 2017 +0900

--
 .../spark/deploy/RPackageUtilsSuite.scala   |  7 +--
 .../apache/spark/deploy/SparkSubmitSuite.scala  |  4 +-
 

spark-website git commit: Update committer page

2017-08-29 Thread gurwls223
Repository: spark-website
Updated Branches:
  refs/remotes/apache/asf-site [created] 1895d5cb0


Update committer page


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/1895d5cb
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/1895d5cb
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/1895d5cb

Branch: refs/remotes/apache/asf-site
Commit: 1895d5cb0cf79a507e4f14f626de585aa7b2534b
Parents: 35eb147
Author: jerryshao 
Authored: Tue Aug 29 16:31:38 2017 +0800
Committer: jerryshao 
Committed: Tue Aug 29 16:57:48 2017 +0800

--
 committers.md| 1 +
 site/committers.html | 4 
 2 files changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/1895d5cb/committers.md
--
diff --git a/committers.md b/committers.md
index 040d419..54aac4d 100644
--- a/committers.md
+++ b/committers.md
@@ -50,6 +50,7 @@ navigation:
 |Josh Rosen|Databricks|
 |Sandy Ryza|Remix|
 |Kousuke Saruta|NTT Data|
+|Saisai Shao|Hortonworks|
 |Prashant Sharma|IBM|
 |Ram Sriharsha|Databricks|
 |DB Tsai|Netflix|

http://git-wip-us.apache.org/repos/asf/spark-website/blob/1895d5cb/site/committers.html
--
diff --git a/site/committers.html b/site/committers.html
index 4ca12ce..770487c 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -365,6 +365,10 @@
   NTT Data
 
 
+  Saisai Shao
+  Hortonworks
+
+
   Prashant Sharma
   IBM
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R

2017-09-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master acb7fed23 -> 07fd68a29


[SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R

## What changes were proposed in this pull request?

This PR proposes to add a wrapper for the `unionByName` API to R and Python as well.

**Python**

```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```

```
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
```
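
For contrast, a small sketch (reusing the frames above; `spark` is assumed to be an existing `SparkSession`) of how the existing positional `union` differs from `unionByName`:

```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

# Positional union: df2's row is appended as-is under df1's column names,
# so the second row shows up as (4, 5, 6).
df1.union(df2).show()

# unionByName: columns are matched by name, so the row is realigned to (6, 4, 5).
df1.unionByName(df2).show()
```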

**R**

```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```

```
  carb am gear
1    4  1    4
2    4  1    4
3    4  1    4
4    4  1    4
```

## How was this patch tested?

Doctests for Python and unit test added in `test_sparkSQL.R` for R.

Author: hyukjinkwon 

Closes #19105 from HyukjinKwon/unionByName-r-python.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07fd68a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07fd68a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07fd68a2

Branch: refs/heads/master
Commit: 07fd68a29fb6cad960b5ac72718bb05decf28a1a
Parents: acb7fed
Author: hyukjinkwon 
Authored: Sun Sep 3 21:03:21 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 3 21:03:21 2017 +0900

--
 R/pkg/NAMESPACE   |  1 +
 R/pkg/R/DataFrame.R   | 38 --
 R/pkg/R/generics.R|  4 
 R/pkg/tests/fulltests/test_sparkSQL.R |  9 ++-
 python/pyspark/sql/dataframe.py   | 28 +++---
 5 files changed, 74 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07fd68a2/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index a1dd1af..3fc756b 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -169,6 +169,7 @@ exportMethods("arrange",
   "transform",
   "union",
   "unionAll",
+  "unionByName",
   "unique",
   "unpersist",
   "where",

http://git-wip-us.apache.org/repos/asf/spark/blob/07fd68a2/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 80526cd..1b46c1e 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2683,7 +2683,7 @@ generateAliasesForIntersectedCols <- function (x, 
intersectedColNames, suffix) {
 #' @rdname union
 #' @name union
 #' @aliases union,SparkDataFrame,SparkDataFrame-method
-#' @seealso \link{rbind}
+#' @seealso \link{rbind} \link{unionByName}
 #' @export
 #' @examples
 #'\dontrun{
@@ -2714,6 +2714,40 @@ setMethod("unionAll",
 union(x, y)
   })
 
+#' Return a new SparkDataFrame containing the union of rows, matched by column 
names
+#'
+#' Return a new SparkDataFrame containing the union of rows in this 
SparkDataFrame
+#' and another SparkDataFrame. This is different from \code{union} function, 
and both
+#' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are 
not taken
+#' into account. Input SparkDataFrames can have different data types in the 
schema.
+#'
+#' Note: This does not remove duplicate rows across the two SparkDataFrames.
+#' This function resolves columns by name (not by position).
+#'
+#' @param x A SparkDataFrame
+#' @param y A SparkDataFrame
+#' @return A SparkDataFrame containing the result of the union.
+#' @family SparkDataFrame functions
+#' @rdname unionByName
+#' @name unionByName
+#' @aliases unionByName,SparkDataFrame,SparkDataFrame-method
+#' @seealso \link{rbind} \link{union}
+#' @export
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
+#' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
+#' head(unionByName(df1, df2))
+#' }
+#' @note unionByName since 2.3.0
+setMethod("unionByName",
+  signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+  function(x, y) {
+unioned <- callJMethod(x@sdf, "unionByName", y@sdf)
+dataFrame(unioned)
+  })
+
 #' Union two or more SparkDataFrames
 #'
 #' Union two or more SparkDataFrames by row. As in R's \code{rbind}, this 
method
@@ -2730,7 +2764,7 @@ setMethod("unionAll",
 #' @aliases rbind,SparkDataFrame-method
 #' @rdname rbind
 #' @name rbind
-#' @seealso \link{union}
+#' @seealso \link{union} \link{unionByName}
 #' @export
 #' @examples
 #'\dontrun{


spark git commit: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python

2017-08-31 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f5e10a34e -> 5cd8ea99f


[SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python

## What changes were proposed in this pull request?

This PR makes `DataFrame.sample(...)` able to omit `withReplacement`, defaulting to 
`False`, consistent with the equivalent Scala / Java API.

In short, the following examples are allowed:

```python
>>> df = spark.range(10)
>>> df.sample(0.5).count()
7
>>> df.sample(fraction=0.5).count()
3
>>> df.sample(0.5, seed=42).count()
5
>>> df.sample(fraction=0.5, seed=42).count()
5
```

In addition, this PR also adds some type-checking logic, as below:

```python
>>> df = spark.range(10)
>>> df.sample().count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [].
>>> df.sample(True).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [].
>>> df.sample(42).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [].
>>> df.sample(fraction=False, seed="a").count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [, ].
>>> df.sample(seed=[1]).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [].
>>> df.sample(withReplacement="a", fraction=0.5, seed=1)
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) 
should be a bool, float and number; however, got [, , 
].
```
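
The dispatch behind this can be summarised with a standalone sketch (this is not the actual PySpark implementation, whose diff is only partially shown below; the helper name is made up for illustration):

```python
def _resolve_sample_args(withReplacement=None, fraction=None, seed=None):
    # A lone float positional argument is treated as the fraction,
    # so sample(0.5) behaves like sample(fraction=0.5).
    if isinstance(withReplacement, float) and fraction is None:
        withReplacement, fraction = None, withReplacement
    if not isinstance(fraction, float):
        raise TypeError("fraction is required and should be a float; got %r" % (fraction,))
    if seed is not None and not isinstance(seed, int):
        raise TypeError("seed should be a number; got %r" % (seed,))
    return bool(withReplacement), fraction, seed

print(_resolve_sample_args(0.5))             # (False, 0.5, None)
print(_resolve_sample_args(True, 0.5, 42))   # (True, 0.5, 42)
```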

## How was this patch tested?

Manually tested; unit tests were added as doctests, and the built 
documentation for Python was checked manually.

Author: hyukjinkwon 

Closes #18999 from HyukjinKwon/SPARK-21779.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5cd8ea99
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5cd8ea99
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5cd8ea99

Branch: refs/heads/master
Commit: 5cd8ea99f084bee40ee18a0c8e33d0ca0aa6bb60
Parents: f5e10a3
Author: hyukjinkwon 
Authored: Fri Sep 1 13:01:23 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Sep 1 13:01:23 2017 +0900

--
 python/pyspark/sql/dataframe.py | 64 +---
 python/pyspark/sql/tests.py | 18 ++
 .../scala/org/apache/spark/sql/Dataset.scala|  3 +-
 3 files changed, 77 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5cd8ea99/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index d1b2a9c..c19e599 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -659,19 +659,69 @@ class DataFrame(object):
 return DataFrame(self._jdf.distinct(), self.sql_ctx)
 
 @since(1.3)
-def sample(self, withReplacement, fraction, seed=None):
+def sample(self, withReplacement=None, fraction=None, seed=None):
 """Returns a sampled subset of this :class:`DataFrame`.
 
+:param withReplacement: Sample with replacement or not (default False).
+:param fraction: Fraction of rows to generate, range [0.0, 1.0].
+:param seed: Seed for sampling (default a random seed).
+
 .. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
 count of the given :class:`DataFrame`.
 
->>> df.sample(False, 0.5, 42).count()
-2
+.. note:: `fraction` is required and, `withReplacement` and `seed` are 
optional.
+
+>>> df = spark.range(10)
+>>> df.sample(0.5, 3).count()
+4
+>>> df.sample(fraction=0.5, seed=3).count()
+4
+>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
+1
+>>> df.sample(1.0).count()
+10
+>>> df.sample(fraction=1.0).count()
+10
+>>> df.sample(False, fraction=1.0).count()
+10
 """
-assert fraction >= 0.0, "Negative fraction value: %s" % fraction
-seed = seed if seed is not None else random.randint(0, sys.maxsize)
-rdd = self._jdf.sample(withReplacement, fraction, long(seed))
-return DataFrame(rdd, self.sql_ctx)
+
+# For the cases below:
+#   sample(True, 0.5 [, seed])
+#   sample(True, fraction=0.5 [, seed])
+#   sample(withReplacement=False, fraction=0.5 [, seed])
+is_withReplacement_set = \
+type(withReplacement) == 

spark git commit: [SPARK-21789][PYTHON] Remove obsolete codes for parsing abstract schema strings

2017-08-31 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 5cd8ea99f -> 648a8626b


[SPARK-21789][PYTHON] Remove obsolete codes for parsing abstract schema strings

## What changes were proposed in this pull request?

This PR proposes to remove private functions that appear to be unused in the main 
code: `_split_schema_abstract`, `_parse_field_abstract`, 
`_parse_schema_abstract` and `_infer_schema_type`.

## How was this patch tested?

Existing tests.

Author: hyukjinkwon 

Closes #18647 from HyukjinKwon/remove-abstract.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/648a8626
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/648a8626
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/648a8626

Branch: refs/heads/master
Commit: 648a8626b82d27d84db3e48bccfd73d020828586
Parents: 5cd8ea9
Author: hyukjinkwon 
Authored: Fri Sep 1 13:09:24 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Sep 1 13:09:24 2017 +0900

--
 python/pyspark/sql/tests.py |  10 ---
 python/pyspark/sql/types.py | 129 ---
 2 files changed, 139 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/648a8626/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index a2a3ceb..3d87ccf 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -894,16 +894,6 @@ class SQLTests(ReusedPySparkTestCase):
 
 self.assertEqual((126, -127, -32767, 32766, 2147483646, 2.5), tuple(r))
 
-from pyspark.sql.types import _parse_schema_abstract, 
_infer_schema_type
-rdd = self.sc.parallelize([(127, -32768, 1.0, datetime(2010, 1, 1, 1, 
1, 1),
-{"a": 1}, (2,), [1, 2, 3])])
-abstract = "byte1 short1 float1 time1 map1{} struct1(b) list1[]"
-schema = _parse_schema_abstract(abstract)
-typedSchema = _infer_schema_type(rdd.first(), schema)
-df = self.spark.createDataFrame(rdd, typedSchema)
-r = (127, -32768, 1.0, datetime(2010, 1, 1, 1, 1, 1), {"a": 1}, 
Row(b=2), [1, 2, 3])
-self.assertEqual(r, tuple(df.first()))
-
 def test_struct_in_map(self):
 d = [Row(m={Row(i=1): Row(s="")})]
 df = self.sc.parallelize(d).toDF()

http://git-wip-us.apache.org/repos/asf/spark/blob/648a8626/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index ecb8eb9..51bf7be 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1187,135 +1187,6 @@ def _create_converter(dataType):
 return convert_struct
 
 
-def _split_schema_abstract(s):
-"""
-split the schema abstract into fields
-
->>> _split_schema_abstract("a b  c")
-['a', 'b', 'c']
->>> _split_schema_abstract("a(a b)")
-['a(a b)']
->>> _split_schema_abstract("a b[] c{a b}")
-['a', 'b[]', 'c{a b}']
->>> _split_schema_abstract(" ")
-[]
-"""
-
-r = []
-w = ''
-brackets = []
-for c in s:
-if c == ' ' and not brackets:
-if w:
-r.append(w)
-w = ''
-else:
-w += c
-if c in _BRACKETS:
-brackets.append(c)
-elif c in _BRACKETS.values():
-if not brackets or c != _BRACKETS[brackets.pop()]:
-raise ValueError("unexpected " + c)
-
-if brackets:
-raise ValueError("brackets not closed: %s" % brackets)
-if w:
-r.append(w)
-return r
-
-
-def _parse_field_abstract(s):
-"""
-Parse a field in schema abstract
-
->>> _parse_field_abstract("a")
-StructField(a,NullType,true)
->>> _parse_field_abstract("b(c d)")
-StructField(b,StructType(...c,NullType,true),StructField(d...
->>> _parse_field_abstract("a[]")
-StructField(a,ArrayType(NullType,true),true)
->>> _parse_field_abstract("a{[]}")
-StructField(a,MapType(NullType,ArrayType(NullType,true),true),true)
-"""
-if set(_BRACKETS.keys()) & set(s):
-idx = min((s.index(c) for c in _BRACKETS if c in s))
-name = s[:idx]
-return StructField(name, _parse_schema_abstract(s[idx:]), True)
-else:
-return StructField(s, NullType(), True)
-
-
-def _parse_schema_abstract(s):
-"""
-parse abstract into schema
-
->>> _parse_schema_abstract("a b  c")
-StructType...a...b...c...
->>> _parse_schema_abstract("a[b c] b{}")
-StructType...a,ArrayType...b...c...b,MapType...
->>> _parse_schema_abstract("c{} d{a b}")
-StructType...c,MapType...d,MapType...a...b...
->>> 

spark git commit: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 4e7a29efd -> 7f3c6ff4f


[SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

## What changes were proposed in this pull request?

1.0.0 fixes an issue with import order, explicit type for public methods, line 
length limitation and comment validation:

```
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16:
 Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49:
 File line length exceeds 100 characters
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21:
 Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2:
 Insert a space after the start of the comment
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43:
 JavaDStream should come before JavaDStreamLike.
```

This PR also fixes the workaround added in SPARK-16877 for the 
`org.scalastyle.scalariform.OverrideJavaChecker` feature, added in 0.9.0.

## How was this patch tested?

Manually tested.

Author: hyukjinkwon 

Closes #19116 from HyukjinKwon/scalastyle-1.0.0.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f3c6ff4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f3c6ff4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f3c6ff4

Branch: refs/heads/master
Commit: 7f3c6ff4ff0a501cc7f1fb53a90ea7b5787f68e1
Parents: 4e7a29e
Author: hyukjinkwon 
Authored: Tue Sep 5 19:40:05 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Sep 5 19:40:05 2017 +0900

--
 project/SparkBuild.scala|  5 +++--
 project/plugins.sbt |  3 +--
 .../src/main/scala/org/apache/spark/repl/Main.scala |  2 ++
 .../main/scala/org/apache/spark/repl/SparkILoop.scala   |  5 -
 scalastyle-config.xml   |  5 +
 .../java/org/apache/spark/streaming/JavaTestUtils.scala | 12 ++--
 6 files changed, 17 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7f3c6ff4/project/SparkBuild.scala
--
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 9d903ed..20848f0 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -163,14 +163,15 @@ object SparkBuild extends PomBuild {
 val configUrlV = scalastyleConfigUrl.in(config).value
 val streamsV = streams.in(config).value
 val failOnErrorV = true
+val failOnWarningV = false
 val scalastyleTargetV = scalastyleTarget.in(config).value
 val configRefreshHoursV = scalastyleConfigRefreshHours.in(config).value
 val targetV = target.in(config).value
 val configCacheFileV = scalastyleConfigUrlCacheFile.in(config).value
 
 logger.info(s"Running scalastyle on ${name.value} in ${config.name}")
-Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, 
scalaSourceV, scalastyleTargetV,
-  streamsV, configRefreshHoursV, targetV, configCacheFileV)
+Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, 
failOnWarningV, scalaSourceV,
+  scalastyleTargetV, streamsV, configRefreshHoursV, targetV, 
configCacheFileV)
 
 Set.empty
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/7f3c6ff4/project/plugins.sbt
--
diff --git a/project/plugins.sbt b/project/plugins.sbt
index f67e0a1..3c5442b 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -7,8 +7,7 @@ addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % 
"5.1.0")
 // sbt 1.0.0 support: 
https://github.com/jrudolph/sbt-dependency-graph/issues/134
 

spark git commit: [SPARK-20886][CORE] HadoopMapReduceCommitProtocol to handle FileOutputCommitter.getWorkPath==null

2017-08-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3d0e17424 -> e47f48c73


[SPARK-20886][CORE] HadoopMapReduceCommitProtocol to handle 
FileOutputCommitter.getWorkPath==null

## What changes were proposed in this pull request?

Handles the situation where a `FileOutputCommitter.getWorkPath()` returns 
`null` by downgrading to the supplied `path` argument.

The existing code does an `Option(workPath.toString).getOrElse(path)`, which 
triggers an NPE in the `toString()` operation if the workPath is null. The code 
was apparently meant to handle this (hence the `getOrElse()` clause), but as the 
NPE has already occurred at that point, the else clause never gets invoked.
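
A hedged Python analogue of the same eager-evaluation trap, for illustration only (the actual fix is the one-line Scala change in the diff below):

```python
work_path = None        # stands in for getWorkPath() returning null
path = "/tmp/fallback"

# Buggy shape: the method call on the possibly-missing value runs before any
# fallback can apply, just as workPath.toString throws before getOrElse runs.
try:
    staging = work_path.rstrip("/") or path
except AttributeError as error:
    print("fails before the fallback:", error)

# Fixed shape: test for the missing value first, then derive the string.
staging = work_path.rstrip("/") if work_path is not None else path
print(staging)          # /tmp/fallback
```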

## How was this patch tested?

Manually, with some later code review.

Author: Steve Loughran 

Closes #18111 from steveloughran/cloud/SPARK-20886-committer-NPE.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e47f48c7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e47f48c7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e47f48c7

Branch: refs/heads/master
Commit: e47f48c737052564e92903de16ff16707fae32c3
Parents: 3d0e174
Author: Steve Loughran 
Authored: Wed Aug 30 13:03:30 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Aug 30 13:03:30 2017 +0900

--
 .../apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala  | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e47f48c7/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index 22e2679..b1d07ab 100644
--- 
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ 
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -73,7 +73,8 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
 
 val stagingDir: String = committer match {
   // For FileOutputCommitter it has its own staging path called "work 
path".
-  case f: FileOutputCommitter => 
Option(f.getWorkPath.toString).getOrElse(path)
+  case f: FileOutputCommitter =>
+Option(f.getWorkPath).map(_.toString).getOrElse(path)
   case _ => path
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python row with empty bytearray

2017-08-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 4482ff23a -> ecf437a64


[SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python 
row with empty bytearray

## What changes were proposed in this pull request?

`PickleException` is thrown when creating dataframe from python row with empty 
bytearray

spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: 
{"abc": x.xx})).show()

net.razorvine.pickle.PickleException: invalid pickle data for bytearray; 
expected 1 or 2 args, got 0
at 
net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java
...

`ByteArrayConstructor` doesn't deal with an empty byte array pickled by Python 3.
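
For reference, a small standard-library sketch (no Spark required) to inspect what Python 3 actually emits here; the empty bytearray pickles down to a constructor call with no arguments, which is the `expected 1 or 2 args, got 0` case above:

```python
import pickle
import pickletools

# Disassemble the pickle stream of an empty bytearray (protocol 2, as used for
# Python 2/3 compatible pickles); the constructor is invoked with an empty tuple.
pickletools.dis(pickle.dumps(bytearray(b""), protocol=2))
```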

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh 

Closes #19085 from viirya/SPARK-21534.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ecf437a6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ecf437a6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ecf437a6

Branch: refs/heads/master
Commit: ecf437a64874a31328f4e28c6b24f37557fbe07d
Parents: 4482ff2
Author: Liang-Chi Hsieh 
Authored: Thu Aug 31 12:55:38 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Aug 31 12:55:38 2017 +0900

--
 .../scala/org/apache/spark/api/python/SerDeUtil.scala | 14 ++
 python/pyspark/sql/tests.py   |  4 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ecf437a6/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala 
b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
index aaf8e7a..01e64b6 100644
--- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
@@ -35,6 +35,16 @@ import org.apache.spark.rdd.RDD
 
 /** Utilities for serialization / deserialization between Python and Java, 
using Pickle. */
 private[spark] object SerDeUtil extends Logging {
+  class ByteArrayConstructor extends 
net.razorvine.pickle.objects.ByteArrayConstructor {
+override def construct(args: Array[Object]): Object = {
+  // Deal with an empty byte array pickled by Python 3.
+  if (args.length == 0) {
+Array.emptyByteArray
+  } else {
+super.construct(args)
+  }
+}
+  }
   // Unpickle array.array generated by Python 2.6
   class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor 
{
 //  /* Description of types */
@@ -108,6 +118,10 @@ private[spark] object SerDeUtil extends Logging {
 synchronized{
   if (!initialized) {
 Unpickler.registerConstructor("array", "array", new ArrayConstructor())
+Unpickler.registerConstructor("__builtin__", "bytearray", new 
ByteArrayConstructor())
+Unpickler.registerConstructor("builtins", "bytearray", new 
ByteArrayConstructor())
+Unpickler.registerConstructor("__builtin__", "bytes", new 
ByteArrayConstructor())
+Unpickler.registerConstructor("_codecs", "encode", new 
ByteArrayConstructor())
 initialized = true
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/ecf437a6/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 1ecde68..b310285 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -2383,9 +2383,11 @@ class SQLTests(ReusedPySparkTestCase):
 
 def test_BinaryType_serialization(self):
 # Pyrolite version <= 4.9 could not serialize BinaryType with Python3 
SPARK-17808
+# The empty bytearray is test for SPARK-21534.
 schema = StructType([StructField('mybytes', BinaryType())])
 data = [[bytearray(b'here is my data')],
-[bytearray(b'and here is some more')]]
+[bytearray(b'and here is some more')],
+[bytearray(b'')]]
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

2017-10-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 02218c4c7 -> 9104add4c


[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

## What changes were proposed in this pull request?

`ParquetFileFormat` to relax its requirement of output committer class from 
`org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so 
implicitly Hadoop `FileOutputCommitter`) to any committer implementing 
`org.apache.hadoop.mapreduce.OutputCommitter`

This enables output committers which don't write to the filesystem the way 
`FileOutputCommitter` does to save parquet data from a dataframe: at present 
you cannot do this.

Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it 
checks whether the context has requested summary metadata by setting 
`parquet.enable.summary-metadata`. If that is true and the committer class isn't a 
Parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there 
won't be a summary. It also makes the behaviour testable.)

Note that `SQLConf` already states that any `OutputCommitter` can be used, but 
that typically it's a subclass of `ParquetOutputCommitter`. That's not currently 
true. This patch makes the code consistent with the docs, adding tests to 
verify.
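
A hedged sketch of what this enables from the user side; the committer class name and output path below are hypothetical placeholders, not part of this patch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Any org.apache.hadoop.mapreduce.OutputCommitter subclass is accepted;
         # this class name is a made-up placeholder.
         .config("spark.sql.parquet.output.committer.class",
                 "com.example.cloud.CloudParquetCommitter")
         # Summary metadata must not be requested for a non-Parquet committer,
         # otherwise the write fails as described above.
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")
         .getOrCreate())

spark.range(10).write.mode("overwrite").parquet("/tmp/parquet-out")  # placeholder path
```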

## How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, 
`MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a 
marker file in the destination directory. The presence of the marker file can 
be used to verify the new committer was used. The tests then try the 
combinations of Parquet committer summary/no-summary and marking committer 
summary/no-summary.

| committer | summary | outcome |
|---|-|-|
| parquet   | true| success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true| exception |

All tests are happy.

Author: Steve Loughran 

Closes #19448 from steveloughran/cloud/SPARK-22217-committer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9104add4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9104add4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9104add4

Branch: refs/heads/master
Commit: 9104add4c7c6b578df15b64a8533a1266f90734e
Parents: 02218c4
Author: Steve Loughran 
Authored: Fri Oct 13 08:40:26 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Oct 13 08:40:26 2017 +0900

--
 .../org/apache/spark/sql/internal/SQLConf.scala |   5 +-
 .../datasources/parquet/ParquetFileFormat.scala |  12 +-
 .../parquet/ParquetCommitterSuite.scala | 152 +++
 3 files changed, 165 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9104add4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 5832374..618d4a0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -306,8 +306,9 @@ object SQLConf {
 
   val PARQUET_OUTPUT_COMMITTER_CLASS = 
buildConf("spark.sql.parquet.output.committer.class")
 .doc("The output committer class used by Parquet. The specified class 
needs to be a " +
-  "subclass of org.apache.hadoop.mapreduce.OutputCommitter.  Typically, 
it's also a subclass " +
-  "of org.apache.parquet.hadoop.ParquetOutputCommitter.")
+  "subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, 
it's also a subclass " +
+  "of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then 
metadata summaries" +
+  "will never be created, irrespective of the value of 
parquet.enable.summary-metadata")
 .internal()
 .stringConf
 .createWithDefault("org.apache.parquet.hadoop.ParquetOutputCommitter")

http://git-wip-us.apache.org/repos/asf/spark/blob/9104add4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index e1e7405..c1535ba 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 

spark git commit: [SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

2017-10-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 cd51e2c32 -> cfc04e062


[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

## What changes were proposed in this pull request?

`ParquetFileFormat` to relax its requirement of output committer class from 
`org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so 
implicitly Hadoop `FileOutputCommitter`) to any committer implementing 
`org.apache.hadoop.mapreduce.OutputCommitter`

This enables output committers which don't write to the filesystem the way 
`FileOutputCommitter` does to save parquet data from a dataframe: at present 
you cannot do this.

Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it 
checks whether the context has requested summary metadata by setting 
`parquet.enable.summary-metadata`. If that is true and the committer class isn't a 
Parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there 
won't be a summary. It also makes the behaviour testable.)

Note that `SQLConf` already states that any `OutputCommitter` can be used, but 
that typically it's a subclass of `ParquetOutputCommitter`. That's not currently 
true. This patch makes the code consistent with the docs, adding tests to 
verify.

## How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, 
`MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a 
marker file in the destination directory. The presence of the marker file can 
be used to verify the new committer was used. The tests then try the 
combinations of Parquet committer summary/no-summary and marking committer 
summary/no-summary.

| committer | summary | outcome |
|---|-|-|
| parquet   | true| success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true| exception |

All tests are happy.

Author: Steve Loughran 

Closes #19448 from steveloughran/cloud/SPARK-22217-committer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cfc04e06
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cfc04e06
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cfc04e06

Branch: refs/heads/branch-2.2
Commit: cfc04e062b4f3b14d5b846f06c9c85bb2e21cf0a
Parents: cd51e2c
Author: Steve Loughran 
Authored: Fri Oct 13 08:40:26 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Oct 13 10:00:33 2017 +0900

--
 .../org/apache/spark/sql/internal/SQLConf.scala |   5 +-
 .../datasources/parquet/ParquetFileFormat.scala |  12 +-
 .../parquet/ParquetCommitterSuite.scala | 152 +++
 3 files changed, 165 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cfc04e06/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 79398fb..4c29f8e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -268,8 +268,9 @@ object SQLConf {
 
   val PARQUET_OUTPUT_COMMITTER_CLASS = 
buildConf("spark.sql.parquet.output.committer.class")
 .doc("The output committer class used by Parquet. The specified class 
needs to be a " +
-  "subclass of org.apache.hadoop.mapreduce.OutputCommitter.  Typically, 
it's also a subclass " +
-  "of org.apache.parquet.hadoop.ParquetOutputCommitter.")
+  "subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, 
it's also a subclass " +
+  "of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then 
metadata summaries" +
+  "will never be created, irrespective of the value of 
parquet.enable.summary-metadata")
 .internal()
 .stringConf
 .createWithDefault("org.apache.parquet.hadoop.ParquetOutputCommitter")

http://git-wip-us.apache.org/repos/asf/spark/blob/cfc04e06/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index 87fbf8b..1d60495 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 

spark git commit: [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator

2017-10-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 010b50cea -> f8c83fdc5


[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator

Backport of https://github.com/apache/spark/pull/18752 
(https://issues.apache.org/jira/browse/SPARK-21551)

(cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7)

Author: peay 

Closes #19512 from FRosner/branch-2.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8c83fdc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8c83fdc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8c83fdc

Branch: refs/heads/branch-2.2
Commit: f8c83fdc52ba9120098e52a35085448150af6b50
Parents: 010b50c
Author: peay 
Authored: Thu Oct 19 13:07:04 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 19 13:07:04 2017 +0900

--
 .../src/main/scala/org/apache/spark/api/python/PythonRDD.scala | 6 +++---
 python/pyspark/rdd.py  | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f8c83fdc/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index b0dd2fc..807b51f 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -683,7 +683,7 @@ private[spark] object PythonRDD extends Logging {
* Create a socket server and a background thread to serve the data in 
`items`,
*
* The socket server can only accept one connection, or close if no 
connection
-   * in 3 seconds.
+   * in 15 seconds.
*
* Once a connection comes in, it tries to serialize all the data in `items`
* and send them into this connection.
@@ -692,8 +692,8 @@ private[spark] object PythonRDD extends Logging {
*/
   def serveIterator[T](items: Iterator[T], threadName: String): Int = {
 val serverSocket = new ServerSocket(0, 1, 
InetAddress.getByName("localhost"))
-// Close the socket if no connection in 3 seconds
-serverSocket.setSoTimeout(3000)
+// Close the socket if no connection in 15 seconds
+serverSocket.setSoTimeout(15000)
 
 new Thread(threadName) {
   setDaemon(true)

http://git-wip-us.apache.org/repos/asf/spark/blob/f8c83fdc/python/pyspark/rdd.py
--
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 6014179..aca00bc 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -127,7 +127,7 @@ def _load_from_socket(port, serializer):
 af, socktype, proto, canonname, sa = res
 sock = socket.socket(af, socktype, proto)
 try:
-sock.settimeout(3)
+sock.settimeout(15)
 sock.connect(sa)
 except socket.error:
 sock.close()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs

2017-10-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 884d4f95f -> d9798c834


[SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for 
deprecated APIs

## What changes were proposed in this pull request?

This PR proposes to mark the existing warnings as `DeprecationWarning` and 
print out warnings for deprecated functions.

This could actually be useful for Spark app developers. I use (old) PyCharm, and 
this IDE can detect this specific `DeprecationWarning` in some cases:

**Before**

https://user-images.githubusercontent.com/6477701/31762664-df68d9f8-b4f6-11e7-8773-f0468f70a2cc.png

**After**

https://user-images.githubusercontent.com/6477701/31762662-de4d6868-b4f6-11e7-98dc-3c8446a0c28a.png

For console usage, `DeprecationWarning` is usually disabled (see 
https://docs.python.org/2/library/warnings.html#warning-categories and 
https://docs.python.org/3/library/warnings.html#warning-categories):

```
>>> import warnings
>>> filter(lambda f: f[2] == DeprecationWarning, warnings.filters)
[('ignore', <_sre.SRE_Pattern object at 0x10ba58c00>, , <_sre.SRE_Pattern object at 0x10bb04138>, 0), 
('ignore', None, , None, 0)]
```

so, it won't actually mess up the terminal much unless it is intended.

If this is intentionally enabled, it shows up as below:

```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>>
>>> from pyspark.sql import functions
>>> functions.approxCountDistinct("a")
.../spark/python/pyspark/sql/functions.py:232: DeprecationWarning: Deprecated 
in 2.1, use approx_count_distinct instead.
  "Deprecated in 2.1, use approx_count_distinct instead.", DeprecationWarning)
...
```

These instances were found by:

```
cd python/pyspark
grep -r "Deprecated" .
grep -r "deprecated" .
grep -r "deprecate" .
```

## How was this patch tested?

Manually tested.

Author: hyukjinkwon 

Closes #19535 from HyukjinKwon/deprecated-warning.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d9798c83
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d9798c83
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d9798c83

Branch: refs/heads/master
Commit: d9798c834f3fed060cfd18a8d38c398cb2efcc82
Parents: 884d4f9
Author: hyukjinkwon 
Authored: Tue Oct 24 12:44:47 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Oct 24 12:44:47 2017 +0900

--
 python/pyspark/ml/util.py  |  8 +++-
 python/pyspark/mllib/classification.py |  2 +-
 python/pyspark/mllib/evaluation.py |  6 +--
 python/pyspark/mllib/regression.py |  8 ++--
 python/pyspark/sql/dataframe.py|  3 ++
 python/pyspark/sql/functions.py| 18 
 python/pyspark/streaming/flume.py  | 14 +-
 python/pyspark/streaming/kafka.py  | 72 +
 8 files changed, 110 insertions(+), 21 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d9798c83/python/pyspark/ml/util.py
--
diff --git a/python/pyspark/ml/util.py b/python/pyspark/ml/util.py
index 6777291..c3c47bd 100644
--- a/python/pyspark/ml/util.py
+++ b/python/pyspark/ml/util.py
@@ -175,7 +175,9 @@ class JavaMLWriter(MLWriter):
 
 .. note:: Deprecated in 2.1 and will be removed in 3.0, use session 
instead.
 """
-warnings.warn("Deprecated in 2.1 and will be removed in 3.0, use 
session instead.")
+warnings.warn(
+"Deprecated in 2.1 and will be removed in 3.0, use session 
instead.",
+DeprecationWarning)
 self._jwrite.context(sqlContext._ssql_ctx)
 return self
 
@@ -256,7 +258,9 @@ class JavaMLReader(MLReader):
 
 .. note:: Deprecated in 2.1 and will be removed in 3.0, use session 
instead.
 """
-warnings.warn("Deprecated in 2.1 and will be removed in 3.0, use 
session instead.")
+warnings.warn(
+"Deprecated in 2.1 and will be removed in 3.0, use session 
instead.",
+DeprecationWarning)
 self._jread.context(sqlContext._ssql_ctx)
 return self
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d9798c83/python/pyspark/mllib/classification.py
--
diff --git a/python/pyspark/mllib/classification.py 
b/python/pyspark/mllib/classification.py
index e04eeb2..cce703d 100644
--- a/python/pyspark/mllib/classification.py
+++ b/python/pyspark/mllib/classification.py
@@ -311,7 +311,7 @@ class LogisticRegressionWithSGD(object):
 """
 warnings.warn(
 "Deprecated in 2.0.0. Use ml.classification.LogisticRegression or "
-"LogisticRegressionWithLBFGS.")
+

spark git commit: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas

2017-11-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3d90b2cb3 -> 209b9361a


[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas

## What changes were proposed in this pull request?

This change uses Arrow to optimize the creation of a Spark DataFrame from a 
Pandas DataFrame. The input df is sliced according to the default parallelism. 
The optimization is enabled with the existing conf 
"spark.sql.execution.arrow.enabled" and is disabled by default.

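A short usage sketch under that conf (`spark` is assumed to be an existing `SparkSession`, with pandas and pyarrow installed):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"])

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # disabled by default
df = spark.createDataFrame(pdf)  # takes the Arrow path when the types are supported
df.show(3)
```
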
## How was this patch tested?

Added new unit test to create DataFrame with and without the optimization 
enabled, then compare results.

Author: Bryan Cutler 
Author: Takuya UESHIN 

Closes #19459 from BryanCutler/arrow-createDataFrame-from_pandas-SPARK-20791.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/209b9361
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/209b9361
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/209b9361

Branch: refs/heads/master
Commit: 209b9361ac8a4410ff797cff1115e1888e2f7e66
Parents: 3d90b2c
Author: Bryan Cutler 
Authored: Mon Nov 13 13:16:01 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Nov 13 13:16:01 2017 +0900

--
 python/pyspark/context.py   | 28 +++---
 python/pyspark/java_gateway.py  |  1 +
 python/pyspark/serializers.py   | 10 ++-
 python/pyspark/sql/session.py   | 88 +++
 python/pyspark/sql/tests.py | 89 +---
 python/pyspark/sql/types.py | 49 +++
 .../spark/sql/api/python/PythonSQLUtils.scala   | 18 
 .../sql/execution/arrow/ArrowConverters.scala   | 14 +++
 8 files changed, 254 insertions(+), 43 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/209b9361/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index a33f6dc..24905f1 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -475,24 +475,30 @@ class SparkContext(object):
 return xrange(getStart(split), getStart(split + 1), step)
 
 return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
-# Calling the Java parallelize() method with an ArrayList is too slow,
-# because it sends O(n) Py4J commands.  As an alternative, serialized
-# objects are written to a file and loaded through textFile().
+
+# Make sure we distribute data evenly if it's smaller than 
self.batchSize
+if "__len__" not in dir(c):
+c = list(c)# Make it a list so we can compute its length
+batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
+serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
+jrdd = self._serialize_to_jvm(c, numSlices, serializer)
+return RDD(jrdd, self, serializer)
+
+def _serialize_to_jvm(self, data, parallelism, serializer):
+"""
+Calling the Java parallelize() method with an ArrayList is too slow,
+because it sends O(n) Py4J commands.  As an alternative, serialized
+objects are written to a file and loaded through textFile().
+"""
 tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
 try:
-# Make sure we distribute data evenly if it's smaller than 
self.batchSize
-if "__len__" not in dir(c):
-c = list(c)# Make it a list so we can compute its length
-batchSize = max(1, min(len(c) // numSlices, self._batchSize or 
1024))
-serializer = BatchedSerializer(self._unbatched_serializer, 
batchSize)
-serializer.dump_stream(c, tempFile)
+serializer.dump_stream(data, tempFile)
 tempFile.close()
 readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
-jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
+return readRDDFromFile(self._jsc, tempFile.name, parallelism)
 finally:
 # readRDDFromFile eagerily reads the file so we can delete right 
after.
 os.unlink(tempFile.name)
-return RDD(jrdd, self, serializer)
 
 def pickleFile(self, name, minPartitions=None):
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/209b9361/python/pyspark/java_gateway.py
--
diff --git a/python/pyspark/java_gateway.py b/python/pyspark/java_gateway.py
index 3c783ae..3e704fe 100644
--- a/python/pyspark/java_gateway.py
+++ b/python/pyspark/java_gateway.py
@@ -121,6 +121,7 @@ def 

spark git commit: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions running R worker in few tests to speed up

2017-11-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master fba63c1a7 -> d49d9e403


[SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions running R worker in few 
tests to speed up

## What changes were proposed in this pull request?

This is a followup to reduce AppVeyor test time. This PR proposes to reduce the 
number of shuffle partitions in order to reduce the number of tasks that launch 
R workers in a few particular tests.

The symptom is similar to the one described in 
`https://github.com/apache/spark/pull/19722`. Many R processes are newly 
launched on Windows without forking, and this accounts for the difference in 
elapsed time between Linux and Windows.

Here is a simple before/after comparison for this change. I manually 
tested it by disabling `spark.sparkr.use.daemon`; disabling it resembles how the 
tests run on Windows:

**Before**

https://user-images.githubusercontent.com/6477701/33217949-b5528dfa-d17d-11e7-8050-75675c39eb20.png

**After**

https://user-images.githubusercontent.com/6477701/33217958-c6518052-d17d-11e7-9f8e-1be21a784559.png

So, this will probably cut roughly more than 10 minutes off the test time.

## How was this patch tested?

AppVeyor tests

Author: hyukjinkwon 

Closes #19816 from HyukjinKwon/SPARK-21693-followup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d49d9e40
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d49d9e40
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d49d9e40

Branch: refs/heads/master
Commit: d49d9e40383209eed9584a4ef2c3964f27f4a08f
Parents: fba63c1
Author: hyukjinkwon 
Authored: Mon Nov 27 10:09:53 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Nov 27 10:09:53 2017 +0900

--
 R/pkg/tests/fulltests/test_sparkSQL.R | 267 -
 1 file changed, 148 insertions(+), 119 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d49d9e40/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 00217c8..d87f5d2 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -3021,41 +3021,54 @@ test_that("dapplyCollect() on DataFrame with a binary 
column", {
 })
 
 test_that("repartition by columns on DataFrame", {
-  df <- createDataFrame(
-list(list(1L, 1, "1", 0.1), list(1L, 2, "2", 0.2), list(3L, 3, "3", 0.3)),
-c("a", "b", "c", "d"))
-
-  # no column and number of partitions specified
-  retError <- tryCatch(repartition(df), error = function(e) e)
-  expect_equal(grepl
-("Please, specify the number of partitions and/or a column\\(s\\)", 
retError), TRUE)
-
-  # repartition by column and number of partitions
-  actual <- repartition(df, 3, col = df$"a")
-
-  # Checking that at least the dimensions are identical
-  expect_identical(dim(df), dim(actual))
-  expect_equal(getNumPartitions(actual), 3L)
-
-  # repartition by number of partitions
-  actual <- repartition(df, 13L)
-  expect_identical(dim(df), dim(actual))
-  expect_equal(getNumPartitions(actual), 13L)
-
-  expect_equal(getNumPartitions(coalesce(actual, 1L)), 1L)
-
-  # a test case with a column and dapply
-  schema <-  structType(structField("a", "integer"), structField("avg", 
"double"))
-  df <- repartition(df, col = df$"a")
-  df1 <- dapply(
-df,
-function(x) {
-  y <- (data.frame(x$a[1], mean(x$b)))
-},
-schema)
+  # The tasks here launch R workers with shuffles. So, we decrease the number 
of shuffle
+  # partitions to reduce the number of the tasks to speed up the test. This is 
particularly
+  # slow on Windows because the R workers are unable to be forked. See also 
SPARK-21693.
+  conf <- callJMethod(sparkSession, "conf")
+  shufflepartitionsvalue <- callJMethod(conf, "get", 
"spark.sql.shuffle.partitions")
+  callJMethod(conf, "set", "spark.sql.shuffle.partitions", "5")
+  tryCatch({
+df <- createDataFrame(
+  list(list(1L, 1, "1", 0.1), list(1L, 2, "2", 0.2), list(3L, 3, "3", 
0.3)),
+  c("a", "b", "c", "d"))
+
+# no column and number of partitions specified
+retError <- tryCatch(repartition(df), error = function(e) e)
+expect_equal(grepl
+  ("Please, specify the number of partitions and/or a column\\(s\\)", 
retError), TRUE)
+
+# repartition by column and number of partitions
+actual <- repartition(df, 3, col = df$"a")
+
+# Checking that at least the dimensions are identical
+expect_identical(dim(df), dim(actual))
+expect_equal(getNumPartitions(actual), 3L)
+
+# repartition by number of partitions
+actual <- repartition(df, 13L)
+expect_identical(dim(df), dim(actual))
+expect_equal(getNumPartitions(actual), 13L)
+
+

spark git commit: [SPARK-22495] Fix setup of SPARK_HOME variable on Windows

2017-11-22 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1edb3175d -> b4edafa99


[SPARK-22495] Fix setup of SPARK_HOME variable on Windows

## What changes were proposed in this pull request?

This fixes the way `SPARK_HOME` is resolved on Windows. While the previous 
version worked with the built release download, the set of directories 
changed slightly for the PySpark `pip` or `conda` install. This has been 
reflected in the Linux files in `bin`, but not in the Windows `cmd` files.

First fix improves the way how the `jars` directory is found, as this was 
stoping Windows version of `pip/conda` install from working; JARs were not 
found by on Session/Context setup.

Second fix is adding `find-spark-home.cmd` script, which uses 
`find_spark_home.py` script, as the Linux version, to resolve `SPARK_HOME`. It 
is based on `find-spark-home` bash script, though, some operations are done in 
different order due to the `cmd` script language limitations. If environment 
variable is set, the Python script `find_spark_home.py` will not be run. The 
process can fail if Python is not installed, but it will mostly use this way if 
PySpark is installed via `pip/conda`, thus, there is some Python in the system.

## How was this patch tested?

Tested on local installation.

Author: Jakub Nowacki 

Closes #19370 from jsnowacki/fix_spark_cmds.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4edafa9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4edafa9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4edafa9

Branch: refs/heads/master
Commit: b4edafa99bd3858c166adeefdafd93dcd4bc9734
Parents: 1edb317
Author: Jakub Nowacki 
Authored: Thu Nov 23 12:47:38 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 23 12:47:38 2017 +0900

--
 appveyor.yml|  1 +
 bin/find-spark-home.cmd | 60 
 bin/pyspark2.cmd|  2 +-
 bin/run-example.cmd |  4 ++-
 bin/spark-class2.cmd|  2 +-
 bin/spark-shell2.cmd|  4 ++-
 bin/sparkR2.cmd |  2 +-
 7 files changed, 70 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b4edafa9/appveyor.yml
--
diff --git a/appveyor.yml b/appveyor.yml
index dc2d81f..4874092 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -33,6 +33,7 @@ only_commits:
 - core/src/main/scala/org/apache/spark/api/r/
 - mllib/src/main/scala/org/apache/spark/ml/r/
 - core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
+- bin/*.cmd
 
 cache:
   - C:\Users\appveyor\.m2

http://git-wip-us.apache.org/repos/asf/spark/blob/b4edafa9/bin/find-spark-home.cmd
--
diff --git a/bin/find-spark-home.cmd b/bin/find-spark-home.cmd
new file mode 100644
index 000..c75e7ee
--- /dev/null
+++ b/bin/find-spark-home.cmd
@@ -0,0 +1,60 @@
+@echo off
+
+rem
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements.  See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License.  You may obtain a copy of the License at
+rem
+remhttp://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+rem
+
+rem Path to Python script finding SPARK_HOME
+set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py
+
+rem Default to standard python interpreter unless told otherwise
+set PYTHON_RUNNER=python
+rem If PYSPARK_DRIVER_PYTHON is set, it overwrites the python version
+if not "x%PYSPARK_DRIVER_PYTHON%"=="x" (
+  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
+)
+rem If PYSPARK_PYTHON is set, it overwrites the python version
+if not "x%PYSPARK_PYTHON%"=="x" (
+  set PYTHON_RUNNER=%PYSPARK_PYTHON%
+)
+
+rem If there is python installed, trying to use the root dir as SPARK_HOME
+where %PYTHON_RUNNER% > nul 2>$1
+if %ERRORLEVEL% neq 0 (
+  if not exist %PYTHON_RUNNER% (
+if "x%SPARK_HOME%"=="x" (
+  echo Missing Python executable '%PYTHON_RUNNER%', defaulting to 
'%~dp0..' for SPARK_HOME ^
+environment variable. Please install Python or specify the correct Python 
executable in ^
+PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment 

spark git commit: [SPARK-22572][SPARK SHELL] spark-shell does not re-initialize on :replay

2017-11-22 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 572af5027 -> 327d25fe1


[SPARK-22572][SPARK SHELL] spark-shell does not re-initialize on :replay

## What changes were proposed in this pull request?

Ticket: [SPARK-22572](https://issues.apache.org/jira/browse/SPARK-22572)

## How was this patch tested?

Added a new test case to `org.apache.spark.repl.ReplSuite`

Author: Mark Petruska 

Closes #19791 from mpetruska/SPARK-22572.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/327d25fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/327d25fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/327d25fe

Branch: refs/heads/master
Commit: 327d25fe1741f62cd84097e94739f82ecb05383a
Parents: 572af50
Author: Mark Petruska 
Authored: Wed Nov 22 21:35:47 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Nov 22 21:35:47 2017 +0900

--
 .../org/apache/spark/repl/SparkILoop.scala  | 75 +++-
 .../org/apache/spark/repl/SparkILoop.scala  | 74 +++
 .../scala/org/apache/spark/repl/ReplSuite.scala | 10 +++
 3 files changed, 96 insertions(+), 63 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/327d25fe/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala
--
diff --git 
a/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala 
b/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala
index ea279e4..3ce7cc7 100644
--- a/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala
+++ b/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala
@@ -35,40 +35,45 @@ class SparkILoop(in0: Option[BufferedReader], out: 
JPrintWriter)
   def this(in0: BufferedReader, out: JPrintWriter) = this(Some(in0), out)
   def this() = this(None, new JPrintWriter(Console.out, true))
 
+  val initializationCommands: Seq[String] = Seq(
+"""
+@transient val spark = if (org.apache.spark.repl.Main.sparkSession != 
null) {
+org.apache.spark.repl.Main.sparkSession
+  } else {
+org.apache.spark.repl.Main.createSparkSession()
+  }
+@transient val sc = {
+  val _sc = spark.sparkContext
+  if (_sc.getConf.getBoolean("spark.ui.reverseProxy", false)) {
+val proxyUrl = _sc.getConf.get("spark.ui.reverseProxyUrl", null)
+if (proxyUrl != null) {
+  println(
+s"Spark Context Web UI is available at 
${proxyUrl}/proxy/${_sc.applicationId}")
+} else {
+  println(s"Spark Context Web UI is available at Spark Master Public 
URL")
+}
+  } else {
+_sc.uiWebUrl.foreach {
+  webUrl => println(s"Spark context Web UI available at ${webUrl}")
+}
+  }
+  println("Spark context available as 'sc' " +
+s"(master = ${_sc.master}, app id = ${_sc.applicationId}).")
+  println("Spark session available as 'spark'.")
+  _sc
+}
+""",
+"import org.apache.spark.SparkContext._",
+"import spark.implicits._",
+"import spark.sql",
+"import org.apache.spark.sql.functions._"
+  )
+
   def initializeSpark() {
 intp.beQuietDuring {
-  processLine("""
-@transient val spark = if (org.apache.spark.repl.Main.sparkSession != 
null) {
-org.apache.spark.repl.Main.sparkSession
-  } else {
-org.apache.spark.repl.Main.createSparkSession()
-  }
-@transient val sc = {
-  val _sc = spark.sparkContext
-  if (_sc.getConf.getBoolean("spark.ui.reverseProxy", false)) {
-val proxyUrl = _sc.getConf.get("spark.ui.reverseProxyUrl", null)
-if (proxyUrl != null) {
-  println(
-s"Spark Context Web UI is available at 
${proxyUrl}/proxy/${_sc.applicationId}")
-} else {
-  println(s"Spark Context Web UI is available at Spark Master 
Public URL")
-}
-  } else {
-_sc.uiWebUrl.foreach {
-  webUrl => println(s"Spark context Web UI available at ${webUrl}")
-}
-  }
-  println("Spark context available as 'sc' " +
-s"(master = ${_sc.master}, app id = ${_sc.applicationId}).")
-  println("Spark session available as 'spark'.")
-  _sc
-}
-""")
-  processLine("import org.apache.spark.SparkContext._")
-  processLine("import spark.implicits._")
-  processLine("import spark.sql")
-  processLine("import org.apache.spark.sql.functions._")
-  replayCommandStack = Nil // remove above commands from session history.
+  savingReplayStack { // remove the commands from session history.
+

spark git commit: [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveExternalCatalogVersionsSuite

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 38a0532cf -> d7b14746d


[SPARK-22654][TESTS] Retry Spark tarball download if failed in 
HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

Adds a simple loop to retry download of Spark tarballs from different mirrors 
if the download fails.
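
For readers of the archive, here is a standalone sketch of that retry pattern, 
separate from the test-suite change below. The helper name and defaults are 
illustrative, and it assumes `wget` is available on the PATH:

```scala
import scala.sys.process._

// Ask Apache's closer.lua service for a preferred mirror, then try downloading
// from it; repeat a few times and only give up after all attempts fail.
def tryDownloadSpark(version: String, destDir: String, attempts: Int = 3): Unit = {
  val succeeded = (1 to attempts).exists { _ =>
    val mirror =
      Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true",
        "-q", "-O", "-").!!.trim
    val url = s"$mirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
    Seq("wget", url, "-q", "-P", destDir).! == 0
  }
  if (!succeeded) sys.error(s"Unable to download Spark $version after $attempts attempts")
}
```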

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #19851 from srowen/SPARK-22654.

(cherry picked from commit 6eb203fae7bbc9940710da40f314b89ffb4dd324)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7b14746
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7b14746
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7b14746

Branch: refs/heads/branch-2.2
Commit: d7b14746dd9bd488240174446bd158be1e30c250
Parents: 38a0532
Author: Sean Owen 
Authored: Fri Dec 1 01:21:52 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:22:06 2017 +0900

--
 .../hive/HiveExternalCatalogVersionsSuite.scala | 24 +++-
 1 file changed, 18 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d7b14746/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
index 6859432..a3d5b94 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
@@ -20,6 +20,8 @@ package org.apache.spark.sql.hive
 import java.io.File
 import java.nio.file.Files
 
+import scala.sys.process._
+
 import org.apache.spark.TestUtils
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -50,14 +52,24 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 super.afterAll()
   }
 
-  private def downloadSpark(version: String): Unit = {
-import scala.sys.process._
+  private def tryDownloadSpark(version: String, path: String): Unit = {
+// Try mirrors a few times until one succeeds
+for (i <- 0 until 3) {
+  val preferredMirror =
+Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
+  val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
+  logInfo(s"Downloading Spark $version from $url")
+  if (Seq("wget", url, "-q", "-P", path).! == 0) {
+return
+  }
+  logWarning(s"Failed to download Spark $version from $url")
+}
+fail(s"Unable to download Spark $version")
+  }
 
-val preferredMirror =
-  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
-val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
 
-Seq("wget", url, "-q", "-P", sparkTestingDir.getCanonicalPath).!
+  private def downloadSpark(version: String): Unit = {
+tryDownloadSpark(version, sparkTestingDir.getCanonicalPath)
 
 val downloaded = new File(sparkTestingDir, 
s"spark-$version-bin-hadoop2.7.tgz").getCanonicalPath
 val targetDir = new File(sparkTestingDir, 
s"spark-$version").getCanonicalPath


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveExternalCatalogVersionsSuite

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 9c29c5576 -> 6eb203fae


[SPARK-22654][TESTS] Retry Spark tarball download if failed in 
HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

Adds a simple loop to retry download of Spark tarballs from different mirrors 
if the download fails.

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #19851 from srowen/SPARK-22654.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6eb203fa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6eb203fa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6eb203fa

Branch: refs/heads/master
Commit: 6eb203fae7bbc9940710da40f314b89ffb4dd324
Parents: 9c29c55
Author: Sean Owen 
Authored: Fri Dec 1 01:21:52 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:21:52 2017 +0900

--
 .../hive/HiveExternalCatalogVersionsSuite.scala | 24 +++-
 1 file changed, 18 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6eb203fa/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
index 6859432..a3d5b94 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
@@ -20,6 +20,8 @@ package org.apache.spark.sql.hive
 import java.io.File
 import java.nio.file.Files
 
+import scala.sys.process._
+
 import org.apache.spark.TestUtils
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -50,14 +52,24 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 super.afterAll()
   }
 
-  private def downloadSpark(version: String): Unit = {
-import scala.sys.process._
+  private def tryDownloadSpark(version: String, path: String): Unit = {
+// Try mirrors a few times until one succeeds
+for (i <- 0 until 3) {
+  val preferredMirror =
+Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
+  val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
+  logInfo(s"Downloading Spark $version from $url")
+  if (Seq("wget", url, "-q", "-P", path).! == 0) {
+return
+  }
+  logWarning(s"Failed to download Spark $version from $url")
+}
+fail(s"Unable to download Spark $version")
+  }
 
-val preferredMirror =
-  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
-val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
 
-Seq("wget", url, "-q", "-P", sparkTestingDir.getCanonicalPath).!
+  private def downloadSpark(version: String): Unit = {
+tryDownloadSpark(version, sparkTestingDir.getCanonicalPath)
 
 val downloaded = new File(sparkTestingDir, 
s"spark-$version-bin-hadoop2.7.tgz").getCanonicalPath
 val targetDir = new File(sparkTestingDir, 
s"spark-$version").getCanonicalPath


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files containing special characters

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6eb203fae -> 932bd09c8


[SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files 
containing special characters

## What changes were proposed in this pull request?

SPARK-22146 fixed the FileNotFoundException issue only for the `inferSchema` 
method, i.e. only for schema inference; it did not fix the problem when actually 
reading the data, so nearly the same exception happens when someone tries to use 
the data. This PR fixes the problem there as well.
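
A hedged reproduction sketch of the issue (the path and the `spark` session from 
`spark-shell` are illustrative): writing succeeds, while reading the same data 
back used to hit the `FileNotFoundException` because the escaped file path was 
handed to the reader without decoding.

```scala
// Directory name containing a space and a percent sign (hypothetical path).
val path = "/tmp/orc test%dir"

spark.range(2).write.format("orc").save(path)   // write works
spark.read.format("orc").load(path).show()      // read used to fail before this fix
```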

## How was this patch tested?

enhanced UT

Author: Marco Gaido 

Closes #19844 from mgaido91/SPARK-22635.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/932bd09c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/932bd09c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/932bd09c

Branch: refs/heads/master
Commit: 932bd09c80dc2dc113e94f59f4dcb77e77de7c58
Parents: 6eb203f
Author: Marco Gaido 
Authored: Fri Dec 1 01:24:15 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:24:15 2017 +0900

--
 .../org/apache/spark/sql/hive/orc/OrcFileFormat.scala| 11 +--
 .../spark/sql/hive/MetastoreDataSourcesSuite.scala   |  3 ++-
 2 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/932bd09c/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
index 3b33a9f..95741c7 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@@ -133,10 +133,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
 (file: PartitionedFile) => {
   val conf = broadcastedHadoopConf.value.value
 
+  val filePath = new Path(new URI(file.filePath))
+
   // SPARK-8501: Empty ORC files always have an empty schema stored in 
their footer. In this
   // case, `OrcFileOperator.readSchema` returns `None`, and we can't read 
the underlying file
   // using the given physical schema. Instead, we simply return an empty 
iterator.
-  val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), 
Some(conf)).isEmpty
+  val isEmptyFile = OrcFileOperator.readSchema(Seq(filePath.toString), 
Some(conf)).isEmpty
   if (isEmptyFile) {
 Iterator.empty
   } else {
@@ -146,15 +148,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
   val job = Job.getInstance(conf)
   FileInputFormat.setInputPaths(job, file.filePath)
 
-  val fileSplit = new FileSplit(
-new Path(new URI(file.filePath)), file.start, file.length, 
Array.empty
-  )
+  val fileSplit = new FileSplit(filePath, file.start, file.length, 
Array.empty)
   // Custom OrcRecordReader is used to get
   // ObjectInspector during recordReader creation itself and can
   // avoid NameNode call in unwrapOrcStructs per file.
   // Specifically would be helpful for partitioned datasets.
-  val orcReader = OrcFile.createReader(
-new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
+  val orcReader = OrcFile.createReader(filePath, 
OrcFile.readerOptions(conf))
   new SparkOrcNewRecordReader(orcReader, conf, fileSplit.getStart, 
fileSplit.getLength)
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/932bd09c/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
index a106047..c8caba8 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
@@ -1350,7 +1350,8 @@ class MetastoreDataSourcesSuite extends QueryTest with 
SQLTestUtils with TestHiv
   withTempDir { dir =>
 val tmpFile = s"$dir/$nameWithSpecialChars"
 spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
-spark.read.format(format).load(tmpFile)
+val fileContent = spark.read.format(format).load(tmpFile)
+checkAnswer(fileContent, Seq(Row("a"), Row("b")))
   }
 }
   }



spark git commit: [SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior whe…

2017-11-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 087879a77 -> 33d43bf1b


[SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior whe…

## What changes were proposed in this pull request?

In the PySpark API documentation, DataFrame.write.csv() says that setting the 
quote parameter to an empty string should turn off quoting. Instead, it uses the 
[null character](https://en.wikipedia.org/wiki/Null_character) as the quote.

This PR fixes the doc.
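
A small sketch of the documented behaviour (run in `spark-shell`, so `spark` is 
the shell's SparkSession; the output path is hypothetical):

```scala
import org.apache.spark.sql.Encoders

val ds = spark.createDataset(Seq("a,b", "c"))(Encoders.STRING)
ds.write
  .option("quote", "")          // empty string -> NUL quote character (U+0000), not "no quoting"
  .csv("/tmp/csv-quote-demo")   // hypothetical output directory
```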

## How was this patch tested?

Manual.

```
cd python/docs
make html
open _build/html/pyspark.sql.html
```

Author: gaborgsomogyi 

Closes #19814 from gaborgsomogyi/SPARK-22484.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33d43bf1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33d43bf1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33d43bf1

Branch: refs/heads/master
Commit: 33d43bf1b6f55594187066f0e38ba3985fa2542b
Parents: 087879a
Author: gaborgsomogyi 
Authored: Tue Nov 28 10:14:35 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Nov 28 10:14:35 2017 +0900

--
 python/pyspark/sql/readwriter.py  | 3 +--
 .../src/main/scala/org/apache/spark/sql/DataFrameWriter.scala | 3 ++-
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/33d43bf1/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a75bdf8..1ad974e 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -828,8 +828,7 @@ class DataFrameWriter(OptionUtils):
 set, it uses the default value, ``,``.
 :param quote: sets the single character used for escaping quoted 
values where the
   separator can be part of the value. If None is set, it 
uses the default
-  value, ``"``. If you would like to turn off quotations, 
you need to set an
-  empty string.
+  value, ``"``. If an empty string is set, it uses 
``u`` (null character).
 :param escape: sets the single character used for escaping quotes 
inside an already
quoted value. If None is set, it uses the default 
value, ``\``
 :param escapeQuotes: a flag indicating whether values containing 
quotes should always

http://git-wip-us.apache.org/repos/asf/spark/blob/33d43bf1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index e3fa2ce..35abecc 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -592,7 +592,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) 
{
* `sep` (default `,`): sets the single character as a separator for each
* field and value.
* `quote` (default `"`): sets the single character used for escaping 
quoted values where
-   * the separator can be part of the value.
+   * the separator can be part of the value. If an empty string is set, it 
uses `u`
+   * (null character).
* `escape` (default `\`): sets the single character used for escaping 
quotes inside
* an already quoted value.
* `escapeQuotes` (default `true`): a flag indicating whether values 
containing


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22585][CORE] Path in addJar is not url encoded

2017-11-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8ff474f6e -> ab6f60c4d


[SPARK-22585][CORE] Path in addJar is not url encoded

## What changes were proposed in this pull request?

This updates the behavior of the `addJar` method of the `SparkContext` class. If 
a path without any scheme is passed as input, it is used literally, without URL 
encoding/decoding.
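
A hedged sketch of the new behaviour, assuming `sc` is the `SparkContext` from 
`spark-shell`; the directory and file names are hypothetical and only exist to 
show that percent-escape-like characters stay untouched:

```scala
import java.io.File

// '%3A' and '%2F' are ordinary characters in these (hypothetical) names.
val dir = new File("/tmp/host%3A443")
dir.mkdirs()
val jar = File.createTempFile("t%2F", ".jar", dir)

sc.addJar(jar.getAbsolutePath)   // no scheme, so the path is taken literally
println(sc.listJars())           // the jar should now appear in the list
```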

## How was this patch tested?

A unit test is added for this.

Author: Jakub Dubovsky 

Closes #19834 from james64/SPARK-22585-encode-add-jar.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab6f60c4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab6f60c4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab6f60c4

Branch: refs/heads/master
Commit: ab6f60c4d6417cbb0240216a6b492aadcca3043e
Parents: 8ff474f
Author: Jakub Dubovsky 
Authored: Thu Nov 30 10:24:30 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 30 10:24:30 2017 +0900

--
 core/src/main/scala/org/apache/spark/SparkContext.scala  |  6 +-
 .../test/scala/org/apache/spark/SparkContextSuite.scala  | 11 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ab6f60c4/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 984dd0a..c174939 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -1837,7 +1837,11 @@ class SparkContext(config: SparkConf) extends Logging {
 Utils.validateURL(uri)
 uri.getScheme match {
   // A JAR file which exists only on the driver node
-  case null | "file" => addJarFile(new File(uri.getPath))
+  case null =>
+// SPARK-22585 path without schema is not url encoded
+addJarFile(new File(uri.getRawPath))
+  // A JAR file which exists only on the driver node
+  case "file" => addJarFile(new File(uri.getPath))
   // A JAR file which exists locally on every worker node
   case "local" => "file:" + uri.getPath
   case _ => path

http://git-wip-us.apache.org/repos/asf/spark/blob/ab6f60c4/core/src/test/scala/org/apache/spark/SparkContextSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/SparkContextSuite.scala 
b/core/src/test/scala/org/apache/spark/SparkContextSuite.scala
index 0ed5f26..2bde875 100644
--- a/core/src/test/scala/org/apache/spark/SparkContextSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkContextSuite.scala
@@ -309,6 +309,17 @@ class SparkContextSuite extends SparkFunSuite with 
LocalSparkContext with Eventu
 assert(sc.listJars().head.contains(tmpJar.getName))
   }
 
+  test("SPARK-22585 addJar argument without scheme is interpreted literally 
without url decoding") {
+val tmpDir = new File(Utils.createTempDir(), "host%3A443")
+tmpDir.mkdirs()
+val tmpJar = File.createTempFile("t%2F", ".jar", tmpDir)
+
+sc = new SparkContext("local", "test")
+
+sc.addJar(tmpJar.getAbsolutePath)
+assert(sc.listJars().size === 1)
+  }
+
   test("Cancelling job group should not cause SparkContext to shutdown 
(SPARK-6414)") {
 try {
   sc = new SparkContext(new 
SparkConf().setAppName("test").setMaster("local"))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in Python 3.6.0 / NumPy 1.13.3

2017-11-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ab6f60c4d -> 92cfbeeb5


[SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in 
Python 3.6.0 / NumPy 1.13.3

## What changes were proposed in this pull request?

The image test seems to fail in Python 3.6.0 / NumPy 1.13.3. I manually tested 
as below:

```
==
ERROR: test_read_images (pyspark.ml.tests.ImageReaderTest)
--
Traceback (most recent call last):
  File "/.../spark/python/pyspark/ml/tests.py", line 1831, in test_read_images
self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row)
  File "/.../spark/python/pyspark/ml/image.py", line 149, in toImage
data = bytearray(array.astype(dtype=np.uint8).ravel())
TypeError: only integer scalar arrays can be converted to a scalar index

--
Ran 1 test in 7.606s
```

To be clear, I think the error comes from NumPy - 
https://github.com/numpy/numpy/blob/75b2d5d427afdb1392f2a0b2092e0767e4bab53d/numpy/core/src/multiarray/number.c#L947

For a smaller scope:

```python
>>> import numpy as np
>>> bytearray(np.array([1]).astype(dtype=np.uint8))
Traceback (most recent call last):
  File "", line 1, in 
TypeError: only integer scalar arrays can be converted to a scalar index
```

In Python 2.7 / NumPy 1.13.1, it prints:

```
bytearray(b'\x01')
```

So, here, I simply worked around it by converting it to bytes as below:

```python
>>> bytearray(np.array([1]).astype(dtype=np.uint8).tobytes())
bytearray(b'\x01')
```

Also, while looking into it again, I realised a few arguments could be quite 
confusing, for example, a `Row` that needs some specific attributes, and 
`numpy.ndarray`. I added some type checking and tests accordingly, so it now 
shows an error message as below:

```
TypeError: array argument should be numpy.ndarray; however, it got [].
```

## How was this patch tested?

Manually tested with `./python/run-tests`.

And also:

```
PYSPARK_PYTHON=python3 SPARK_TESTING=1 bin/pyspark pyspark.ml.tests 
ImageReaderTest
```

Author: hyukjinkwon 

Closes #19835 from HyukjinKwon/SPARK-21866-followup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92cfbeeb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92cfbeeb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92cfbeeb

Branch: refs/heads/master
Commit: 92cfbeeb5ce9e2c618a76b3fe60ce84b9d38605b
Parents: ab6f60c
Author: hyukjinkwon 
Authored: Thu Nov 30 10:26:55 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 30 10:26:55 2017 +0900

--
 python/pyspark/ml/image.py | 27 ---
 python/pyspark/ml/tests.py | 20 +++-
 2 files changed, 43 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/92cfbeeb/python/pyspark/ml/image.py
--
diff --git a/python/pyspark/ml/image.py b/python/pyspark/ml/image.py
index 7d14f05..2b61aa9 100644
--- a/python/pyspark/ml/image.py
+++ b/python/pyspark/ml/image.py
@@ -108,12 +108,23 @@ class _ImageSchema(object):
 """
 Converts an image to an array with metadata.
 
-:param image: The image to be converted.
+:param `Row` image: A row that contains the image to be converted. It 
should
+have the attributes specified in `ImageSchema.imageSchema`.
 :return: a `numpy.ndarray` that is an image.
 
 .. versionadded:: 2.3.0
 """
 
+if not isinstance(image, Row):
+raise TypeError(
+"image argument should be pyspark.sql.types.Row; however, "
+"it got [%s]." % type(image))
+
+if any(not hasattr(image, f) for f in self.imageFields):
+raise ValueError(
+"image argument should have attributes specified in "
+"ImageSchema.imageSchema [%s]." % ", ".join(self.imageFields))
+
 height = image.height
 width = image.width
 nChannels = image.nChannels
@@ -127,15 +138,20 @@ class _ImageSchema(object):
 """
 Converts an array with metadata to a two-dimensional image.
 
-:param array array: The array to convert to image.
+:param `numpy.ndarray` array: The array to convert to image.
 :param str origin: Path to the image, optional.
 :return: a :class:`Row` that is a two dimensional image.
 
 .. versionadded:: 2.3.0
 """
 
+if not isinstance(array, np.ndarray):
+raise TypeError(
+"array argument should be 

spark git commit: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages

2017-12-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ee10ca7ec -> aa4cf2b19


[SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for 
ImageSchema.readImages

## What changes were proposed in this pull request?

Calling `ImageSchema.readImages` multiple times, as below, in the PySpark shell:

```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, 
dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, 
dropImageFailures=True).collect()
```

throws an error as below:

```
...
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
connection to the given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating 
connection pool (set lazyInit to true if you expect to start your database 
after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, 
see the next exception for details.
...
at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
...
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
...
at 
org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
...
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:100)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:88)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at 
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at 
org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
at 
org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:574)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:593)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
at 
org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
...
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, 
see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 121 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /.../spark/metastore_db.
...
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
dropImageFailures, float(sampleRatio), seed)
  File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
1160, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 

spark git commit: [SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files containing special characters

2017-12-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 ba00bd961 -> f3f8c8767


[SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files 
containing special characters

## What changes were proposed in this pull request?

SPARK-22146 fixed the FileNotFoundException issue only for the `inferSchema` 
method, i.e. only for schema inference; it did not fix the problem when actually 
reading the data, so nearly the same exception happens when someone tries to use 
the data. This PR fixes the problem there as well.

## How was this patch tested?

enhanced UT

Author: Marco Gaido 

Closes #19844 from mgaido91/SPARK-22635.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3f8c876
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3f8c876
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3f8c876

Branch: refs/heads/branch-2.2
Commit: f3f8c8767efbe8c941b4181f71587c65a05e1b82
Parents: ba00bd9
Author: Marco Gaido 
Authored: Fri Dec 1 01:24:15 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 18:18:57 2017 +0900

--
 .../org/apache/spark/sql/hive/orc/OrcFileFormat.scala| 11 +--
 .../spark/sql/hive/MetastoreDataSourcesSuite.scala   |  3 ++-
 2 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f3f8c876/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
index 54e8f82..2defd31 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@@ -131,10 +131,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
 (file: PartitionedFile) => {
   val conf = broadcastedHadoopConf.value.value
 
+  val filePath = new Path(new URI(file.filePath))
+
   // SPARK-8501: Empty ORC files always have an empty schema stored in 
their footer. In this
   // case, `OrcFileOperator.readSchema` returns `None`, and we can't read 
the underlying file
   // using the given physical schema. Instead, we simply return an empty 
iterator.
-  val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), 
Some(conf)).isEmpty
+  val isEmptyFile = OrcFileOperator.readSchema(Seq(filePath.toString), 
Some(conf)).isEmpty
   if (isEmptyFile) {
 Iterator.empty
   } else {
@@ -144,15 +146,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
   val job = Job.getInstance(conf)
   FileInputFormat.setInputPaths(job, file.filePath)
 
-  val fileSplit = new FileSplit(
-new Path(new URI(file.filePath)), file.start, file.length, 
Array.empty
-  )
+  val fileSplit = new FileSplit(filePath, file.start, file.length, 
Array.empty)
   // Custom OrcRecordReader is used to get
   // ObjectInspector during recordReader creation itself and can
   // avoid NameNode call in unwrapOrcStructs per file.
   // Specifically would be helpful for partitioned datasets.
-  val orcReader = OrcFile.createReader(
-new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
+  val orcReader = OrcFile.createReader(filePath, 
OrcFile.readerOptions(conf))
   new SparkOrcNewRecordReader(orcReader, conf, fileSplit.getStart, 
fileSplit.getLength)
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f3f8c876/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
index c0acffb..d62ed19 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
@@ -1355,7 +1355,8 @@ class MetastoreDataSourcesSuite extends QueryTest with 
SQLTestUtils with TestHiv
   withTempDir { dir =>
 val tmpFile = s"$dir/$nameWithSpecialChars"
 spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
-spark.read.format(format).load(tmpFile)
+val fileContent = spark.read.format(format).load(tmpFile)
+checkAnswer(fileContent, Seq(Row("a"), Row("b")))
   }
 }
   }



spark git commit: [SPARK-22811][PYSPARK][ML] Fix pyspark.ml.tests failure when Hive is not available.

2017-12-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 46776234a -> 0c8fca460


[SPARK-22811][PYSPARK][ML] Fix pyspark.ml.tests failure when Hive is not 
available.

## What changes were proposed in this pull request?

pyspark.ml.tests is missing a py4j import. I've added the import and fixed the 
test that uses it. This test was only failing when testing without Hive.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: Bago Amirbekian 

Closes #19997 from MrBago/fix-ImageReaderTest2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0c8fca46
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0c8fca46
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0c8fca46

Branch: refs/heads/master
Commit: 0c8fca4608643ed9e1eb3ae8620e6f4f6a017a87
Parents: 4677623
Author: Bago Amirbekian 
Authored: Sat Dec 16 10:57:35 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Dec 16 10:57:35 2017 +0900

--
 python/pyspark/ml/tests.py | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0c8fca46/python/pyspark/ml/tests.py
--
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 3a0b816..be15211 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -44,6 +44,7 @@ import array as pyarray
 import numpy as np
 from numpy import abs, all, arange, array, array_equal, inf, ones, tile, zeros
 import inspect
+import py4j
 
 from pyspark import keyword_only, SparkContext
 from pyspark.ml import Estimator, Model, Pipeline, PipelineModel, Transformer, 
UnaryTransformer
@@ -1859,8 +1860,9 @@ class ImageReaderTest2(PySparkTestCase):
 
 @classmethod
 def setUpClass(cls):
-PySparkTestCase.setUpClass()
+super(ImageReaderTest2, cls).setUpClass()
 # Note that here we enable Hive's support.
+cls.spark = None
 try:
 cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
 except py4j.protocol.Py4JError:
@@ -1873,8 +1875,10 @@ class ImageReaderTest2(PySparkTestCase):
 
 @classmethod
 def tearDownClass(cls):
-PySparkTestCase.tearDownClass()
-cls.spark.sparkSession.stop()
+super(ImageReaderTest2, cls).tearDownClass()
+if cls.spark is not None:
+cls.spark.sparkSession.stop()
+cls.spark = None
 
 def test_read_images_multiple_times(self):
 # This test case is to check if `ImageSchema.readImages` tries to


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-19809][SQL][TEST] NullPointerException on zero-size ORC file

2017-12-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 704af4bd6 -> 17cdabb88


[SPARK-19809][SQL][TEST] NullPointerException on zero-size ORC file

## What changes were proposed in this pull request?

Until 2.2.1, Spark raises `NullPointerException` on zero-size ORC files. 
Usually, these zero-size ORC files are generated by 3rd-party apps like Flume.

```scala
scala> sql("create table empty_orc(a int) stored as orc location 
'/tmp/empty_orc'")

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
...
Caused by: java.lang.NullPointerException at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
```

After [SPARK-22279](https://github.com/apache/spark/pull/19499), Apache Spark 
with the default configuration doesn't have this bug. Although the Hive 1.2.1 
library code path still has the problem, we had better have test coverage for 
what we have now in order to prevent a future regression.

## How was this patch tested?

Pass a newly added test case.

Author: Dongjoon Hyun 

Closes #19948 from dongjoon-hyun/SPARK-19809-EMPTY-FILE.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17cdabb8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17cdabb8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17cdabb8

Branch: refs/heads/master
Commit: 17cdabb88761e67ca555299109f89afdf02a4280
Parents: 704af4b
Author: Dongjoon Hyun 
Authored: Wed Dec 13 07:42:24 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Dec 13 07:42:24 2017 +0900

--
 .../spark/sql/hive/execution/SQLQuerySuite.scala   | 17 +
 1 file changed, 17 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/17cdabb8/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
index f2562c3..93c91d3 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
@@ -2172,4 +2172,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils 
with TestHiveSingleton {
   }
 }
   }
+
+  test("SPARK-19809 NullPointerException on zero-size ORC file") {
+Seq("native", "hive").foreach { orcImpl =>
+  withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl) {
+withTempPath { dir =>
+  withTable("spark_19809") {
+sql(s"CREATE TABLE spark_19809(a int) STORED AS ORC LOCATION 
'$dir'")
+Files.touch(new File(s"${dir.getCanonicalPath}", "zero.orc"))
+
+withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { // 
default since 2.3.0
+  checkAnswer(sql("SELECT * FROM spark_19809"), Seq.empty)
+}
+  }
+}
+  }
+}
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22813][BUILD] Use lsof or /usr/sbin/lsof in run-tests.py

2017-12-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master fbfa9be7e -> 3a07eff5a


[SPARK-22813][BUILD] Use lsof or /usr/sbin/lsof in run-tests.py

## What changes were proposed in this pull request?

In [an environment where `/usr/sbin/lsof` does not 
exist](https://github.com/apache/spark/pull/19695#issuecomment-342865001), 
`./dev/run-tests.py` for `maven` causes the following error. This is because the 
current `./dev/run-tests.py` checks only for the existence of `/usr/sbin/lsof` 
and aborts immediately if it does not exist.

This PR changes it to check whether `lsof` or `/usr/sbin/lsof` exists.

```
/bin/sh: 1: /usr/sbin/lsof: not found

Usage:
 kill [options]  [...]

Options:
  [...]send signal to every  listed
 -, -s, --signal 
specify the  to be sent
 -l, --list=[]  list all signal names, or convert one to a name
 -L, --tablelist all signal names in a nice table

 -h, --help display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
Traceback (most recent call last):
  File "./dev/run-tests.py", line 626, in 
main()
  File "./dev/run-tests.py", line 597, in main
build_apache_spark(build_tool, hadoop_version)
  File "./dev/run-tests.py", line 389, in build_apache_spark
build_spark_maven(hadoop_version)
  File "./dev/run-tests.py", line 329, in build_spark_maven
exec_maven(profiles_and_goals)
  File "./dev/run-tests.py", line 270, in exec_maven
kill_zinc_on_port(zinc_port)
  File "./dev/run-tests.py", line 258, in kill_zinc_on_port
subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
```

## How was this patch tested?

manually tested

Author: Kazuaki Ishizaki 

Closes #19998 from kiszk/SPARK-22813.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a07eff5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a07eff5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a07eff5

Branch: refs/heads/master
Commit: 3a07eff5af601511e97a05e6fea0e3d48f74c4f0
Parents: fbfa9be
Author: Kazuaki Ishizaki 
Authored: Tue Dec 19 07:35:03 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Dec 19 07:35:03 2017 +0900

--
 dev/run-tests.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3a07eff5/dev/run-tests.py
--
diff --git a/dev/run-tests.py b/dev/run-tests.py
index ef0e788..7e6f7ff 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -253,9 +253,9 @@ def kill_zinc_on_port(zinc_port):
 """
 Kill the Zinc process running on the given port, if one exists.
 """
-cmd = ("/usr/sbin/lsof -P |grep %s | grep LISTEN "
-   "| awk '{ print $2; }' | xargs kill") % zinc_port
-subprocess.check_call(cmd, shell=True)
+cmd = "%s -P |grep %s | grep LISTEN | awk '{ print $2; }' | xargs kill"
+lsof_exe = which("lsof")
+subprocess.check_call(cmd % (lsof_exe if lsof_exe else "/usr/sbin/lsof", 
zinc_port), shell=True)
 
 
 def exec_maven(mvn_args=()):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Revert "Revert "[SPARK-22496][SQL] thrift server adds operation logs""

2017-12-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 772e4648d -> fbfa9be7e


Revert "Revert "[SPARK-22496][SQL] thrift server adds operation logs""

This reverts commit e58f275678fb4f904124a4a2a1762f04c835eb0e.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fbfa9be7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fbfa9be7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fbfa9be7

Branch: refs/heads/master
Commit: fbfa9be7e0df0b4489571422c45d0d64d05d3050
Parents: 772e464
Author: hyukjinkwon 
Authored: Tue Dec 19 07:30:29 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Dec 19 07:30:29 2017 +0900

--
 .../cli/operation/ExecuteStatementOperation.java   | 13 +
 .../hive/service/cli/operation/SQLOperation.java   | 12 
 .../thriftserver/SparkExecuteStatementOperation.scala  |  1 +
 3 files changed, 14 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fbfa9be7/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java
--
diff --git 
a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java
 
b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java
index 3f2de10..6740d3b 100644
--- 
a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java
+++ 
b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java
@@ -23,6 +23,7 @@ import java.util.Map;
 
 import org.apache.hadoop.hive.ql.processors.CommandProcessor;
 import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory;
+import org.apache.hadoop.hive.ql.session.OperationLog;
 import org.apache.hive.service.cli.HiveSQLException;
 import org.apache.hive.service.cli.OperationType;
 import org.apache.hive.service.cli.session.HiveSession;
@@ -67,4 +68,16 @@ public abstract class ExecuteStatementOperation extends 
Operation {
   this.confOverlay = confOverlay;
 }
   }
+
+  protected void registerCurrentOperationLog() {
+if (isOperationLogEnabled) {
+  if (operationLog == null) {
+LOG.warn("Failed to get current OperationLog object of Operation: " +
+  getHandle().getHandleIdentifier());
+isOperationLogEnabled = false;
+return;
+  }
+  OperationLog.setCurrentOperationLog(operationLog);
+}
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/fbfa9be7/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java
--
diff --git 
a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java
 
b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java
index 5014ced..fd9108e 100644
--- 
a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java
+++ 
b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java
@@ -274,18 +274,6 @@ public class SQLOperation extends 
ExecuteStatementOperation {
 }
   }
 
-  private void registerCurrentOperationLog() {
-if (isOperationLogEnabled) {
-  if (operationLog == null) {
-LOG.warn("Failed to get current OperationLog object of Operation: " +
-getHandle().getHandleIdentifier());
-isOperationLogEnabled = false;
-return;
-  }
-  OperationLog.setCurrentOperationLog(operationLog);
-}
-  }
-
   private void cleanup(OperationState state) throws HiveSQLException {
 setState(state);
 if (shouldRunAsync()) {

http://git-wip-us.apache.org/repos/asf/spark/blob/fbfa9be7/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
--
diff --git 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
index f5191fa..664bc20 100644
--- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
+++ 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
@@ -170,6 +170,7 @@ private[hive] class SparkExecuteStatementOperation(
 override def run(): Unit = {
   val doAsAction = new PrivilegedExceptionAction[Unit]() {
 override 

spark git commit: [SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 0c8fca460 -> c2aeddf9e


[SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor

## What changes were proposed in this pull request?

`testthat` 2.0.0 has been released, and AppVeyor has started to use it instead 
of 1.0.2. Since then, R tests have been failing in AppVeyor. See - 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1967-master

```
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
```

This seems to be because we rely on the internal `testthat:::run_tests` here:

https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75

https://github.com/apache/spark/blob/dc4c351837879dab26ad8fb471dc51c06832a9e4/R/pkg/tests/run-all.R#L49-L52

However, it seems it was removed in 2.0.0. I tried a few other exposed APIs 
like `test_dir` but failed to make a good compatible fix.

It seems we had better pin the `testthat` version first to make the build pass.

## How was this patch tested?

Manually tested and AppVeyor tests.

Author: hyukjinkwon 

Closes #20003 from HyukjinKwon/SPARK-22817.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c2aeddf9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c2aeddf9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c2aeddf9

Branch: refs/heads/master
Commit: c2aeddf9eae2f8f72c244a4b16af264362d6cf5d
Parents: 0c8fca4
Author: hyukjinkwon 
Authored: Sun Dec 17 14:40:41 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Dec 17 14:40:41 2017 +0900

--
 appveyor.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c2aeddf9/appveyor.yml
--
diff --git a/appveyor.yml b/appveyor.yml
index 4874092..aee94c5 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -42,7 +42,9 @@ install:
   # Install maven and dependencies
   - ps: .\dev\appveyor-install-dependencies.ps1
   # Required package for R unit tests
-  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
+  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
+  # Here, we use the fixed version of testthat. For more details, please see 
SPARK-22817.
+  - cmd: R -e "devtools::install_version('testthat', version = '1.0.2', 
repos='http://cran.us.r-project.org')"
   - cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); 
packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"
 
 build_script:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b4f4be396 -> 1e4cca02f


[SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor

## What changes were proposed in this pull request?

`testthat` 2.0.0 was released and AppVeyor has now started to use it instead of 
1.0.2. As a result, the SparkR tests started failing in AppVeyor. See - 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1967-master

```
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
```

This seems to be because we rely on the internal `testthat:::run_tests` here:

https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75

https://github.com/apache/spark/blob/dc4c351837879dab26ad8fb471dc51c06832a9e4/R/pkg/tests/run-all.R#L49-L52

However, it seems to have been removed in 2.0.0. I tried a few other exposed APIs 
such as `test_dir`, but failed to come up with a good compatible fix.

It seems better to pin the `testthat` version first so that the build passes.

## How was this patch tested?

Manually tested and verified via the AppVeyor tests.

Author: hyukjinkwon 

Closes #20003 from HyukjinKwon/SPARK-22817.

(cherry picked from commit c2aeddf9eae2f8f72c244a4b16af264362d6cf5d)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e4cca02
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e4cca02
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e4cca02

Branch: refs/heads/branch-2.2
Commit: 1e4cca02f9ea4899fa59e6df4295780f0729d6d2
Parents: b4f4be3
Author: hyukjinkwon 
Authored: Sun Dec 17 14:40:41 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Dec 17 14:41:05 2017 +0900

--
 appveyor.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1e4cca02/appveyor.yml
--
diff --git a/appveyor.yml b/appveyor.yml
index bc527e8..c7660f1 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -42,7 +42,9 @@ install:
   # Install maven and dependencies
   - ps: .\dev\appveyor-install-dependencies.ps1
   # Required package for R unit tests
-  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
+  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
+  # Here, we use the fixed version of testthat. For more details, please see 
SPARK-22817.
+  - cmd: R -e "devtools::install_version('testthat', version = '1.0.2', 
repos='http://cran.us.r-project.org')"
   - cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); 
packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"
 
 build_script:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in release-build.sh

2017-11-13 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 ca19271cc -> 7bdad58e2


[SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in 
release-build.sh

## What changes were proposed in this pull request?

This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing from the path, to 
fix the nightly snapshot Jenkins jobs. Please refer to 
https://github.com/apache/spark/pull/19359#issuecomment-340139557:

> Looks like some of the snapshot builds are having lsof issues:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
>
>https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console
>
>spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
>found
>usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]

To my knowledge, the full path of `lsof` is required for non-root users on a 
few OSes.

## How was this patch tested?

Manually tested as below:

```bash
#!/usr/bin/env bash

LSOF=lsof
if ! hash $LSOF 2>/dev/null; then
  echo "a"
  LSOF=/usr/sbin/lsof
fi

$LSOF -P | grep "a"
```

Author: hyukjinkwon 

Closes #19695 from HyukjinKwon/SPARK-22377.

(cherry picked from commit c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7bdad58e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7bdad58e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7bdad58e

Branch: refs/heads/branch-2.1
Commit: 7bdad58e2baac98e7b77f17aaa6c88de230a220e
Parents: ca19271
Author: hyukjinkwon 
Authored: Tue Nov 14 08:28:13 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Nov 14 08:28:43 2017 +0900

--
 dev/create-release/release-build.sh | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7bdad58e/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index ad32c31..eefd864 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -121,6 +121,13 @@ else
   fi
 fi
 
+# This is a band-aid fix to avoid the failure of Maven nightly snapshot in 
some Jenkins
+# machines by explicitly calling /usr/sbin/lsof. Please see SPARK-22377 and 
the discussion
+# in its pull request.
+LSOF=lsof
+if ! hash $LSOF 2>/dev/null; then
+  LSOF=/usr/sbin/lsof
+fi
 
 if [ -z "$SPARK_PACKAGE_VERSION" ]; then
   SPARK_PACKAGE_VERSION="${SPARK_VERSION}-$(date +%Y_%m_%d_%H_%M)-${git_hash}"
@@ -341,7 +348,7 @@ if [[ "$1" == "publish-snapshot" ]]; then
 -DskipTests $PUBLISH_PROFILES clean deploy
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   rm $tmp_settings
   cd ..
@@ -379,7 +386,7 @@ if [[ "$1" == "publish-release" ]]; then
 -DskipTests $PUBLISH_PROFILES clean install
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   ./dev/change-version-to-2.10.sh
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in release-build.sh

2017-11-13 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f7534b37e -> c8b7f97b8


[SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in 
release-build.sh

## What changes were proposed in this pull request?

This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing from the path, to 
fix the nightly snapshot Jenkins jobs. Please refer to 
https://github.com/apache/spark/pull/19359#issuecomment-340139557:

> Looks like some of the snapshot builds are having lsof issues:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
>
>https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console
>
>spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
>found
>usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]

To my knowledge, the full path of `lsof` is required for non-root users on a 
few OSes.

## How was this patch tested?

Manually tested as below:

```bash
#!/usr/bin/env bash

LSOF=lsof
if ! hash $LSOF 2>/dev/null; then
  echo "a"
  LSOF=/usr/sbin/lsof
fi

$LSOF -P | grep "a"
```

Author: hyukjinkwon 

Closes #19695 from HyukjinKwon/SPARK-22377.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8b7f97b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8b7f97b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8b7f97b

Branch: refs/heads/master
Commit: c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99
Parents: f7534b3
Author: hyukjinkwon 
Authored: Tue Nov 14 08:28:13 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Nov 14 08:28:13 2017 +0900

--
 dev/create-release/release-build.sh | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c8b7f97b/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 7e8d5c7..5b43f9b 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -130,6 +130,13 @@ else
   fi
 fi
 
+# This is a band-aid fix to avoid the failure of Maven nightly snapshot in 
some Jenkins
+# machines by explicitly calling /usr/sbin/lsof. Please see SPARK-22377 and 
the discussion
+# in its pull request.
+LSOF=lsof
+if ! hash $LSOF 2>/dev/null; then
+  LSOF=/usr/sbin/lsof
+fi
 
 if [ -z "$SPARK_PACKAGE_VERSION" ]; then
   SPARK_PACKAGE_VERSION="${SPARK_VERSION}-$(date +%Y_%m_%d_%H_%M)-${git_hash}"
@@ -345,7 +352,7 @@ if [[ "$1" == "publish-snapshot" ]]; then
   #  -DskipTests $SCALA_2_12_PROFILES $PUBLISH_PROFILES clean deploy
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   rm $tmp_settings
   cd ..
@@ -382,7 +389,7 @@ if [[ "$1" == "publish-release" ]]; then
   #  -DskipTests $SCALA_2_12_PROFILES §$PUBLISH_PROFILES clean install
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   #./dev/change-scala-version.sh 2.11
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in release-build.sh

2017-11-13 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 d905e85d2 -> 3ea6fd0c4


[SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in 
release-build.sh

## What changes were proposed in this pull request?

This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing from the path, to 
fix the nightly snapshot Jenkins jobs. Please refer to 
https://github.com/apache/spark/pull/19359#issuecomment-340139557:

> Looks like some of the snapshot builds are having lsof issues:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
>
>https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console
>
>spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
>found
>usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]

To my knowledge, the full path of `lsof` is required for non-root users on a 
few OSes.

## How was this patch tested?

Manually tested as below:

```bash
#!/usr/bin/env bash

LSOF=lsof
if ! hash $LSOF 2>/dev/null; then
  echo "a"
  LSOF=/usr/sbin/lsof
fi

$LSOF -P | grep "a"
```

Author: hyukjinkwon 

Closes #19695 from HyukjinKwon/SPARK-22377.

(cherry picked from commit c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ea6fd0c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ea6fd0c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ea6fd0c

Branch: refs/heads/branch-2.2
Commit: 3ea6fd0c4610cd5cd0762802e88ac392c92d631c
Parents: d905e85
Author: hyukjinkwon 
Authored: Tue Nov 14 08:28:13 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Nov 14 08:28:28 2017 +0900

--
 dev/create-release/release-build.sh | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3ea6fd0c/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 819f325..1272b6d 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -121,6 +121,13 @@ else
   fi
 fi
 
+# This is a band-aid fix to avoid the failure of Maven nightly snapshot in 
some Jenkins
+# machines by explicitly calling /usr/sbin/lsof. Please see SPARK-22377 and 
the discussion
+# in its pull request.
+LSOF=lsof
+if ! hash $LSOF 2>/dev/null; then
+  LSOF=/usr/sbin/lsof
+fi
 
 if [ -z "$SPARK_PACKAGE_VERSION" ]; then
   SPARK_PACKAGE_VERSION="${SPARK_VERSION}-$(date +%Y_%m_%d_%H_%M)-${git_hash}"
@@ -337,7 +344,7 @@ if [[ "$1" == "publish-snapshot" ]]; then
 -DskipTests $PUBLISH_PROFILES clean deploy
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   rm $tmp_settings
   cd ..
@@ -375,7 +382,7 @@ if [[ "$1" == "publish-release" ]]; then
 -DskipTests $PUBLISH_PROFILES clean install
 
   # Clean-up Zinc nailgun process
-  lsof -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
+  $LSOF -P |grep $ZINC_PORT | grep LISTEN | awk '{ print $2; }' | xargs kill
 
   ./dev/change-version-to-2.10.sh
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22554][PYTHON] Add a config to control if PySpark should use daemon or not for workers

2017-11-19 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master b10837ab1 -> 57c5514de


[SPARK-22554][PYTHON] Add a config to control if PySpark should use daemon or 
not for workers

## What changes were proposed in this pull request?

This PR proposes to add a flag to control whether PySpark should use the daemon or not.

Actually, SparkR already has a flag for useDaemon:
https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362

It'd be great if we had this flag too. It makes it easier to debug 
Windows-specific issues.
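
As a usage illustration (a minimal sketch, not part of this patch; the app name 
and master below are placeholders), the new `spark.python.use.daemon` flag can 
be set on the regular Spark configuration before the `SparkContext` is created:

```python
# Minimal sketch: turning the PySpark daemon off so that workers
# (pyspark/worker.py) are launched directly instead of being forked from
# pyspark/daemon.py. On Windows the setting has no effect, since the daemon
# is never used there.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("daemon-flag-demo")
        .set("spark.python.use.daemon", "false"))  # default is "true"

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # tasks run through non-daemon workers
sc.stop()
```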

## How was this patch tested?

Manually tested.

Author: hyukjinkwon 

Closes #19782 from HyukjinKwon/use-daemon-flag.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57c5514d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57c5514d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57c5514d

Branch: refs/heads/master
Commit: 57c5514de9dba1c14e296f85fb13fef23ce8c73f
Parents: b10837a
Author: hyukjinkwon 
Authored: Mon Nov 20 13:34:06 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Nov 20 13:34:06 2017 +0900

--
 .../org/apache/spark/api/python/PythonWorkerFactory.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/57c5514d/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
index fc595ae..f53c617 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
@@ -38,7 +38,12 @@ private[spark] class PythonWorkerFactory(pythonExec: String, 
envVars: Map[String
   // (pyspark/daemon.py) and tell it to fork new workers for our tasks. This 
daemon currently
   // only works on UNIX-based systems now because it uses signals for child 
management, so we can
   // also fall back to launching workers (pyspark/worker.py) directly.
-  val useDaemon = !System.getProperty("os.name").startsWith("Windows")
+  val useDaemon = {
+val useDaemonEnabled = 
SparkEnv.get.conf.getBoolean("spark.python.use.daemon", true)
+
+// This flag is ignored on Windows as it's unable to fork.
+!System.getProperty("os.name").startsWith("Windows") && useDaemonEnabled
+  }
 
   var daemon: Process = null
   val daemonHost = InetAddress.getByAddress(Array(127, 0, 0, 1))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22557][TEST] Use ThreadSignaler explicitly

2017-11-19 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d54bfec2e -> b10837ab1


[SPARK-22557][TEST] Use ThreadSignaler explicitly

## What changes were proposed in this pull request?

ScalaTest 3.0 uses an implicit `Signaler`. This PR makes sure all Spark 
tests use `ThreadSignaler` explicitly, which has the same default behavior of 
interrupting a thread on the JVM as ScalaTest 2.2.x. This will reduce 
potential flakiness.

## How was this patch tested?

This is a test-suite-only update. It should pass the Jenkins tests.

Author: Dongjoon Hyun 

Closes #19784 from dongjoon-hyun/use_thread_signaler.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b10837ab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b10837ab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b10837ab

Branch: refs/heads/master
Commit: b10837ab1a7bef04bf7a2773b9e44ed9206643fe
Parents: d54bfec
Author: Dongjoon Hyun 
Authored: Mon Nov 20 13:32:01 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Nov 20 13:32:01 2017 +0900

--
 .../test/scala/org/apache/spark/DistributedSuite.scala|  7 +--
 core/src/test/scala/org/apache/spark/DriverSuite.scala|  5 -
 core/src/test/scala/org/apache/spark/UnpersistSuite.scala |  8 ++--
 .../scala/org/apache/spark/deploy/SparkSubmitSuite.scala  |  9 -
 .../scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala |  5 -
 .../org/apache/spark/scheduler/DAGSchedulerSuite.scala|  5 -
 .../OutputCommitCoordinatorIntegrationSuite.scala |  5 -
 .../org/apache/spark/storage/BlockManagerSuite.scala  | 10 --
 .../test/scala/org/apache/spark/util/EventLoopSuite.scala |  5 -
 .../execution/streaming/ProcessingTimeExecutorSuite.scala |  8 +---
 .../scala/org/apache/spark/sql/streaming/StreamTest.scala |  2 ++
 .../org/apache/spark/sql/hive/SparkSubmitTestUtils.scala  |  5 -
 .../scala/org/apache/spark/streaming/ReceiverSuite.scala  |  5 +++--
 .../apache/spark/streaming/StreamingContextSuite.scala|  5 +++--
 .../spark/streaming/receiver/BlockGeneratorSuite.scala|  7 ---
 15 files changed, 68 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b10837ab/core/src/test/scala/org/apache/spark/DistributedSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/DistributedSuite.scala 
b/core/src/test/scala/org/apache/spark/DistributedSuite.scala
index f800561..ea9f6d2 100644
--- a/core/src/test/scala/org/apache/spark/DistributedSuite.scala
+++ b/core/src/test/scala/org/apache/spark/DistributedSuite.scala
@@ -18,7 +18,7 @@
 package org.apache.spark
 
 import org.scalatest.Matchers
-import org.scalatest.concurrent.TimeLimits._
+import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
 import org.scalatest.time.{Millis, Span}
 
 import org.apache.spark.security.EncryptionFunSuite
@@ -30,7 +30,10 @@ class NotSerializableExn(val notSer: NotSerializableClass) 
extends Throwable() {
 
 
 class DistributedSuite extends SparkFunSuite with Matchers with 
LocalSparkContext
-  with EncryptionFunSuite {
+  with EncryptionFunSuite with TimeLimits {
+
+  // Necessary to make ScalaTest 3.x interrupt a thread on the JVM like 
ScalaTest 2.2.x
+  implicit val defaultSignaler: Signaler = ThreadSignaler
 
   val clusterUrl = "local-cluster[2,1,1024]"
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b10837ab/core/src/test/scala/org/apache/spark/DriverSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/DriverSuite.scala 
b/core/src/test/scala/org/apache/spark/DriverSuite.scala
index be80d27..962945e 100644
--- a/core/src/test/scala/org/apache/spark/DriverSuite.scala
+++ b/core/src/test/scala/org/apache/spark/DriverSuite.scala
@@ -19,7 +19,7 @@ package org.apache.spark
 
 import java.io.File
 
-import org.scalatest.concurrent.TimeLimits
+import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
 import org.scalatest.prop.TableDrivenPropertyChecks._
 import org.scalatest.time.SpanSugar._
 
@@ -27,6 +27,9 @@ import org.apache.spark.util.Utils
 
 class DriverSuite extends SparkFunSuite with TimeLimits {
 
+  // Necessary to make ScalaTest 3.x interrupt a thread on the JVM like 
ScalaTest 2.2.x
+  implicit val defaultSignaler: Signaler = ThreadSignaler
+
   ignore("driver should exit after finishing without cleanup (SPARK-530)") {
 val sparkHome = sys.props.getOrElse("spark.test.home", 
fail("spark.test.home is not set!"))
 val masters = Table("master", "local", "local-cluster[2,1,1024]")


spark git commit: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow

2017-11-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master dce1610ae -> 8f0e88df0


[SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in 
createDataFrame with Arrow

## What changes were proposed in this pull request?

If the schema is passed as a list of unicode strings for column names, they should 
be re-encoded to 'utf-8' to be consistent. This is similar to #13097, but for 
DataFrame creation using Arrow.
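
For illustration (a minimal sketch, not taken from this patch; it assumes pandas 
and pyarrow are installed so that Arrow-based conversion can be enabled, and the 
app name is a placeholder), column names behave the same whether they come from 
the pandas DataFrame itself or from an explicit schema list:

```python
# Minimal sketch: creating Spark DataFrames from pandas with unicode column
# names while spark.sql.execution.arrow.enabled is set to true.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-unicode-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({u'a': [1, 2, 3]})

df1 = spark.createDataFrame(pdf)                 # names taken from the pandas columns
df2 = spark.createDataFrame(pdf, schema=[u'b'])  # names given as unicode strings

print(df1.columns, df2.columns)  # per the new test, both end up as plain 'str' names
spark.stop()
```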

## How was this patch tested?

Added a new test that uses unicode names for the schema.

Author: Bryan Cutler 

Closes #19738 from 
BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8f0e88df
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8f0e88df
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8f0e88df

Branch: refs/heads/master
Commit: 8f0e88df03a06a91bb61c6e0d69b1b19e2bfb3f7
Parents: dce1610
Author: Bryan Cutler 
Authored: Wed Nov 15 23:35:13 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Nov 15 23:35:13 2017 +0900

--
 python/pyspark/sql/session.py |  7 ---
 python/pyspark/sql/tests.py   | 10 ++
 2 files changed, 14 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8f0e88df/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 589365b..dbbcfff 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -592,6 +592,9 @@ class SparkSession(object):
 
 if isinstance(schema, basestring):
 schema = _parse_datatype_string(schema)
+elif isinstance(schema, (list, tuple)):
+# Must re-encode any unicode strings to be consistent with 
StructField names
+schema = [x.encode('utf-8') if not isinstance(x, str) else x for x 
in schema]
 
 try:
 import pandas
@@ -602,7 +605,7 @@ class SparkSession(object):
 
 # If no schema supplied by user then get the names of columns only
 if schema is None:
-schema = [str(x) for x in data.columns]
+schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in data.columns]
 
 if self.conf.get("spark.sql.execution.arrow.enabled", 
"false").lower() == "true" \
 and len(data) > 0:
@@ -630,8 +633,6 @@ class SparkSession(object):
 verify_func(obj)
 return obj,
 else:
-if isinstance(schema, (list, tuple)):
-schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in schema]
 prepare = lambda obj: obj
 
 if isinstance(data, RDD):

http://git-wip-us.apache.org/repos/asf/spark/blob/8f0e88df/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 6356d93..ef592c2 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3225,6 +3225,16 @@ class ArrowTests(ReusedSQLTestCase):
 df = self.spark.createDataFrame(pdf, schema=tuple('abcdefg'))
 self.assertEquals(df.schema.fieldNames(), list('abcdefg'))
 
+def test_createDataFrame_column_name_encoding(self):
+import pandas as pd
+pdf = pd.DataFrame({u'a': [1]})
+columns = self.spark.createDataFrame(pdf).columns
+self.assertTrue(isinstance(columns[0], str))
+self.assertEquals(columns[0], 'a')
+columns = self.spark.createDataFrame(pdf, [u'b']).columns
+self.assertTrue(isinstance(columns[0], str))
+self.assertEquals(columns[0], 'b')
+
 def test_createDataFrame_with_single_data_type(self):
 import pandas as pd
 with QuietTest(self.sc):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22476][R] Add dayofweek function to R

2017-11-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3eb315d71 -> 223d83ee9


[SPARK-22476][R] Add dayofweek function to R

## What changes were proposed in this pull request?

This PR adds `dayofweek` to R API:

```r
data <- list(list(d = as.Date("2012-12-13")),
 list(d = as.Date("2013-12-14")),
 list(d = as.Date("2014-12-15")))
df <- createDataFrame(data)
collect(select(df, dayofweek(df$d)))
```

```
  dayofweek(d)
1            5
2            7
3            2
```
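
For cross-checking these values, the same function is also exposed through the 
Python API (a minimal sketch, not part of this R-focused patch; it assumes Spark 
2.3.0+, where `pyspark.sql.functions.dayofweek` is available, and the app name 
is a placeholder):

```python
# Minimal sketch: the same dates evaluated with the Python binding of the
# underlying SQL dayofweek function (1 = Sunday, ..., 7 = Saturday).
import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dayofweek-demo").getOrCreate()
df = spark.createDataFrame(
    [(datetime.date(2012, 12, 13),),
     (datetime.date(2013, 12, 14),),
     (datetime.date(2014, 12, 15),)], ["d"])
df.select(F.dayofweek("d")).show()  # expected values: 5, 7, 2
spark.stop()
```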

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon 

Closes #19706 from HyukjinKwon/add-dayofweek.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/223d83ee
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/223d83ee
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/223d83ee

Branch: refs/heads/master
Commit: 223d83ee93e604009afea4af3d13a838d08625a4
Parents: 3eb315d
Author: hyukjinkwon 
Authored: Sat Nov 11 19:16:31 2017 +0900
Committer: hyukjinkwon 
Committed: Sat Nov 11 19:16:31 2017 +0900

--
 R/pkg/NAMESPACE   |  1 +
 R/pkg/R/functions.R   | 17 -
 R/pkg/R/generics.R|  5 +
 R/pkg/tests/fulltests/test_sparkSQL.R |  1 +
 4 files changed, 23 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/223d83ee/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 3fc756b..57838f5 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -232,6 +232,7 @@ exportMethods("%<=>%",
   "date_sub",
   "datediff",
   "dayofmonth",
+  "dayofweek",
   "dayofyear",
   "decode",
   "dense_rank",

http://git-wip-us.apache.org/repos/asf/spark/blob/223d83ee/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 0143a3e..237ef06 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -696,7 +696,7 @@ setMethod("hash",
 #'
 #' \dontrun{
 #' head(select(df, df$time, year(df$time), quarter(df$time), month(df$time),
-#'dayofmonth(df$time), dayofyear(df$time), weekofyear(df$time)))
+#' dayofmonth(df$time), dayofweek(df$time), dayofyear(df$time), 
weekofyear(df$time)))
 #' head(agg(groupBy(df, year(df$time)), count(df$y), avg(df$y)))
 #' head(agg(groupBy(df, month(df$time)), avg(df$y)))}
 #' @note dayofmonth since 1.5.0
@@ -708,6 +708,21 @@ setMethod("dayofmonth",
   })
 
 #' @details
+#' \code{dayofweek}: Extracts the day of the week as an integer from a
+#' given date/timestamp/string.
+#'
+#' @rdname column_datetime_functions
+#' @aliases dayofweek dayofweek,Column-method
+#' @export
+#' @note dayofweek since 2.3.0
+setMethod("dayofweek",
+  signature(x = "Column"),
+  function(x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "dayofweek", 
x@jc)
+column(jc)
+  })
+
+#' @details
 #' \code{dayofyear}: Extracts the day of the year as an integer from a
 #' given date/timestamp/string.
 #'

http://git-wip-us.apache.org/repos/asf/spark/blob/223d83ee/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 8312d41..8fcf269 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1051,6 +1051,11 @@ setGeneric("dayofmonth", function(x) { 
standardGeneric("dayofmonth") })
 #' @rdname column_datetime_functions
 #' @export
 #' @name NULL
+setGeneric("dayofweek", function(x) { standardGeneric("dayofweek") })
+
+#' @rdname column_datetime_functions
+#' @export
+#' @name NULL
 setGeneric("dayofyear", function(x) { standardGeneric("dayofyear") })
 
 #' @rdname column_string_functions

http://git-wip-us.apache.org/repos/asf/spark/blob/223d83ee/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index a0dbd47..8a7fb12 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1699,6 +1699,7 @@ test_that("date functions on a DataFrame", {
 list(a = 2L, b = as.Date("2013-12-14")),
 list(a = 3L, b = as.Date("2014-12-15")))
   df <- createDataFrame(l)
+  expect_equal(collect(select(df, dayofweek(df$b)))[, 1], c(5, 7, 2))
   expect_equal(collect(select(df, dayofmonth(df$b)))[, 1], c(13, 14, 15))
   expect_equal(collect(select(df, dayofyear(df$b)))[, 1], c(348, 
