[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21466 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3719/ Test PASSed.
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21466 Merged build finished. Test PASSed.
[GitHub] spark issue #21463: [SPARK-23754][BRANCH-2.3][PYTHON] Re-raising StopIterati...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21463 Let me leave a cc for you, @icexelloss or @viirya. It seems we should really clean this up in master and try it on the worker side - I need to take a closer look too. If you are all busy, I will take a look myself.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 I confirm that the -XX:+UseStringDeduplication option is available only with the G1 GC, and it is off by default. So if we decide to use it, I guess we won't be able to enforce it reliably, especially for applications that just use some library code from Spark (by the way, this issue was found in the YARN Node Manager). Another issue with string deduplication in G1 is that it's not aggressive at all. It's done by a single thread that scans the entire heap only when the other threads are lightly loaded. In the case that we investigated, the number of duplicate strings was high, yet the strings were relatively short-lived. That is, they survived long enough to create significant pressure on the GC, but I don't think that timeframe would be enough for the deduplication thread to eliminate them. In summary, this kind of targeted, explicit string deduplication is not uncommon at all, and it works really well. Usually you just need to add an .intern() call in a few places in the code. What I had to do here is more involved because of the extra problem with java.io.File.
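[Editor's note] A minimal, self-contained sketch of the targeted-interning idea described above (illustrative only, not the PR's actual change): `String.intern()` returns the canonical instance from the JVM string pool, so logically equal strings collapse to one object immediately, instead of waiting for G1's opportunistic background deduplication thread. The path literal below is a made-up example.

```scala
object InternSketch {
  def main(args: Array[String]): Unit = {
    // Simulate many logically-equal path strings arriving as distinct objects,
    // as happens when paths are repeatedly parsed from bytes or concatenated.
    val raw = Seq.fill(1000)(new String("usercache/app_0001/filecache/part-00000"))
    // Targeted deduplication: intern at the point of creation.
    val interned = raw.map(_.intern())
    println(raw(0) eq raw(1))           // false -- 1000 separate String objects on the heap
    println(interned(0) eq interned(1)) // true  -- all share one canonical String
  }
}
```

The tradeoff is a lookup in the string pool on each call, which is usually cheap compared to the GC pressure of keeping thousands of duplicate copies alive.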
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21466 **[Test build #91327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91327/testReport)** for PR 21466 at commit [`5a2b7c7`](https://github.com/apache/spark/commit/5a2b7c7c7fa5fc7ae669858c76a5dd43adfdf966).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001814

--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",

     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+    method to run the PrefixSpan algorithm.
+
+    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+    (Wikipedia)</a>
+    .. versionadded:: 2.4.0
+
+    """
+
+    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+                       "sequential pattern. Sequential pattern that appears more than " +
+                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+                       typeConverter=TypeConverters.toFloat)
+
+    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+                             "The maximal length of the sequential pattern. Must be > 0.",
+                             typeConverter=TypeConverters.toInt)
+
+    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+                               "The maximum number of items (including delimiters used in the " +
+                               "internal storage format) allowed in a projected database before " +
+                               "local processing. If a projected database exceeds this size, " +
+                               "another iteration of distributed prefix growth is run. " +
+                               "Must be > 0.",
+                               typeConverter=TypeConverters.toInt)
+
+    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+                        "dataset, rows with nulls in this column are ignored.",
+                        typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                 sequenceCol="sequence"):
+        """
+        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200, \
+                 sequenceCol="sequence")
+        """
+        super(PrefixSpan, self).__init__()
+        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                         sequenceCol="sequence")
+        kwargs = self._input_kwargs
+        self.setParams(**kwargs)
+
+    @keyword_only
+    @since("2.4.0")
+    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                  sequenceCol="sequence"):
+        """
+        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200, \
+                  sequenceCol="sequence")
+        """
+        kwargs = self._input_kwargs
+        return self._set(**kwargs)
+
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(self, dataset):
+        """
+        .. note:: Experimental
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
--- End diff --

There is no `Dataset` in PySpark.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001923

--- Diff: python/pyspark/ml/fpm.py ---
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
--- End diff --

ditto
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192002383

--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",

     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(dataset,
+                                       sequenceCol,
+                                       minSupport,
+                                       maxPatternLength,
+                                       maxLocalProjDBSize):
+        """
+        .. note:: Experimental
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
+        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+                            column are ignored.
+        :param minSupport: The minimal support level of the sequential pattern, any pattern that
+                           appears more than (minSupport * size-of-the-dataset) times will be
+                           output (recommended value: `0.1`).
+        :param maxPatternLength: The maximal length of the sequential pattern
+                                 (recommended value: `10`).
+        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+                                   internal storage format) allowed in a projected database before
+                                   local processing. If a projected database exceeds this size,
+                                   another iteration of distributed prefix growth is run
+                                   (recommended value: `3200`).
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
+                 - `freq: Long`
+
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

We should keep doctest examples simple to read. For example, including `maxLocalProjDBSize` is not useful because we don't expect users to tune this param often.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192000950

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -53,7 +53,7 @@ final class PrefixSpan(@Since("2.4.0") override val uid: String) extends Params
   @Since("2.4.0")
   val minSupport = new DoubleParam(this, "minSupport", "The minimal support level of the " +
     "sequential pattern. Sequential pattern that appears more than " +
-    "(minSupport * size-of-the-dataset)." +
+    "(minSupport * size-of-the-dataset)" +
--- End diff --

Need a space at the end before "times".
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001886

--- Diff: python/pyspark/ml/fpm.py ---
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
--- End diff --

We should use a SQL type here.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192002416

--- Diff: python/pyspark/ml/fpm.py ---
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
+        ...                      Row(sequence=[[1], [3, 2], [1, 2]]),
+        ...                      Row(sequence=[[1, 2], [5]]),
+        ...                      Row(sequence=[[6]])]).toDF()
+        >>> prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5,
+        ...                         maxLocalProjDBSize=3200)
--- End diff --

remove this
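[Editor's note] To ground the doctest discussion above, here is a minimal sketch of the equivalent call against the Scala-side `org.apache.spark.ml.fpm.PrefixSpan` that this Python wrapper delegates to. It assumes the Spark 2.4 setter-style API; per the review feedback, `maxLocalProjDBSize` is left at its default rather than set explicitly.

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

object PrefixSpanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("prefixspan-sketch").getOrCreate()

    // Same data as the doctest: one row per sequence of itemsets.
    val df = spark.createDataFrame(Seq(
      Tuple1(Seq(Seq(1, 2), Seq(3))),
      Tuple1(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
      Tuple1(Seq(Seq(1, 2), Seq(5))),
      Tuple1(Seq(Seq(6))))).toDF("sequence")

    // Set only the params users typically touch; maxLocalProjDBSize keeps its default.
    new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .findFrequentSequentialPatterns(df)  // returns a DataFrame with `sequence` and `freq` columns
      .show()

    spark.stop()
  }
}
```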
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21466 cc @jerryshao, not a big deal, but I thought it would be useful.
[GitHub] spark pull request #21466: [MINOR][YARN] Add YARN-specific credential provid...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/21466

[MINOR][YARN] Add YARN-specific credential providers in debug logging message

## What changes were proposed in this pull request?

This PR adds a debug log listing the YARN-specific credential providers loaded via the service-loader mechanism. It took me a while to figure out whether a provider was actually loaded or not; I had to explicitly set the deprecated configuration and check whether it was actually picked up.

## How was this patch tested?

Manually tested. The logs look like:

```
Using the following builtin delegation token providers: hadoopfs, hive, hbase.
Using the following YARN-specific credential providers: yarn-test.
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark minor-log

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21466.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21466

commit 5a2b7c7c7fa5fc7ae669858c76a5dd43adfdf966
Author: hyukjinkwon
Date: 2018-05-31T06:31:19Z

    Add YARN-specific credential providers in debug logging message
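[Editor's note] As background on the mechanism the new log line describes, here is a hedged, self-contained sketch of how the `java.util.ServiceLoader` pattern discovers such providers. The `CredentialProvider` trait and `serviceName` member below are illustrative stand-ins, not Spark's actual classes.

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Illustrative stand-in for the provider interface; implementations are
// registered via META-INF/services/<fully.qualified.InterfaceName> files.
trait CredentialProvider { def serviceName: String }

object ProviderLoaderSketch {
  def main(args: Array[String]): Unit = {
    // ServiceLoader scans the classpath for registered implementations.
    val providers = ServiceLoader.load(classOf[CredentialProvider]).asScala.toSeq
    // The kind of debug line this PR adds: without it, there is no easy way
    // to confirm whether a provider on the classpath was actually picked up.
    println("Using the following YARN-specific credential providers: " +
      providers.map(_.serviceName).mkString(", ") + ".")
  }
}
```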
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21381 Merged build finished. Test PASSed.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21381 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3718/ Test PASSed.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21381 **[Test build #91326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91326/testReport)** for PR 21381 at commit [`54b3b5f`](https://github.com/apache/spark/commit/54b3b5f1c973638d4c32d2b297c8b7c9ff72f28a).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192000596

--- Diff: python/pyspark/ml/fpm.py ---
+    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+                               "The maximum number of items (including delimiters used in the " +
+                               "internal storage format) allowed in a projected database before " +
+                               "local processing. If a projected database exceeds this size, " +
+                               "another iteration of distributed prefix growth is run. " +
+                               "Must be > 0.",
+                               typeConverter=TypeConverters.toInt)
--- End diff --

Just test that the Python 'int' type range is the same as the Java 'long' type.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21381 retest this please
[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21333#discussion_r192000399

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }
   }

+  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
--- End diff --

Have we tested when all input RDDs are empty?
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/ Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Merged build finished. Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #4191 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4191/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3717/ Test PASSed.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Merged build finished. Test PASSed.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191996249

--- Diff: python/pyspark/ml/fpm.py ---
+                               typeConverter=TypeConverters.toInt)
--- End diff --

There isn't a `TypeConverters.toLong` -- do I need to add it? My idea is that `TypeConverters.toInt` also fits the Long type on the Python side, so I have not added one for now.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21265 **[Test build #91325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191995667

--- Diff: python/pyspark/ml/fpm.py ---
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

I think it is better to put this in an example. @mengxr, what do you think?
[GitHub] spark pull request #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing te...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21362#discussion_r191994811

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1504,15 +1504,16 @@ test_that("column functions", {
   expect_equal(result, "cba")

   # Test array_sort() and sort_array()
-  df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 6L, 5L, NA, 4L))))
+  df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 5L, NA, 4L))))
+  as_integer_lists <- function(x) lapply(x, lapply, as.integer)
-  result <- collect(select(df, array_sort(df[[1]])))[[1]]
-  expect_equal(result, list(list(1L, 2L, 3L, NA), list(4L, 5L, 6L, NA, NA)))
+  result <- as_integer_lists(collect(select(df, array_sort(df[[1]])))[[1]])
--- End diff --

+1
[GitHub] spark pull request #21454: [SPARK-24337][Core] Improve error messages for Sp...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21454
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user ifilonenko commented on the issue: https://github.com/apache/spark/pull/21092 Will resolve comments today @mccheah
[GitHub] spark issue #20276: [SPARK-14948][SQL] disambiguate attributes in join condi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20276 Hi, @cloud-fan. If this PR is still valid, could you resolve the conflicts?
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91318/ Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91318/testReport)** for PR 21454 at commit [`38ffa3e`](https://github.com/apache/spark/commit/38ffa3e551c7eee69d12cc736d33d137abd333b7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing te...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21362#discussion_r191986143

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
+  result <- as_integer_lists(collect(select(df, array_sort(df[[1]])))[[1]])
--- End diff --

`as_integer_lists` doesn't seem like the right approach - it basically coerces everything returned into integers (even though it might not be returned that way from the JVM)
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91319/ Test PASSed.
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed.
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 **[Test build #91319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91319/testReport)** for PR 21366 at commit [`260d82c`](https://github.com/apache/spark/commit/260d82ca9fbbd16ad8174d0dafa2f95bc177a219).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing tests for...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21362 ok, let us know if you have more information on this
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21453 I filed an issue with the Scala community about the interface changes, and they said those REPL APIs are intended to be private: https://github.com/scala/bug/issues/10913 That being said, they gave us a couple of ways to work around it, and I'm testing them now.
[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21455 Simply making these fields publicly accessible seems a little weird from Spark's side. Maybe we can use reflection instead.
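[Editor's note] A minimal sketch of the reflection alternative being suggested. The `Holder`/`offsets` names below are hypothetical stand-ins; the actual KafkaStream field names would differ.

```scala
object ReflectionSketch {
  // Hypothetical stand-in for a class whose private field an external caller
  // wants to read without widening its visibility.
  private class Holder { private val offsets: Map[Int, Long] = Map(0 -> 42L) }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    val field = h.getClass.getDeclaredField("offsets")
    field.setAccessible(true)     // bypass `private` for this one read
    println(field.get(h))         // Map(0 -> 42) -- no change to Holder's API needed
  }
}
```

The usual tradeoff applies: reflection keeps the upstream API surface unchanged, but silently breaks if the private field is ever renamed, so it belongs behind a small, well-tested accessor.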
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21454 LGTM Thanks! Merged to master
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21346 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3716/ Test PASSed.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21346 Merged build finished. Test PASSed.
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21453 Spark currently cannot be built against Scala 2.11.12, due to some method changes in the REPL. We have tried it internally.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21346 **[Test build #91324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91324/testReport)** for PR 21346 at commit [`83c3271`](https://github.com/apache/spark/commit/83c3271d2f45bbef18d865bddbc6807e9fbd2503).
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91316/ Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21456 From my understanding, that option is only available with G1GC, which is not really a good fit for Spark (I forget the exact details, but something about humongous allocations, which are common with all the large byte buffers typical of Spark).
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91316/testReport)** for PR 21454 at commit [`b7ff38f`](https://github.com/apache/spark/commit/b7ff38f16c91b7df326f49d1f821b14a6dc82e8d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21346

> is this effectively dead code at this point?

Yes, that's right. This PR by itself is not useful. It's a step towards https://github.com/apache/spark/pull/21451 This is a good point to put in the PR summary -- I'll do that, and also your summary notes above, if you don't mind.

> what are the major risks of this change in terms of introducing performance or correctness issues? If we identify risks (e.g. "this is a historically tricky area of code?"), can we mitigate those risks through correctness testing / load testing?

I've made an effort to make minimal modifications to all existing code paths, to minimize the risk of introducing bugs in current functionality. My intention is to turn it on by default initially only for cases we know would fail with the old code -- when the data is > 2GB ([SPARK-24297](https://issues.apache.org/jira/browse/SPARK-24297)). I've added unit tests and shared the test I'm doing on a cluster just to find holes in functionality (posted on the parent jira here: https://issues.apache.org/jira/browse/SPARK-6235?focusedCommentId=16484069&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16484069). I have not done load testing yet but plan to. Extra testing, of course, would certainly be good.
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191981552

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java ---
@@ -38,15 +38,24 @@
    *
    * This method will not be called in parallel for a single TransportClient (i.e., channel).
    *
+   * The rpc *might* include a data stream in streamData (e.g. for uploading a large
+   * amount of data which should not be buffered in memory here). Any errors while handling the
+   * streamData will lead to failing this entire connection -- all other in-flight rpcs will fail.
--- End diff --

Pretty good question actually :) I will take a closer look at this myself, but I believe this connection is shared by other tasks running on the same executor which are trying to talk to the same destination. So that might mean another task which is replicating to the same destination, or reading data from that same remote executor. Those don't have specific retry behavior for a closed connection -- that might result in the data just not getting replicated, fetching the data from elsewhere, or the task getting retried. I think this is actually OK -- the existing code could cause an OOM on the remote end anyway, which obviously would fail a lot more. This failure behavior seems reasonable.
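[Editor's note] Since the javadoc above is terse, here is a hedged sketch of the callback shape a streaming RPC handler implies. The trait and names are simplified stand-ins for illustration, not the actual classes in org.apache.spark.network: the transport feeds chunks to a callback as frames arrive instead of buffering the whole body in memory, and a failure tears down the shared channel, which is why other in-flight RPCs on it fail too.

```scala
import java.nio.ByteBuffer

// Simplified stand-in for a streaming RPC callback: data arrives incrementally
// rather than as one fully-buffered message.
trait StreamDataCallback {
  def onData(chunk: ByteBuffer): Unit   // may be called many times per stream
  def onComplete(): Unit                // stream fully received
  def onFailure(cause: Throwable): Unit // transport would close the channel here
}

object StreamSketch {
  def main(args: Array[String]): Unit = {
    var received = 0L
    val cb = new StreamDataCallback {
      def onData(chunk: ByteBuffer): Unit = received += chunk.remaining()
      def onComplete(): Unit = println(s"stream done, $received bytes")
      def onFailure(cause: Throwable): Unit = println(s"channel failed: $cause")
    }
    // Simulate the transport delivering two frames, then completing the stream.
    cb.onData(ByteBuffer.allocate(1024))
    cb.onData(ByteBuffer.allocate(2048))
    cb.onComplete()
  }
}
```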
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Merged build finished. Test PASSed.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3715/ Test PASSed.
[GitHub] spark pull request #21400: [SPARK-24351][SS]offsetLog/commitLog purge thresh...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/21400#discussion_r191980682

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala ---
@@ -34,7 +34,8 @@ class ContinuousSuiteBase extends StreamTest {
     new SparkContext(
       "local[10]",
       "continuous-stream-test-sql-context",
-      sparkConf.set("spark.sql.testkey", "true")))
+      sparkConf.set("spark.sql.testkey", "true")
+        .set("spark.sql.streaming.minBatchesToRetain", "2")))
--- End diff --

@jose-torres Thanks for pointing this out. I will make a separate suite for clarity and isolation.
[GitHub] spark issue #21445: [SPARK-24404][SS] Increase currentEpoch when meet a Epoc...
Github user advancedxy commented on the issue: https://github.com/apache/spark/pull/21445

> I think the best way to do it is to make the shuffle writer responsible for incrementing the epoch within its task, the same way the data source writer does currently.

Yeah, @LiangchangZ please consider this approach. The writer part of a task is responsible for pulling data from upstream. It's more consistent and wouldn't break existing logic.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21010 **[Test build #91323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91323/testReport)** for PR 21010 at commit [`9ccb648`](https://github.com/apache/spark/commit/9ccb6488f6f8309e0cfa71c4b332e6d680f24ffa).
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 LGTM pending Jenkins.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 Jenkins, retest this please.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 Thanks, sounds good. Let me retrigger the build just to double-check.
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191979425 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/UploadStream.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.protocol; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import com.google.common.base.Objects; +import io.netty.buffer.ByteBuf; + +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.buffer.NettyManagedBuffer; + +/** + * An RPC with data that is sent outside of the frame, so it can be read as a stream. + */ +public final class UploadStream extends AbstractMessage implements RequestMessage { + /** Used to link an RPC request with its response. */ + public final long requestId; + public final ManagedBuffer meta; + public final long bodyByteCount; + + public UploadStream(long requestId, ManagedBuffer meta, ManagedBuffer body) { +super(body, false); // body is *not* included in the frame +this.requestId = requestId; +this.meta = meta; +bodyByteCount = body.size(); + } + + // this version is called when decoding the bytes on the receiving end. The body is handled + // separately. + private UploadStream(long requestId, ManagedBuffer meta, long bodyByteCount) { +super(null, false); +this.requestId = requestId; +this.meta = meta; +this.bodyByteCount = bodyByteCount; + } + + @Override + public Type type() { return Type.UploadStream; } + + @Override + public int encodedLength() { +// the requestId, meta size, meta and bodyByteCount (body is not included) +return 8 + 4 + ((int) meta.size()) + 8; + } + + @Override + public void encode(ByteBuf buf) { +buf.writeLong(requestId); +try { + ByteBuffer metaBuf = meta.nioByteBuffer(); + buf.writeInt(metaBuf.remaining()); + buf.writeBytes(metaBuf); +} catch (IOException io) { + throw new RuntimeException(io); +} +buf.writeLong(bodyByteCount); + } + + public static UploadStream decode(ByteBuf buf) { +long requestId = buf.readLong(); +int metaSize = buf.readInt(); +ManagedBuffer meta = new NettyManagedBuffer(buf.readRetainedSlice(metaSize)); +long bodyByteCount = buf.readLong(); +// This is called by the frame decoder, so the data is still null. We need a StreamInterceptor +// to read the data. 
+return new UploadStream(requestId, meta, bodyByteCount); + } + + @Override + public int hashCode() { +return Objects.hashCode(requestId, body()); + } + + @Override + public boolean equals(Object other) { +if (other instanceof UploadStream) { + UploadStream o = (UploadStream) other; + return requestId == o.requestId && super.equals(o); +} +return false; + } + + @Override + public String toString() { +return Objects.toStringHelper(this) + .add("requestId", requestId) + .add("body", body()) --- End diff -- To be honest, this was also just parroted from other classes -- looking now at implementations of ManagedBuffer, those that have a `toString()` do something reasonable. Is that actually useful for debugging? Maybe not; I don't think I ever actually looked at this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
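For readers tracing the wire format: pulling together the `encodedLength()` and `encode()` methods quoted above, the frame layout for `UploadStream` can be summarized as below. This is only a restatement of the quoted code (the body deliberately travels outside the frame), not new protocol detail:

```java
// Frame layout implied by encode()/encodedLength() in the diff above.
// The body itself is *not* in the frame; it is read separately as a stream
// by a StreamInterceptor on the receiving side.
//
//   | requestId | metaSize |      meta      | bodyByteCount |
//   |  8 bytes  | 4 bytes  | metaSize bytes |    8 bytes    |
//
// Hence encodedLength() returns 8 + 4 + (int) meta.size() + 8.
```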
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191979019 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/UploadStream.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.protocol; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import com.google.common.base.Objects; +import io.netty.buffer.ByteBuf; + +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.buffer.NettyManagedBuffer; + +/** + * An RPC with data that is sent outside of the frame, so it can be read as a stream. + */ +public final class UploadStream extends AbstractMessage implements RequestMessage { + /** Used to link an RPC request with its response. */ + public final long requestId; + public final ManagedBuffer meta; + public final long bodyByteCount; + + public UploadStream(long requestId, ManagedBuffer meta, ManagedBuffer body) { +super(body, false); // body is *not* included in the frame +this.requestId = requestId; +this.meta = meta; +bodyByteCount = body.size(); + } + + // this version is called when decoding the bytes on the receiving end. The body is handled + // separately. + private UploadStream(long requestId, ManagedBuffer meta, long bodyByteCount) { +super(null, false); +this.requestId = requestId; +this.meta = meta; +this.bodyByteCount = bodyByteCount; + } + + @Override + public Type type() { return Type.UploadStream; } + + @Override + public int encodedLength() { +// the requestId, meta size, meta and bodyByteCount (body is not included) +return 8 + 4 + ((int) meta.size()) + 8; + } + + @Override + public void encode(ByteBuf buf) { +buf.writeLong(requestId); +try { + ByteBuffer metaBuf = meta.nioByteBuffer(); + buf.writeInt(metaBuf.remaining()); + buf.writeBytes(metaBuf); +} catch (IOException io) { + throw new RuntimeException(io); +} +buf.writeLong(bodyByteCount); + } + + public static UploadStream decode(ByteBuf buf) { +long requestId = buf.readLong(); +int metaSize = buf.readInt(); +ManagedBuffer meta = new NettyManagedBuffer(buf.readRetainedSlice(metaSize)); +long bodyByteCount = buf.readLong(); +// This is called by the frame decoder, so the data is still null. We need a StreamInterceptor +// to read the data. +return new UploadStream(requestId, meta, bodyByteCount); + } + + @Override + public int hashCode() { +return Objects.hashCode(requestId, body()); --- End diff -- this is a good point. Admittedly I just copied this from `StreamResponse` without thinking about it too much -- that class exhibits the same issue. I'll remove `body` from both. 
(In practice, we're not sticking them in hashmaps right now, so there wouldn't be any bugs in behavior because of this.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
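A minimal sketch of the fix squito describes (basing `hashCode` and `equals` only on the stable `requestId`, since `ManagedBuffer` implementations generally do not override `equals`/`hashCode` and so compare by identity) might look like the following; this is an illustration of the discussion, not the merged patch:

```java
// Illustrative sketch: identify an UploadStream request by its requestId only.
// Including body() would make logically equal messages hash differently,
// which would break any future hash-based lookup of these messages.
@Override
public int hashCode() {
  return Long.hashCode(requestId);
}

@Override
public boolean equals(Object other) {
  if (other instanceof UploadStream) {
    return requestId == ((UploadStream) other).requestId;
  }
  return false;
}
```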
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191978545 --- Diff: common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java --- @@ -141,26 +141,14 @@ public void fetchChunk( StreamChunkId streamChunkId = new StreamChunkId(streamId, chunkIndex); handler.addFetchRequest(streamChunkId, callback); -channel.writeAndFlush(new ChunkFetchRequest(streamChunkId)).addListener(future -> { --- End diff -- Yes, exactly. Marcelo asked for this refactoring in his review -- there was already a ton of copy-paste, and instead of adding more it made sense to refactor. There shouldn't be any behavior change (there are minor differences that shouldn't matter: `channel.close()` now happens before the more specific cleanup operations, whereas it was in the middle previously, and the `try` encompasses a bit more than before.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
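The PR's summary later in this thread names a `StdChannelListener` for exactly this purpose. A hedged sketch of the shape of that refactor, assuming it lives inside `TransportClient` with access to its `channel` and `logger` fields; the details here are illustrative, not a verified excerpt:

```java
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.GenericFutureListener;

// One shared listener replaces the copy-pasted per-request lambdas: it logs
// success timing, and on failure logs the error, closes the channel, and
// delegates to a message-specific cleanup hook that subclasses override.
private class StdChannelListener implements GenericFutureListener<Future<? super Void>> {
  private final long startTime = System.currentTimeMillis();
  private final Object requestId;

  StdChannelListener(Object requestId) {
    this.requestId = requestId;
  }

  @Override
  public void operationComplete(Future<? super Void> future) {
    if (future.isSuccess()) {
      logger.trace("Sending request {} took {} ms", requestId,
        System.currentTimeMillis() - startTime);
    } else {
      String errorMsg = String.format("Failed to send request %s: %s",
        requestId, future.cause());
      logger.error(errorMsg, future.cause());
      channel.close();
      handleFailure(errorMsg, future.cause());
    }
  }

  // Overridden per message type, e.g. to remove the entry from a handler map.
  void handleFailure(String errorMsg, Throwable cause) { }
}
```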
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191978140 --- Diff: common/network-common/src/main/java/org/apache/spark/network/client/StreamInterceptor.java --- @@ -50,16 +52,22 @@ @Override public void exceptionCaught(Throwable cause) throws Exception { -handler.deactivateStream(); +deactivateStream(); callback.onFailure(streamId, cause); } @Override public void channelInactive() throws Exception { -handler.deactivateStream(); +deactivateStream(); callback.onFailure(streamId, new ClosedChannelException()); } + private void deactivateStream() { +if (handler instanceof TransportResponseHandler) { --- End diff -- The only purpose of `TransportResponseHandler.deactivateStream()` is to include the stream request in the count for `numOutstandingRequests` (it's not doing any critical cleanup). I will include a comment here explaining that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
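The comment squito promises might read roughly like this hedged sketch; the wording is illustrative, not the committed text:

```java
private void deactivateStream() {
  if (handler instanceof TransportResponseHandler) {
    // Stream requests are only tracked on the client (response-handler) side,
    // to keep numOutstandingRequests accurate; no critical cleanup happens
    // here, so other handler types can safely skip this call.
    ((TransportResponseHandler) handler).deactivateStream();
  }
}
```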
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191976952 --- Diff: common/network-common/src/main/java/org/apache/spark/network/server/StreamData.java --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.server; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import org.apache.spark.network.client.RpcResponseCallback; +import org.apache.spark.network.client.StreamCallback; +import org.apache.spark.network.client.StreamInterceptor; +import org.apache.spark.network.util.TransportFrameDecoder; + +/** + * A holder for streamed data sent along with an RPC message. + */ +public class StreamData { + + private final TransportRequestHandler handler; + private final TransportFrameDecoder frameDecoder; + private final RpcResponseCallback rpcCallback; + private final ByteBuffer meta; --- End diff -- whoops, you're right. I was using this at one point in the follow-on patch, then changed it and didn't fully clean this up. thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91314/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #91314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91314/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #4191 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4191/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21447: [SPARK-24339][SQL]Add project for transform/map/reduce s...
Github user xdcjie commented on the issue: https://github.com/apache/spark/pull/21447 @maropu @gatorsmile Do you have any comments or suggestions for this PR? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91315/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91315/testReport)** for PR 21454 at commit [`c1b4d16`](https://github.com/apache/spark/commit/c1b4d1670fe1bb5c2b9b0d66c14e3da26627d29e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21453 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21453 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91322/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21453 **[Test build #91322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91322/testReport)** for PR 21453 at commit [`90d3842`](https://github.com/apache/spark/commit/90d3842616aec94d603a68d44463eb043c5a66f9). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21453 **[Test build #91322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91322/testReport)** for PR 21453 at commit [`90d3842`](https://github.com/apache/spark/commit/90d3842616aec94d603a68d44463eb043c5a66f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21456 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specified f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 > Basically LGTM, but I'm wondering what if the expr2 is not like a format string? It behaves the same as Hive:
```sql
spark-sql> SELECT format_number(12332.123456, 'abc');
abc12332
```
```sql
hive> SELECT format_number(12332.123456, 'abc');
OK
abc12332
Time taken: 0.218 seconds, Fetched: 1 row(s)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
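The shared behavior is consistent with both engines handing the second argument to `java.text.DecimalFormat` (an assumption inferred from the outputs above, where 'abc' acts as a literal prefix with no fraction digits); a small Java check:

```java
import java.text.DecimalFormat;

// Sketch: why format_number(12332.123456, 'abc') prints "abc12332".
// In a DecimalFormat pattern, plain letters form a literal prefix, and with
// no digit placeholders the value is rounded to zero fraction digits.
public class FormatNumberDemo {
  public static void main(String[] args) {
    System.out.println(new DecimalFormat("abc").format(12332.123456)); // abc12332
  }
}
```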
[GitHub] spark pull request #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocal...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21437#discussion_r191969840 --- Diff: python/pyspark/taskcontext.py --- @@ -88,3 +89,9 @@ def taskAttemptId(self): TaskAttemptID. """ return self._taskAttemptId + +def getLocalProperty(self, key): +""" +Get a local property set upstream in the driver, or None if it is missing. --- End diff -- Makes sense, but +1 for leaving it out, since I don't know how commonly it will be used for now either. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
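For context, the proposed PySpark method mirrors the JVM-side `TaskContext.getLocalProperty`; a minimal Java illustration of the round trip, as a hedged sketch assuming a running `JavaSparkContext sc`:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.TaskContext;

// Set a local property on the driver, then read it inside a task.
// getLocalProperty returns null if the key was never set upstream.
sc.setLocalProperty("myKey", "myValue");
List<String> seen = sc.parallelize(Arrays.asList(1), 1)
    .map(x -> TaskContext.get().getLocalProperty("myKey"))
    .collect();
// seen.get(0) is "myValue"
```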
[GitHub] spark pull request #21378: [SPARK-24326][Mesos] add support for local:// sch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21378#discussion_r191969654 --- Diff: docs/running-on-mesos.md --- @@ -753,6 +753,16 @@ See the [configuration page](configuration.html) for information on Spark config spark.cores.max is reached ++ --- End diff -- +1 for reverting the whitespace changes. Let's keep this minimised and targeted. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21378: [SPARK-24326][Mesos] add support for local:// scheme for...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21378 @felixcheung yup. I was just looking at this PR out of curiosity. I don't currently have an env to test Mesos. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21426: [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21426 @vanzin and @jerryshao, thank you so much. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91313/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #91313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91313/testReport)** for PR 21464 at commit [`aad159c`](https://github.com/apache/spark/commit/aad159c561094b53a719c8950fa087dacd1d9d8d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20697 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3581/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20697: [SPARK-23010][k8s] Initial checkin of k8s integra...
Github user ssuchter commented on a diff in the pull request: https://github.com/apache/spark/pull/20697#discussion_r191966223 --- Diff: resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh --- @@ -0,0 +1,91 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +TEST_ROOT_DIR=$(git rev-parse --show-toplevel) +UNPACKED_SPARK_TGZ="$TEST_ROOT_DIR/target/spark-dist-unpacked" +IMAGE_TAG_OUTPUT_FILE="$TEST_ROOT_DIR/target/image-tag.txt" +DEPLOY_MODE="minikube" +IMAGE_REPO="docker.io/kubespark" +IMAGE_TAG="N/A" +SPARK_TGZ="N/A" + +# Parse arguments +while (( "$#" )); do + case $1 in +--unpacked-spark-tgz) + UNPACKED_SPARK_TGZ="$2" + shift + ;; +--image-repo) + IMAGE_REPO="$2" + shift + ;; +--image-tag) + IMAGE_TAG="$2" + shift + ;; +--image-tag-output-file) + IMAGE_TAG_OUTPUT_FILE="$2" + shift + ;; +--deploy-mode) + DEPLOY_MODE="$2" + shift + ;; +--spark-tgz) + SPARK_TGZ="$2" + shift + ;; +*) + break + ;; + esac + shift +done + +if [[ $SPARK_TGZ == "N/A" ]]; +then + echo "Must specify a Spark tarball to build Docker images against with --spark-tgz." && exit 1; --- End diff -- Ok, it's important for me to be clear here. There are currently two PRBs. This will continue in the immediate future. 1. General Spark PRB, mainly for unit tests. This can run on all hosts. 2. K8s integration-specific PRB. This early-outs on many PRs that don't seem relevant. This is specifically for running K8s integration tests, and can only run on some hosts. Because of the host restriction issue, these are two separate PRBs. It is definitely true that each one of these will build the main Spark jars separately, so that 11 minute time will be spent twice. Since the K8s-integration PRB is only doing this on a small set of PRs, it's not a significant cost to the Jenkins infrastructure. Within the K8s-integration PRB, the entire maven reactor is only built once, during the make distribution step. The integration test step doesn't rebuild it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3714/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20697 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3581/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20697: [SPARK-23010][k8s] Initial checkin of k8s integra...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/20697#discussion_r191965147 --- Diff: resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh --- @@ -0,0 +1,91 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +TEST_ROOT_DIR=$(git rev-parse --show-toplevel) +UNPACKED_SPARK_TGZ="$TEST_ROOT_DIR/target/spark-dist-unpacked" +IMAGE_TAG_OUTPUT_FILE="$TEST_ROOT_DIR/target/image-tag.txt" +DEPLOY_MODE="minikube" +IMAGE_REPO="docker.io/kubespark" +IMAGE_TAG="N/A" +SPARK_TGZ="N/A" + +# Parse arguments +while (( "$#" )); do + case $1 in +--unpacked-spark-tgz) + UNPACKED_SPARK_TGZ="$2" + shift + ;; +--image-repo) + IMAGE_REPO="$2" + shift + ;; +--image-tag) + IMAGE_TAG="$2" + shift + ;; +--image-tag-output-file) + IMAGE_TAG_OUTPUT_FILE="$2" + shift + ;; +--deploy-mode) + DEPLOY_MODE="$2" + shift + ;; +--spark-tgz) + SPARK_TGZ="$2" + shift + ;; +*) + break + ;; + esac + shift +done + +if [[ $SPARK_TGZ == "N/A" ]]; +then + echo "Must specify a Spark tarball to build Docker images against with --spark-tgz." && exit 1; --- End diff -- Sorry, I mean if the Spark PRB will try to build the entire maven reactor twice - once for unit tests and once for integration tests. The TGZ bundling in and of itself I agree should be fast if the jars are already built by the maven reactor. But it's unclear to me if we'll end up building jars redundantly here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/21346 Summary of key changes (WIP; notes to self):
> Summary of changes:
>
> - Introduce a new `UploadStream` RPC which is sent to push a large payload as a stream (in contrast, the pre-existing `StreamRequest` and `StreamResponse` RPCs are used for pull-based streaming).
> - Generalize `RpcHandler.receive()` to support requests which contain streams.
> - Generalize `StreamInterceptor` to handle both request and response messages (previously it only handled responses).
> - Introduce `StdChannelListener` to abstract away common logging logic in `ChannelFuture` listeners.

Question: is this effectively dead code at this point? In other words, does this PR just add the lower-level pieces, with nothing currently using the new API? So this patch, as of now, has no behavior change, and the actual functional changes impacting queries and actual usage will come later, when we wire this up to the block replicator? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
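A hedged sketch of how the new push-based path might be exercised from a client; the `uploadStream` entry point and its callback shape are assumptions drawn from the summary above, not a verified excerpt of the API:

```java
import java.nio.ByteBuffer;
import org.apache.spark.network.buffer.NioManagedBuffer;
import org.apache.spark.network.client.RpcResponseCallback;
import org.apache.spark.network.client.TransportClient;

// Assumed usage (hedged): small metadata rides inside the frame, the large
// body is streamed outside it, and the server replies once it has consumed
// the stream.
void pushLargePayload(TransportClient client, byte[] metaBytes, byte[] bodyBytes) {
  NioManagedBuffer meta = new NioManagedBuffer(ByteBuffer.wrap(metaBytes));
  NioManagedBuffer body = new NioManagedBuffer(ByteBuffer.wrap(bodyBytes));
  client.uploadStream(meta, body, new RpcResponseCallback() {
    @Override public void onSuccess(ByteBuffer response) {
      // Server acknowledged after consuming the streamed body.
    }
    @Override public void onFailure(Throwable e) {
      // Upload failed; the body may have been only partially consumed.
    }
  });
}
```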
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191964329 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,9 @@ object MimaExcludes { // Exclude rules for 2.4.x lazy val v24excludes = v23excludes ++ Seq( +// [SPARK-6237][NETWORK] Network-layer changes to allow stream upload + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.network.netty.NettyBlockRpcServer.receive"), --- End diff -- I suspect that it's because we might want to access these across Java package boundaries, and Java doesn't have an equivalent of Scala's nested package-scoped `private[package]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91320/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org