[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21466 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3719/ Test PASSed.
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21466 Merged build finished. Test PASSed.
[GitHub] spark issue #21463: [SPARK-23754][BRANCH-2.3][PYTHON] Re-raising StopIterati...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21463 Let me leave a cc for you, @icexelloss or @viirya. It seems we should really clean this up in master and try it on the worker side - I need to take a closer look too. If you are all busy, I will take a look myself.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 I confirm that the -XX:+UseStringDeduplication option is available only with the G1 GC, and it is off by default. So if we decide to use it, I guess we won't be able to enforce it reliably, especially for applications that just use some library code from Spark (by the way, this issue was found in the YARN Node Manager). Another issue with string deduplication in G1 is that it's not aggressive at all. It's done by a single thread that scans the entire heap only when the other threads are lightly loaded. In the case that we investigated, the number of duplicate strings was high, yet the strings were relatively short-lived. That is, they survived long enough to create significant pressure on the GC, but I don't think that timeframe would be enough for the deduplication thread to eliminate them. In summary, this kind of targeted, explicit string deduplication is not uncommon at all, and it works really well. Usually you just need to add an .intern() call in a few places in the code. What I had to do here is more involved because of the extra problem with java.io.File.
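[Editor's note] A minimal, self-contained sketch of the targeted-interning idea described above (illustrative only, not the PR's actual change): `String.intern()` returns the canonical instance from the JVM string pool, so logically equal strings collapse to one object immediately, instead of waiting for G1's opportunistic background deduplication thread. The path literal below is a made-up example.

```scala
object InternSketch {
  def main(args: Array[String]): Unit = {
    // Simulate many logically-equal path strings arriving as distinct objects,
    // as happens when paths are repeatedly parsed from bytes or concatenated.
    val raw = Seq.fill(1000)(new String("usercache/app_0001/filecache/part-00000"))
    // Targeted deduplication: intern at the point of creation.
    val interned = raw.map(_.intern())
    println(raw(0) eq raw(1))           // false -- 1000 separate String objects on the heap
    println(interned(0) eq interned(1)) // true  -- all share one canonical String
  }
}
```

The tradeoff is a lookup in the string pool on each call, which is usually cheap compared to the GC pressure of keeping thousands of duplicate copies alive.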
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21466 **[Test build #91327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91327/testReport)** for PR 21466 at commit [`5a2b7c7`](https://github.com/apache/spark/commit/5a2b7c7c7fa5fc7ae669858c76a5dd43adfdf966).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001814

--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",

     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+    method to run the PrefixSpan algorithm.
+
+    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+    (Wikipedia)</a>
+    .. versionadded:: 2.4.0
+
+    """
+
+    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+                       "sequential pattern. Sequential pattern that appears more than " +
+                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+                       typeConverter=TypeConverters.toFloat)
+
+    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+                             "The maximal length of the sequential pattern. Must be > 0.",
+                             typeConverter=TypeConverters.toInt)
+
+    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+                               "The maximum number of items (including delimiters used in the " +
+                               "internal storage format) allowed in a projected database before " +
+                               "local processing. If a projected database exceeds this size, " +
+                               "another iteration of distributed prefix growth is run. " +
+                               "Must be > 0.",
+                               typeConverter=TypeConverters.toInt)
+
+    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+                        "dataset, rows with nulls in this column are ignored.",
+                        typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                 sequenceCol="sequence"):
+        """
+        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200, \
+                 sequenceCol="sequence")
+        """
+        super(PrefixSpan, self).__init__()
+        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                         sequenceCol="sequence")
+        kwargs = self._input_kwargs
+        self.setParams(**kwargs)
+
+    @keyword_only
+    @since("2.4.0")
+    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200,
+                  sequenceCol="sequence"):
+        """
+        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=3200, \
+                  sequenceCol="sequence")
+        """
+        kwargs = self._input_kwargs
+        return self._set(**kwargs)
+
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(self, dataset):
+        """
+        .. note:: Experimental
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
--- End diff --

There is no `Dataset` in PySpark.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001923

--- Diff: python/pyspark/ml/fpm.py ---
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
--- End diff --

ditto
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192002383

--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",

     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(dataset,
+                                       sequenceCol,
+                                       minSupport,
+                                       maxPatternLength,
+                                       maxLocalProjDBSize):
+        """
+        .. note:: Experimental
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
+        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+                            column are ignored.
+        :param minSupport: The minimal support level of the sequential pattern, any pattern that
+                           appears more than (minSupport * size-of-the-dataset) times will be
+                           output (recommended value: `0.1`).
+        :param maxPatternLength: The maximal length of the sequential pattern
+                                 (recommended value: `10`).
+        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+                                   internal storage format) allowed in a projected database before
+                                   local processing. If a projected database exceeds this size,
+                                   another iteration of distributed prefix growth is run
+                                   (recommended value: `3200`).
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
+                 - `freq: Long`
+
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

We should keep doctest examples simple to read. For example, including `maxLocalProjDBSize` is not useful because we don't expect users to tune this param often.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192000950

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -53,7 +53,7 @@ final class PrefixSpan(@Since("2.4.0") override val uid: String) extends Params
   @Since("2.4.0")
   val minSupport = new DoubleParam(this, "minSupport", "The minimal support level of the " +
     "sequential pattern. Sequential pattern that appears more than " +
-    "(minSupport * size-of-the-dataset)." +
+    "(minSupport * size-of-the-dataset)" +
--- End diff --

Need a space at the end before "times".
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192001886

--- Diff: python/pyspark/ml/fpm.py ---
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
--- End diff --

We should use a SQL type here.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192002416

--- Diff: python/pyspark/ml/fpm.py ---
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
+        ...                      Row(sequence=[[1], [3, 2], [1, 2]]),
+        ...                      Row(sequence=[[1, 2], [5]]),
+        ...                      Row(sequence=[[6]])]).toDF()
+        >>> prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5,
+        ...                         maxLocalProjDBSize=3200)
--- End diff --

remove this
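[Editor's note] To ground the doctest discussion above, here is a minimal sketch of the equivalent call against the Scala-side `org.apache.spark.ml.fpm.PrefixSpan` that this Python wrapper delegates to. It assumes the Spark 2.4 setter-style API; per the review feedback, `maxLocalProjDBSize` is left at its default rather than set explicitly.

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

object PrefixSpanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("prefixspan-sketch").getOrCreate()

    // Same data as the doctest: one row per sequence of itemsets.
    val df = spark.createDataFrame(Seq(
      Tuple1(Seq(Seq(1, 2), Seq(3))),
      Tuple1(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
      Tuple1(Seq(Seq(1, 2), Seq(5))),
      Tuple1(Seq(Seq(6))))).toDF("sequence")

    // Set only the params users typically touch; maxLocalProjDBSize keeps its default.
    new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .findFrequentSequentialPatterns(df)  // returns a DataFrame with `sequence` and `freq` columns
      .show()

    spark.stop()
  }
}
```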
[GitHub] spark issue #21466: [MINOR][YARN] Add YARN-specific credential providers in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21466 cc @jerryshao, not a big deal, but I thought it would be useful.
[GitHub] spark pull request #21466: [MINOR][YARN] Add YARN-specific credential provid...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/21466

[MINOR][YARN] Add YARN-specific credential providers in debug logging message

## What changes were proposed in this pull request?

This PR adds a debug log listing the YARN-specific credential providers loaded via the service-loader mechanism. It took me a while to figure out whether a provider was actually loaded or not; I had to explicitly set the deprecated configuration and check whether it was actually picked up.

## How was this patch tested?

Manually tested. The logs look like:

```
Using the following builtin delegation token providers: hadoopfs, hive, hbase.
Using the following YARN-specific credential providers: yarn-test.
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark minor-log

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21466.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21466

commit 5a2b7c7c7fa5fc7ae669858c76a5dd43adfdf966
Author: hyukjinkwon
Date: 2018-05-31T06:31:19Z

    Add YARN-specific credential providers in debug logging message
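[Editor's note] As background on the mechanism the new log line describes, here is a hedged, self-contained sketch of how the `java.util.ServiceLoader` pattern discovers such providers. The `CredentialProvider` trait and `serviceName` member below are illustrative stand-ins, not Spark's actual classes.

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Illustrative stand-in for the provider interface; implementations are
// registered via META-INF/services/<fully.qualified.InterfaceName> files.
trait CredentialProvider { def serviceName: String }

object ProviderLoaderSketch {
  def main(args: Array[String]): Unit = {
    // ServiceLoader scans the classpath for registered implementations.
    val providers = ServiceLoader.load(classOf[CredentialProvider]).asScala.toSeq
    // The kind of debug line this PR adds: without it, there is no easy way
    // to confirm whether a provider on the classpath was actually picked up.
    println("Using the following YARN-specific credential providers: " +
      providers.map(_.serviceName).mkString(", ") + ".")
  }
}
```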
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21381 Merged build finished. Test PASSed.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21381 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3718/ Test PASSed.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21381 **[Test build #91326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91326/testReport)** for PR 21381 at commit [`54b3b5f`](https://github.com/apache/spark/commit/54b3b5f1c973638d4c32d2b297c8b7c9ff72f28a).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192000596

--- Diff: python/pyspark/ml/fpm.py ---
+    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+                               "The maximum number of items (including delimiters used in the " +
+                               "internal storage format) allowed in a projected database before " +
+                               "local processing. If a projected database exceeds this size, " +
+                               "another iteration of distributed prefix growth is run. " +
+                               "Must be > 0.",
+                               typeConverter=TypeConverters.toInt)
--- End diff --

Just test that the Python 'int' type range is the same as the Java 'long' type.
[GitHub] spark issue #21381: [SPARK-24330][SQL]Refactor ExecuteWriteTask and Use `whi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21381 retest this please
[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21333#discussion_r192000399

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }
   }

+  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
--- End diff --

Have we tested when all input RDDs are empty?
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/ Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Merged build finished. Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #4191 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4191/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3717/ Test PASSed.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Merged build finished. Test PASSed.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191996249

--- Diff: python/pyspark/ml/fpm.py ---
+                               typeConverter=TypeConverters.toInt)
--- End diff --

There isn't a `TypeConverters.toLong` -- do I need to add it? My idea is that `TypeConverters.toInt` also fits the Long type on the Python side, so I have not added one for now.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21265 **[Test build #91325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191995667

--- Diff: python/pyspark/ml/fpm.py ---
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

I think it is better to put this in an example. @mengxr, what do you think?
[GitHub] spark pull request #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing te...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21362#discussion_r191994811

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1504,15 +1504,16 @@ test_that("column functions", {
   expect_equal(result, "cba")

   # Test array_sort() and sort_array()
-  df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 6L, 5L, NA, 4L))))
+  df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 5L, NA, 4L))))
+  as_integer_lists <- function(x) lapply(x, lapply, as.integer)
-  result <- collect(select(df, array_sort(df[[1]])))[[1]]
-  expect_equal(result, list(list(1L, 2L, 3L, NA), list(4L, 5L, 6L, NA, NA)))
+  result <- as_integer_lists(collect(select(df, array_sort(df[[1]])))[[1]])
--- End diff --

+1
[GitHub] spark pull request #21454: [SPARK-24337][Core] Improve error messages for Sp...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21454
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user ifilonenko commented on the issue: https://github.com/apache/spark/pull/21092 Will resolve comments today @mccheah
[GitHub] spark issue #20276: [SPARK-14948][SQL] disambiguate attributes in join condi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20276 Hi, @cloud-fan. If this PR is still valid, could you resolve the conflicts?
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91318/ Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91318/testReport)** for PR 21454 at commit [`38ffa3e`](https://github.com/apache/spark/commit/38ffa3e551c7eee69d12cc736d33d137abd333b7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing te...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21362#discussion_r191986143

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
+  result <- as_integer_lists(collect(select(df, array_sort(df[[1]])))[[1]])
--- End diff --

`as_integer_lists` doesn't seem like the right approach - it basically coerces everything returned into integers (even though it might not be returned that way from the JVM)
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91319/ Test PASSed.
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed.
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 **[Test build #91319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91319/testReport)** for PR 21366 at commit [`260d82c`](https://github.com/apache/spark/commit/260d82ca9fbbd16ad8174d0dafa2f95bc177a219).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21362: [SPARK-24197][SparkR][FOLLOWUP] Fixing failing tests for...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21362 ok, let us know if you have more information on this
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21453 I filed an issue with the Scala community about the interface changes, and they said those REPL APIs are intended to be private: https://github.com/scala/bug/issues/10913 That being said, they gave us a couple of ways to work around it, and I'm testing them now.
[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21455 Simply making these fields publicly accessible seems a little weird from Spark's side. Maybe we can use reflection instead.
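[Editor's note] A minimal sketch of the reflection alternative being suggested. The `Holder`/`offsets` names below are hypothetical stand-ins; the actual KafkaStream field names would differ.

```scala
object ReflectionSketch {
  // Hypothetical stand-in for a class whose private field an external caller
  // wants to read without widening its visibility.
  private class Holder { private val offsets: Map[Int, Long] = Map(0 -> 42L) }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    val field = h.getClass.getDeclaredField("offsets")
    field.setAccessible(true)     // bypass `private` for this one read
    println(field.get(h))         // Map(0 -> 42) -- no change to Holder's API needed
  }
}
```

The usual tradeoff applies: reflection keeps the upstream API surface unchanged, but silently breaks if the private field is ever renamed, so it belongs behind a small, well-tested accessor.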
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21454 LGTM Thanks! Merged to master
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21346 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3716/ Test PASSed.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21346 Merged build finished. Test PASSed.
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21453 Spark currently cannot be built against Scala 2.11.12, due to some method changes in the REPL. We have tried it internally.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21346 **[Test build #91324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91324/testReport)** for PR 21346 at commit [`83c3271`](https://github.com/apache/spark/commit/83c3271d2f45bbef18d865bddbc6807e9fbd2503).
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test PASSed.
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91316/ Test PASSed.
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21456 From my understanding, that option is only available with G1GC, which is not really a good fit for Spark (I forget the exact details, but something about humongous allocations, which are common with all the large byte buffers typical of Spark).
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91316/testReport)** for PR 21454 at commit [`b7ff38f`](https://github.com/apache/spark/commit/b7ff38f16c91b7df326f49d1f821b14a6dc82e8d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21346

> is this effectively dead code at this point?

Yes, that's right. This PR by itself is not useful. It's a step towards https://github.com/apache/spark/pull/21451 This is a good point to put in the PR summary -- I'll do that, and also your summary notes above, if you don't mind.

> what are the major risks of this change in terms of introducing performance or correctness issues? If we identify risks (e.g. "this is a historically tricky area of code?"), can we mitigate those risks through correctness testing / load testing?

I've made an effort to make minimal modifications to all existing code paths, to minimize the risk of introducing bugs in current functionality. My intention is to turn it on by default initially only for cases we know would fail with the old code -- when the data is > 2GB ([SPARK-24297](https://issues.apache.org/jira/browse/SPARK-24297)). I've added unit tests and shared the test I'm doing on a cluster just to find holes in functionality (posted on the parent jira here: https://issues.apache.org/jira/browse/SPARK-6235?focusedCommentId=16484069&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16484069). I have not done load testing yet but plan to. Extra testing, of course, would certainly be good.
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191981552

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java ---
@@ -38,15 +38,24 @@
    *
    * This method will not be called in parallel for a single TransportClient (i.e., channel).
    *
+   * The rpc *might* include a data stream in streamData (e.g. for uploading a large
+   * amount of data which should not be buffered in memory here). Any errors while handling the
+   * streamData will lead to failing this entire connection -- all other in-flight rpcs will fail.
--- End diff --

Pretty good question actually :) I will take a closer look at this myself, but I believe this connection is shared by other tasks running on the same executor which are trying to talk to the same destination. So that might mean another task which is replicating to the same destination, or reading data from that same remote executor. Those don't have specific retry behavior for a closed connection -- that might result in the data just not getting replicated, fetching the data from elsewhere, or the task getting retried. I think this is actually OK -- the existing code could cause an OOM on the remote end anyway, which obviously would fail a lot more. This failure behavior seems reasonable.
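[Editor's note] Since the javadoc above is terse, here is a hedged sketch of the callback shape a streaming RPC handler implies. The trait and names are simplified stand-ins for illustration, not the actual classes in org.apache.spark.network: the transport feeds chunks to a callback as frames arrive instead of buffering the whole body in memory, and a failure tears down the shared channel, which is why other in-flight RPCs on it fail too.

```scala
import java.nio.ByteBuffer

// Simplified stand-in for a streaming RPC callback: data arrives incrementally
// rather than as one fully-buffered message.
trait StreamDataCallback {
  def onData(chunk: ByteBuffer): Unit   // may be called many times per stream
  def onComplete(): Unit                // stream fully received
  def onFailure(cause: Throwable): Unit // transport would close the channel here
}

object StreamSketch {
  def main(args: Array[String]): Unit = {
    var received = 0L
    val cb = new StreamDataCallback {
      def onData(chunk: ByteBuffer): Unit = received += chunk.remaining()
      def onComplete(): Unit = println(s"stream done, $received bytes")
      def onFailure(cause: Throwable): Unit = println(s"channel failed: $cause")
    }
    // Simulate the transport delivering two frames, then completing the stream.
    cb.onData(ByteBuffer.allocate(1024))
    cb.onData(ByteBuffer.allocate(2048))
    cb.onComplete()
  }
}
```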
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Merged build finished. Test PASSed.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3715/ Test PASSed.
[GitHub] spark pull request #21400: [SPARK-24351][SS]offsetLog/commitLog purge thresh...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/21400#discussion_r191980682

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala ---
@@ -34,7 +34,8 @@ class ContinuousSuiteBase extends StreamTest {
     new SparkContext(
       "local[10]",
       "continuous-stream-test-sql-context",
-      sparkConf.set("spark.sql.testkey", "true")))
+      sparkConf.set("spark.sql.testkey", "true")
+        .set("spark.sql.streaming.minBatchesToRetain", "2")))
--- End diff --

@jose-torres Thanks for pointing this out. I will make a separate suite for clarity and isolation.
[GitHub] spark issue #21445: [SPARK-24404][SS] Increase currentEpoch when meet a Epoc...
Github user advancedxy commented on the issue: https://github.com/apache/spark/pull/21445

> I think the best way to do it is to make the shuffle writer responsible for incrementing the epoch within its task, the same way the data source writer does currently.

Yeah, @LiangchangZ please consider this approach. The writer part of a task is responsible for pulling data from upstream. It's more consistent and wouldn't break existing logic.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21010 **[Test build #91323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91323/testReport)** for PR 21010 at commit [`9ccb648`](https://github.com/apache/spark/commit/9ccb6488f6f8309e0cfa71c4b332e6d680f24ffa).
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 LGTM pending Jenkins.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 Jenkins, retest this please.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21010 Thanks, sounds good. Let me retrigger the build just to double-check.
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191979425 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/UploadStream.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.protocol; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import com.google.common.base.Objects; +import io.netty.buffer.ByteBuf; + +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.buffer.NettyManagedBuffer; + +/** + * An RPC with data that is sent outside of the frame, so it can be read as a stream. + */ +public final class UploadStream extends AbstractMessage implements RequestMessage { + /** Used to link an RPC request with its response. */ + public final long requestId; + public final ManagedBuffer meta; + public final long bodyByteCount; + + public UploadStream(long requestId, ManagedBuffer meta, ManagedBuffer body) { +super(body, false); // body is *not* included in the frame +this.requestId = requestId; +this.meta = meta; +bodyByteCount = body.size(); + } + + // this version is called when decoding the bytes on the receiving end. The body is handled + // separately. + private UploadStream(long requestId, ManagedBuffer meta, long bodyByteCount) { +super(null, false); +this.requestId = requestId; +this.meta = meta; +this.bodyByteCount = bodyByteCount; + } + + @Override + public Type type() { return Type.UploadStream; } + + @Override + public int encodedLength() { +// the requestId, meta size, meta and bodyByteCount (body is not included) +return 8 + 4 + ((int) meta.size()) + 8; + } + + @Override + public void encode(ByteBuf buf) { +buf.writeLong(requestId); +try { + ByteBuffer metaBuf = meta.nioByteBuffer(); + buf.writeInt(metaBuf.remaining()); + buf.writeBytes(metaBuf); +} catch (IOException io) { + throw new RuntimeException(io); +} +buf.writeLong(bodyByteCount); + } + + public static UploadStream decode(ByteBuf buf) { +long requestId = buf.readLong(); +int metaSize = buf.readInt(); +ManagedBuffer meta = new NettyManagedBuffer(buf.readRetainedSlice(metaSize)); +long bodyByteCount = buf.readLong(); +// This is called by the frame decoder, so the data is still null. We need a StreamInterceptor +// to read the data. 
+return new UploadStream(requestId, meta, bodyByteCount); + } + + @Override + public int hashCode() { +return Objects.hashCode(requestId, body()); + } + + @Override + public boolean equals(Object other) { +if (other instanceof UploadStream) { + UploadStream o = (UploadStream) other; + return requestId == o.requestId && super.equals(o); +} +return false; + } + + @Override + public String toString() { +return Objects.toStringHelper(this) + .add("requestId", requestId) + .add("body", body()) --- End diff -- To be honest, this was also just parroted from other classes -- looking now at implementations of ManagedBuffer, those that have a `toString()` do something reasonable. Is that actually useful for debugging? Maybe not; I don't think I ever actually looked at this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
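For readers tracing the wire format: pulling together the `encodedLength()` and `encode()` methods quoted above, the frame layout for `UploadStream` can be summarized as below. This is only a restatement of the quoted code (the body deliberately travels outside the frame), not new protocol detail:

```java
// Frame layout implied by encode()/encodedLength() in the diff above.
// The body itself is *not* in the frame; it is read separately as a stream
// by a StreamInterceptor on the receiving side.
//
//   | requestId | metaSize |      meta      | bodyByteCount |
//   |  8 bytes  | 4 bytes  | metaSize bytes |    8 bytes    |
//
// Hence encodedLength() returns 8 + 4 + (int) meta.size() + 8.
```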
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191979019 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/UploadStream.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.protocol; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import com.google.common.base.Objects; +import io.netty.buffer.ByteBuf; + +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.buffer.NettyManagedBuffer; + +/** + * An RPC with data that is sent outside of the frame, so it can be read as a stream. + */ +public final class UploadStream extends AbstractMessage implements RequestMessage { + /** Used to link an RPC request with its response. */ + public final long requestId; + public final ManagedBuffer meta; + public final long bodyByteCount; + + public UploadStream(long requestId, ManagedBuffer meta, ManagedBuffer body) { +super(body, false); // body is *not* included in the frame +this.requestId = requestId; +this.meta = meta; +bodyByteCount = body.size(); + } + + // this version is called when decoding the bytes on the receiving end. The body is handled + // separately. + private UploadStream(long requestId, ManagedBuffer meta, long bodyByteCount) { +super(null, false); +this.requestId = requestId; +this.meta = meta; +this.bodyByteCount = bodyByteCount; + } + + @Override + public Type type() { return Type.UploadStream; } + + @Override + public int encodedLength() { +// the requestId, meta size, meta and bodyByteCount (body is not included) +return 8 + 4 + ((int) meta.size()) + 8; + } + + @Override + public void encode(ByteBuf buf) { +buf.writeLong(requestId); +try { + ByteBuffer metaBuf = meta.nioByteBuffer(); + buf.writeInt(metaBuf.remaining()); + buf.writeBytes(metaBuf); +} catch (IOException io) { + throw new RuntimeException(io); +} +buf.writeLong(bodyByteCount); + } + + public static UploadStream decode(ByteBuf buf) { +long requestId = buf.readLong(); +int metaSize = buf.readInt(); +ManagedBuffer meta = new NettyManagedBuffer(buf.readRetainedSlice(metaSize)); +long bodyByteCount = buf.readLong(); +// This is called by the frame decoder, so the data is still null. We need a StreamInterceptor +// to read the data. +return new UploadStream(requestId, meta, bodyByteCount); + } + + @Override + public int hashCode() { +return Objects.hashCode(requestId, body()); --- End diff -- this is a good point. Admittedly I just copied this from `StreamResponse` without thinking about it too much -- that class exhibits the same issue. I'll remove `body` from both. 
(In practice, we're not sticking them in hashmaps right now, so there wouldn't be any bugs in behavior because of this.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
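A minimal sketch of the fix squito describes (basing `hashCode` and `equals` only on the stable `requestId`, since `ManagedBuffer` implementations generally do not override `equals`/`hashCode` and so compare by identity) might look like the following; this is an illustration of the discussion, not the merged patch:

```java
// Illustrative sketch: identify an UploadStream request by its requestId only.
// Including body() would make logically equal messages hash differently,
// which would break any future hash-based lookup of these messages.
@Override
public int hashCode() {
  return Long.hashCode(requestId);
}

@Override
public boolean equals(Object other) {
  if (other instanceof UploadStream) {
    return requestId == ((UploadStream) other).requestId;
  }
  return false;
}
```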
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191978545 --- Diff: common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java --- @@ -141,26 +141,14 @@ public void fetchChunk( StreamChunkId streamChunkId = new StreamChunkId(streamId, chunkIndex); handler.addFetchRequest(streamChunkId, callback); -channel.writeAndFlush(new ChunkFetchRequest(streamChunkId)).addListener(future -> { --- End diff -- Yes, exactly. Marcelo asked for this refactoring in his review -- there was already a ton of copy-paste, and instead of adding more it made sense to refactor. There shouldn't be any behavior change (there are minor differences that shouldn't matter: `channel.close()` now happens before the more specific cleanup operations, whereas it was in the middle previously, and the `try` encompasses a bit more than before.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
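The PR's summary later in this thread names a `StdChannelListener` for exactly this purpose. A hedged sketch of the shape of that refactor, assuming it lives inside `TransportClient` with access to its `channel` and `logger` fields; the details here are illustrative, not a verified excerpt:

```java
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.GenericFutureListener;

// One shared listener replaces the copy-pasted per-request lambdas: it logs
// success timing, and on failure logs the error, closes the channel, and
// delegates to a message-specific cleanup hook that subclasses override.
private class StdChannelListener implements GenericFutureListener<Future<? super Void>> {
  private final long startTime = System.currentTimeMillis();
  private final Object requestId;

  StdChannelListener(Object requestId) {
    this.requestId = requestId;
  }

  @Override
  public void operationComplete(Future<? super Void> future) {
    if (future.isSuccess()) {
      logger.trace("Sending request {} took {} ms", requestId,
        System.currentTimeMillis() - startTime);
    } else {
      String errorMsg = String.format("Failed to send request %s: %s",
        requestId, future.cause());
      logger.error(errorMsg, future.cause());
      channel.close();
      handleFailure(errorMsg, future.cause());
    }
  }

  // Overridden per message type, e.g. to remove the entry from a handler map.
  void handleFailure(String errorMsg, Throwable cause) { }
}
```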
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191978140 --- Diff: common/network-common/src/main/java/org/apache/spark/network/client/StreamInterceptor.java --- @@ -50,16 +52,22 @@ @Override public void exceptionCaught(Throwable cause) throws Exception { -handler.deactivateStream(); +deactivateStream(); callback.onFailure(streamId, cause); } @Override public void channelInactive() throws Exception { -handler.deactivateStream(); +deactivateStream(); callback.onFailure(streamId, new ClosedChannelException()); } + private void deactivateStream() { +if (handler instanceof TransportResponseHandler) { --- End diff -- The only purpose of `TransportResponseHandler.deactivateStream()` is to include the stream request in the count for `numOutstandingRequests` (it's not doing any critical cleanup). I will include a comment here explaining that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
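The comment squito promises might read roughly like this hedged sketch; the wording is illustrative, not the committed text:

```java
private void deactivateStream() {
  if (handler instanceof TransportResponseHandler) {
    // Stream requests are only tracked on the client (response-handler) side,
    // to keep numOutstandingRequests accurate; no critical cleanup happens
    // here, so other handler types can safely skip this call.
    ((TransportResponseHandler) handler).deactivateStream();
  }
}
```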
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191976952 --- Diff: common/network-common/src/main/java/org/apache/spark/network/server/StreamData.java --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.server; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import org.apache.spark.network.client.RpcResponseCallback; +import org.apache.spark.network.client.StreamCallback; +import org.apache.spark.network.client.StreamInterceptor; +import org.apache.spark.network.util.TransportFrameDecoder; + +/** + * A holder for streamed data sent along with an RPC message. + */ +public class StreamData { + + private final TransportRequestHandler handler; + private final TransportFrameDecoder frameDecoder; + private final RpcResponseCallback rpcCallback; + private final ByteBuffer meta; --- End diff -- whoops, you're right. I was using this at one point in the follow-on patch, then changed it and didn't fully clean this up. thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91314/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #91314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91314/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #4191 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4191/testReport)** for PR 21464 at commit [`90c9ddc`](https://github.com/apache/spark/commit/90c9ddca2ecb458ccde2945ab67548403c3b4256). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21447: [SPARK-24339][SQL]Add project for transform/map/reduce s...
Github user xdcjie commented on the issue: https://github.com/apache/spark/pull/21447 @maropu @gatorsmile Do you have any comments or suggestions for this PR? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91315/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21454 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21454 **[Test build #91315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91315/testReport)** for PR 21454 at commit [`c1b4d16`](https://github.com/apache/spark/commit/c1b4d1670fe1bb5c2b9b0d66c14e3da26627d29e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21453 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21453 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91322/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21453 **[Test build #91322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91322/testReport)** for PR 21453 at commit [`90d3842`](https://github.com/apache/spark/commit/90d3842616aec94d603a68d44463eb043c5a66f9). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21453 **[Test build #91322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91322/testReport)** for PR 21453 at commit [`90d3842`](https://github.com/apache/spark/commit/90d3842616aec94d603a68d44463eb043c5a66f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21456 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specified f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 > Basically LGTM, but I'm wondering what if the expr2 is not like a format string? It behaves the same as Hive:
```sql
spark-sql> SELECT format_number(12332.123456, 'abc');
abc12332
```
```sql
hive> SELECT format_number(12332.123456, 'abc');
OK
abc12332
Time taken: 0.218 seconds, Fetched: 1 row(s)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
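The shared behavior is consistent with both engines handing the second argument to `java.text.DecimalFormat` (an assumption inferred from the outputs above, where 'abc' acts as a literal prefix with no fraction digits); a small Java check:

```java
import java.text.DecimalFormat;

// Sketch: why format_number(12332.123456, 'abc') prints "abc12332".
// In a DecimalFormat pattern, plain letters form a literal prefix, and with
// no digit placeholders the value is rounded to zero fraction digits.
public class FormatNumberDemo {
  public static void main(String[] args) {
    System.out.println(new DecimalFormat("abc").format(12332.123456)); // abc12332
  }
}
```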
[GitHub] spark pull request #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocal...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21437#discussion_r191969840 --- Diff: python/pyspark/taskcontext.py --- @@ -88,3 +89,9 @@ def taskAttemptId(self): TaskAttemptID. """ return self._taskAttemptId + +def getLocalProperty(self, key): +""" +Get a local property set upstream in the driver, or None if it is missing. --- End diff -- Makes sense, but +1 for leaving it out, since I don't know how commonly it will be used for now either. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
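For context, the proposed PySpark method mirrors the JVM-side `TaskContext.getLocalProperty`; a minimal Java illustration of the round trip, as a hedged sketch assuming a running `JavaSparkContext sc`:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.TaskContext;

// Set a local property on the driver, then read it inside a task.
// getLocalProperty returns null if the key was never set upstream.
sc.setLocalProperty("myKey", "myValue");
List<String> seen = sc.parallelize(Arrays.asList(1), 1)
    .map(x -> TaskContext.get().getLocalProperty("myKey"))
    .collect();
// seen.get(0) is "myValue"
```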
[GitHub] spark pull request #21378: [SPARK-24326][Mesos] add support for local:// sch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21378#discussion_r191969654 --- Diff: docs/running-on-mesos.md --- @@ -753,6 +753,16 @@ See the [configuration page](configuration.html) for information on Spark config spark.cores.max is reached ++ --- End diff -- +1 for reverting the whitespace changes. Let's keep this minimised and targeted. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21378: [SPARK-24326][Mesos] add support for local:// scheme for...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21378 @felixcheung yup. I was just looking at this PR out of curiosity. I don't currently have an env to test Mesos. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21426: [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21426 @vanzin and @jerryshao, thank you so much. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21464 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91313/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21464: [WEBUI] Avoid possibility of script in query param keys
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21464 **[Test build #91313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91313/testReport)** for PR 21464 at commit [`aad159c`](https://github.com/apache/spark/commit/aad159c561094b53a719c8950fa087dacd1d9d8d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20697 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3581/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20697: [SPARK-23010][k8s] Initial checkin of k8s integra...
Github user ssuchter commented on a diff in the pull request: https://github.com/apache/spark/pull/20697#discussion_r191966223 --- Diff: resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh --- @@ -0,0 +1,91 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +TEST_ROOT_DIR=$(git rev-parse --show-toplevel) +UNPACKED_SPARK_TGZ="$TEST_ROOT_DIR/target/spark-dist-unpacked" +IMAGE_TAG_OUTPUT_FILE="$TEST_ROOT_DIR/target/image-tag.txt" +DEPLOY_MODE="minikube" +IMAGE_REPO="docker.io/kubespark" +IMAGE_TAG="N/A" +SPARK_TGZ="N/A" + +# Parse arguments +while (( "$#" )); do + case $1 in +--unpacked-spark-tgz) + UNPACKED_SPARK_TGZ="$2" + shift + ;; +--image-repo) + IMAGE_REPO="$2" + shift + ;; +--image-tag) + IMAGE_TAG="$2" + shift + ;; +--image-tag-output-file) + IMAGE_TAG_OUTPUT_FILE="$2" + shift + ;; +--deploy-mode) + DEPLOY_MODE="$2" + shift + ;; +--spark-tgz) + SPARK_TGZ="$2" + shift + ;; +*) + break + ;; + esac + shift +done + +if [[ $SPARK_TGZ == "N/A" ]]; +then + echo "Must specify a Spark tarball to build Docker images against with --spark-tgz." && exit 1; --- End diff -- Ok, it's important for me to be clear here. There are currently two PRBs. This will continue in the immediate future. 1. General Spark PRB, mainly for unit tests. This can run on all hosts. 2. K8s integration-specific PRB. This early-outs on many PRs that don't seem relevant. This is specifically for running K8s integration tests, and can only run on some hosts. Because of the host restriction issue, these are two separate PRBs. It is definitely true that each one of these will build the main Spark jars separately, so that 11 minute time will be spent twice. Since the K8s-integration PRB is only doing this on a small set of PRs, it's not a significant cost to the Jenkins infrastructure. Within the K8s-integration PRB, the entire maven reactor is only built once, during the make distribution step. The integration test step doesn't rebuild it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3714/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20697 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3581/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20697: [SPARK-23010][k8s] Initial checkin of k8s integra...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/20697#discussion_r191965147 --- Diff: resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh --- @@ -0,0 +1,91 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +TEST_ROOT_DIR=$(git rev-parse --show-toplevel) +UNPACKED_SPARK_TGZ="$TEST_ROOT_DIR/target/spark-dist-unpacked" +IMAGE_TAG_OUTPUT_FILE="$TEST_ROOT_DIR/target/image-tag.txt" +DEPLOY_MODE="minikube" +IMAGE_REPO="docker.io/kubespark" +IMAGE_TAG="N/A" +SPARK_TGZ="N/A" + +# Parse arguments +while (( "$#" )); do + case $1 in +--unpacked-spark-tgz) + UNPACKED_SPARK_TGZ="$2" + shift + ;; +--image-repo) + IMAGE_REPO="$2" + shift + ;; +--image-tag) + IMAGE_TAG="$2" + shift + ;; +--image-tag-output-file) + IMAGE_TAG_OUTPUT_FILE="$2" + shift + ;; +--deploy-mode) + DEPLOY_MODE="$2" + shift + ;; +--spark-tgz) + SPARK_TGZ="$2" + shift + ;; +*) + break + ;; + esac + shift +done + +if [[ $SPARK_TGZ == "N/A" ]]; +then + echo "Must specify a Spark tarball to build Docker images against with --spark-tgz." && exit 1; --- End diff -- Sorry, I mean if the Spark PRB will try to build the entire maven reactor twice - once for unit tests and once for integration tests. The TGZ bundling in and of itself I agree should be fast if the jars are already built by the maven reactor. But it's unclear to me if we'll end up building jars redundantly here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/21346 Summary of key changes (WIP; notes to self):
> Summary of changes:
>
> - Introduce a new `UploadStream` RPC which is sent to push a large payload as a stream (in contrast, the pre-existing `StreamRequest` and `StreamResponse` RPCs are used for pull-based streaming).
> - Generalize `RpcHandler.receive()` to support requests which contain streams.
> - Generalize `StreamInterceptor` to handle both request and response messages (previously it only handled responses).
> - Introduce `StdChannelListener` to abstract away common logging logic in `ChannelFuture` listeners.

Question: is this effectively dead code at this point? In other words, does this PR just add the lower-level pieces, with nothing currently using the new API? So this patch, as of now, has no behavior change, and the actual functional changes impacting queries and actual usage will come later, when we wire this up to the block replicator? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
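A hedged sketch of how the new push-based path might be exercised from a client; the `uploadStream` entry point and its callback shape are assumptions drawn from the summary above, not a verified excerpt of the API:

```java
import java.nio.ByteBuffer;
import org.apache.spark.network.buffer.NioManagedBuffer;
import org.apache.spark.network.client.RpcResponseCallback;
import org.apache.spark.network.client.TransportClient;

// Assumed usage (hedged): small metadata rides inside the frame, the large
// body is streamed outside it, and the server replies once it has consumed
// the stream.
void pushLargePayload(TransportClient client, byte[] metaBytes, byte[] bodyBytes) {
  NioManagedBuffer meta = new NioManagedBuffer(ByteBuffer.wrap(metaBytes));
  NioManagedBuffer body = new NioManagedBuffer(ByteBuffer.wrap(bodyBytes));
  client.uploadStream(meta, body, new RpcResponseCallback() {
    @Override public void onSuccess(ByteBuffer response) {
      // Server acknowledged after consuming the streamed body.
    }
    @Override public void onFailure(Throwable e) {
      // Upload failed; the body may have been only partially consumed.
    }
  });
}
```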
[GitHub] spark pull request #21346: [SPARK-6237][NETWORK] Network-layer changes to al...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191964329 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,9 @@ object MimaExcludes { // Exclude rules for 2.4.x lazy val v24excludes = v23excludes ++ Seq( +// [SPARK-6237][NETWORK] Network-layer changes to allow stream upload + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.network.netty.NettyBlockRpcServer.receive"), --- End diff -- I suspect that it's because we might want to access these across Java package boundaries, and Java doesn't have an equivalent of Scala's nested package-scoped `private[package]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20697 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91320/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org