[GitHub] spark issue #21465: [SPARK-24333][ML][PYTHON]Add fit with validation set to ...

2018-12-07 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21465 merged to master, thanks @huaxingao ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate ...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22273 retest this please

[GitHub] spark issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate ...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22273 retest this please

[GitHub] spark issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate ...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22273 retest this please

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239569077 --- Diff: python/pyspark/ml/param/shared.py --- @@ -814,3 +814,25 @@ def getDistanceMeasure(self): """

[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23236 > I suspect that might be because, as the resource usage is heavy, StreamingLogisticRegressionWithSGD's training speed on the input batch stream can't always catch up with the predict batch stream. So the mo

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send u...

2018-12-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 merged to master, thanks @holdenk @viirya and @felixcheung !

[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23236 True, the test is not that long under light resources. Locally, I saw a couple seconds difference with the changes I mentioned. The weird thing is the unmodified test completes after the 11th

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239245388 --- Diff: python/pyspark/ml/regression.py --- @@ -705,12 +710,59 @@ def getNumTrees(self): return self.getOrDefault(self.numTrees

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239240113 --- Diff: python/pyspark/ml/regression.py --- @@ -705,12 +710,59 @@ def getNumTrees(self): return self.getOrDefault(self.numTrees

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239243661 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239242316 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239243683 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239211515 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23236 Seems ok to me, but there are a few silly things with this test that might help also * why is the `stepSize` so low at 0.01? I think it would be fine at 0.1, but even conservatively

[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23203#discussion_r238868565 --- Diff: python/run-tests.py --- @@ -93,17 +93,18 @@ def run_individual_python_test(target_dir, test_name, pyspark_python): "py

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r238801573 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r238808440 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r238809338 --- Diff: python/pyspark/ml/classification.py --- @@ -1242,40 +1255,36 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r238801256 --- Diff: python/pyspark/ml/classification.py --- @@ -1174,9 +1165,31 @@ def trees(self): return [DecisionTreeClassificationModel(m) for m

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-12-04 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r238809091 --- Diff: python/pyspark/ml/regression.py --- @@ -650,19 +650,20 @@ def getFeatureSubsetStrategy(self): return self.getOrDefault

[GitHub] spark issue #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small clea...

2018-12-03 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23200 merged to master, thanks @HyukjinKwon

[GitHub] spark pull request #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Sma...

2018-12-03 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23200#discussion_r238454041 --- Diff: python/pyspark/mllib/tests/test_linalg.py --- @@ -22,33 +22,18 @@ from numpy import array, array_equal, zeros, arange, tile, ones, inf

[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

2018-11-30 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22954 > @BryanCutler BTW, do you know the rough expected timing for Arrow 0.12.0 release? I think we should be starting the release process soon, so maybe in a week or

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-11-27 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r236880776 --- Diff: python/pyspark/ml/regression.py --- @@ -705,12 +705,38 @@ def getNumTrees(self): return self.getOrDefault(self.numTrees

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-11-27 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r236881042 --- Diff: python/pyspark/ml/regression.py --- @@ -1030,9 +1056,9 @@ def featureImportances(self): @inherit_doc -class GBTRegressor

[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-21 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r235524421 --- Diff: python/pyspark/worker.py --- @@ -22,7 +22,12 @@ import os import sys import time -import resource +# 'resource

[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-21 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r235523238 --- Diff: docs/configuration.md --- @@ -189,7 +189,7 @@ of the most common options to set are: limited to this amount. If not set, Spark

[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...

2018-11-20 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r235128413 --- Diff: python/pyspark/ml/classification.py --- @@ -1176,8 +1176,8 @@ def trees(self): @inherit_doc class GBTClassifier(JavaEstimator

[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...

2018-11-18 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077 Oh, I think the PR title should be SPARK-26105 too

[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...

2018-11-18 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077 >BTW, Bryan, do you have some time to work on the has_numpy stuff Yup, I can do that

[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...

2018-11-18 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077 Oops, actually I think there is one more here https://github.com/apache/spark/blob/master/python/pyspark/testing/mllibutils.py#L20 Other than that, looks good

[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...

2018-11-16 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23063 cc @HyukjinKwon

[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...

2018-11-16 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23063 Dist by line count: ``` 348 ./test_algorithms.py 84 ./test_base.py 71 ./test_evaluation.py 314 ./test_feature.py 118 ./test_image.py 392 ./test_linalg.py 367

[GitHub] spark pull request #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests...

2018-11-16 Thread BryanCutler
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/23063 [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py file into smaller files ## What changes were proposed in this pull request? This PR breaks down the large ml/tests.py file

[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...

2018-11-15 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23056#discussion_r234093063 --- Diff: python/pyspark/testing/mllibutils.py --- @@ -0,0 +1,44 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-15 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23056 cc @HyukjinKwon @squito

[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-15 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23056 Dist by line count: ``` 313 ./test_algorithms.py 201 ./test_feature.py 642 ./test_linalg.py 197 ./test_stat.py 523 ./test_streaming_algorithms.py 115 ./test_util.py

[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...

2018-11-15 Thread BryanCutler
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/23056 [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py file into smaller files ## What changes were proposed in this pull request? This PR breaks down the large mllib/tests.py file

[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-15 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23034 > Also, @BryanCutler, I think we can talk about locations of testing/...util.py later when we finished to split the tests. Moving utils would probably cause less conflicts and should be g

[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23033 Looks like ML is using `QuietTest` also, so the import needs to be updated

[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-13 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r233209385 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,

[GitHub] spark pull request #23021: [SPARK-26032][PYTHON] Break large sql/tests.py fi...

2018-11-13 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/23021#discussion_r233183572 --- Diff: python/pyspark/testing/sqlutils.py --- @@ -0,0 +1,268 @@ +# --- End diff -- Maybe rename this file to `sql_testing_utils.py

[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

2018-11-09 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22954 I don't know R well enough to review that code, but the results look awesome! Nice work @HyukjinKwon!!

[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232425279 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,

[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232425031 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging

[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-09 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22913 Sounds good, thanks @javierluraschi !

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send u...

2018-11-08 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 ping @HyukjinKwon and @viirya to maybe take another look at the recent changes to make this cleaner, if you are able to. Thanks

[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...

2018-11-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r232145973 --- Diff: python/pyspark/sql/tests.py --- @@ -4923,6 +4923,28 @@ def test_timestamp_dst(self): self.assertPandasEqual(pdf

[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...

2018-11-06 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r231311398 --- Diff: python/pyspark/sql/tests.py --- @@ -4923,6 +4923,28 @@ def test_timestamp_dst(self): self.assertPandasEqual(pdf

[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-05 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22913 I'm a little against adding this because the Arrow Java Vectors used so far were done to match the internal data of Spark, to keep things simple and avoid lots of conversions on the Java side

[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230953015 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -71,6 +71,7 @@ object ArrowUtils { case d

[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...

2018-10-30 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r229522939 --- Diff: python/pyspark/sql/tests.py --- @@ -4923,6 +4923,28 @@ def test_timestamp_dst(self): self.assertPandasEqual(pdf

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-10-30 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 Apologies for the delay in circling back to this. I reorganized a little to simplify and expanded the comments to hopefully better describe the code. A quick summary of the changes: I

[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-29 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22871 Thanks @HyukjinKwon , looks good!

[GitHub] spark issue #22795: [SPARK-25798][PYTHON] Internally document type conversio...

2018-10-24 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22795 merged to master, thanks @HyukjinKwon !

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-24 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r227878740 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -63,7 +65,7 @@ private[spark] object PythonEvalType

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-24 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22795#discussion_r227875311 --- Diff: python/pyspark/sql/functions.py --- @@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22795#discussion_r227593281 --- Diff: python/pyspark/sql/functions.py --- @@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22795#discussion_r227592794 --- Diff: python/pyspark/sql/functions.py --- @@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r227582390 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -63,7 +65,7 @@ private[spark] object PythonEvalType

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r227579686 --- Diff: python/pyspark/sql/tests.py --- @@ -6323,6 +6333,33 @@ def ordered_window(self): def unpartitioned_window(self): return

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r227579436 --- Diff: python/pyspark/sql/tests.py --- @@ -6481,12 +6516,116 @@ def test_invalid_args(self): foo_udf = pandas_udf(lambda x: x

[GitHub] spark pull request #22807: [WIP][SPARK-25811][PySpark] Raise a proper error ...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22807#discussion_r227571481 --- Diff: python/pyspark/serializers.py --- @@ -248,7 +248,14 @@ def create_array(s, t): # TODO: see ARROW-2432. Remove when

[GitHub] spark pull request #22807: [WIP][SPARK-25811][PySpark] Raise a proper error ...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22807#discussion_r227572701 --- Diff: python/pyspark/sql/tests.py --- @@ -4961,6 +4961,31 @@ def foofoo(x, y): ).collect ) +def

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22795#discussion_r227093067 --- Diff: python/pyspark/sql/functions.py --- @@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None

[GitHub] spark issue #22655: [SPARK-25666][PYTHON] Internally document type conversio...

2018-10-16 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22655 Thanks @viirya !

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r223774544 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -63,7 +65,7 @@ private[spark] object PythonEvalType

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r223507242 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -63,7 +65,7 @@ private[spark] object PythonEvalType

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r223505747 --- Diff: python/pyspark/worker.py --- @@ -154,6 +154,47 @@ def wrapped(*series): return lambda *a: (wrapped(*a), arrow_return_type

[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...

2018-10-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r223506840 --- Diff: python/pyspark/worker.py --- @@ -154,6 +154,47 @@ def wrapped(*series): return lambda *a: (wrapped(*a), arrow_return_type

[GitHub] spark issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregati...

2018-10-08 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22305 I think there is a typo in your example in the description ``` @pandas_udf('double', PandasUDFType.GROUPED_AGG) def avg(v): return v.mean() return avg ``` I think

[GitHub] spark pull request #22653: [SPARK-25659][PYTHON][TEST] Test type inference s...

2018-10-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22653#discussion_r223475373 --- Diff: python/pyspark/sql/tests.py --- @@ -1149,6 +1149,75 @@ def test_infer_schema(self): result = self.spark.sql("SELECT l[0].a

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-10-05 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 Thanks for the review @holdenk ! I haven't had time to followup, but I'll take a look through this and see what I can do about making things clearer

[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...

2018-10-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r223116201 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -3279,34 +3280,33 @@ class Dataset[T] private[sql]( val timeZoneId

[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...

2018-10-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r223116082 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -3279,34 +3280,33 @@ class Dataset[T] private[sql]( val timeZoneId

[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

2018-10-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22610#discussion_r223070065 --- Diff: python/pyspark/sql/functions.py --- @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None): can fail

[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22610 > It is pretty new one, is it said we need to upgrade to latest PyArrow in order to use it? Since it is an option at Table.from_pandas, is it possible to extend it to pyarrow.Ar

[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22610 > Thanks, @BryanCutler. WDYT about documenting the type map thing? I think that would help in the cases of dates/times because those can get a little confusing. For primitives, I th

[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22610 So pyarrow just added an option when converting from Pandas to raise an error for unsafe casts. I'd have to try it out to see if it would prevent this case though. It's a common option when

[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22610#discussion_r222501309 --- Diff: python/pyspark/worker.py --- @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type): arrow_return_type = to_arrow_type

[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22610 Thanks for looking into this @viirya ! I agree that there are lots of cases where casting to another type is intentional and works fine, so this isn't a bug. The only other idea I have

[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

2018-10-02 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22610#discussion_r222007837 --- Diff: python/pyspark/worker.py --- @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type): arrow_return_type = to_arrow_type

[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-25 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22540#discussion_r220283006 --- Diff: python/pyspark/worker.py --- @@ -97,8 +97,9 @@ def verify_result_length(*a): def wrap_grouped_map_pandas_udf(f, return_type

[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-25 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22540#discussion_r220274047 --- Diff: python/pyspark/worker.py --- @@ -97,8 +97,9 @@ def verify_result_length(*a): def wrap_grouped_map_pandas_udf(f, return_type

[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-25 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22540#discussion_r220272980 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -131,11 +131,8 @@ object ArrowUtils { } else

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-09-20 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 > generally, is this going to limit how much data to pass along because of the bit length of the index? So the index passed to python is the RecordBatch index, not an element in

[GitHub] spark issue #22477: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error w...

2018-09-20 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22477 Thanks @HyukjinKwon !

[GitHub] spark issue #22477: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error w...

2018-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22477 cc @HyukjinKwon

[GitHub] spark issue #22477: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error w...

2018-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22477 retest this please

[GitHub] spark pull request #22477: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test ...

2018-09-19 Thread BryanCutler
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/22477 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas

[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22275 @holdenk I was wondering if you had any thoughts on this? Thanks!

[GitHub] spark issue #20908: [SPARK-23672][PYTHON] Document support for nested return...

2018-09-10 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20908 merged to master and branch-2.4, thanks @holdenk !

[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-10 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140 > Thanks for your understanding. Normally, we are very conservative to introduce any potential behavior change to the released version. Yes, I know. It seemed to me at the t

[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-10 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140 > Can we just simply take this out from branch-2.3? Thanks @HyukjinKwon , that is fine with me. What do you think @gatorsm

[GitHub] spark pull request #22369: [SPARK-25072][DOC] Update migration guide for beh...

2018-09-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22369#discussion_r216147674 --- Diff: docs/sql-programming-guide.md --- @@ -1901,6 +1901,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-08 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140 @gatorsmile it seemed like a straightforward bug to me. Rows with extra values lead to incorrect output and exceptions when used in `DataFrames`, so it did not seem like there was any possible

[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140 merged to master, branch 2.4 and 2.3. Thanks @xuanyuanking !

[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140 > yea, to me it looks less sense actually but seems at least working for now: good point, I guess it only fails when you supply a sch

[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-06 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22329 merged to branch-2.4
