[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23063

Let me leave a cc for @holdenk, @MLnick, @jkbradley and @mengxr FYI.
[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

@zsxwing sure. Sorry that I rushed. Will do next time.
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052

I think now is a good time to match the behaviours.
[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23056

Merged to master.
[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23056

@BryanCutler, let me merge this. Let's do the ML one and then clean up the comments throughout both ML and MLlib at once.
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052

Another related attempt: https://github.com/apache/spark/pull/13252
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052

One attempt to add some tests for reading/writing empty dataframes was here: https://github.com/apache/spark/pull/13253 FYI
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052

Which should be ... this https://github.com/apache/spark/pull/12855
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23054

retest this please
[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23056

retest this please
[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23056

retest this please
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22309

Adding @liancheng BTW. IIRC, he took a look at this one before and abandoned the change (correct me if I'm misremembering).
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r234086569

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
   // each python worker gets an equal part of the allocation. the worker pool will grow to the
   // number of concurrent tasks, which is determined by the number of cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

> JVM could set the request

This is handled in the JVM, so it wouldn't break. The `worker` itself is strongly coupled to the JVM. You mean the case where the client is on a Windows machine and uses a Unix-based cluster, right? I think that is what the fix already handles - the `PythonRunner`s are already created on the executor side, so it isn't affected when the client is on Windows.
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r234081475

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
   // each python worker gets an equal part of the allocation. the worker pool will grow to the
   // number of concurrent tasks, which is determined by the number of cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I see. I think the point of view is a bit different. What I was trying to do is this: we declare this configuration as not supported on Windows, meaning we disable it on Windows from the start, on the JVM side - because it's the JVM that launches the Python workers. So I was trying to leave the control to the JVM.

> It seems brittle to disable this on the JVM side and rely on it here. Can we also set a flag in the ImportError case and also check that here?

However, in a way, it's a bit odd to call it brittle because we're already relying on that.
[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23056

retest this please
[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23056#discussion_r234080468

--- Diff: python/pyspark/mllib/tests/test_linalg.py ---
@@ -0,0 +1,642 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+import array as pyarray
+
+from numpy import array, array_equal, zeros, arange, tile, ones, inf
+from numpy import sum as array_sum
+
+if sys.version_info[:2] <= (2, 6):
+    try:
+        import unittest2 as unittest
+    except ImportError:
+        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
+        sys.exit(1)
+else:
+    import unittest
+
+import pyspark.ml.linalg as newlinalg
+from pyspark.mllib.linalg import Vector, SparseVector, DenseVector, VectorUDT, _convert_to_vector, \
+    DenseMatrix, SparseMatrix, Vectors, Matrices, MatrixUDT
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.testing.mllibutils import make_serializer, MLlibTestCase
+
+_have_scipy = False
+try:
+    import scipy.sparse
+    _have_scipy = True
+except:
+    # No SciPy, but that's okay, we'll skip those tests
+    pass
+
+
+ser = make_serializer()
+
+
+def _squared_distance(a, b):
+    if isinstance(a, Vector):
+        return a.squared_distance(b)
+    else:
+        return b.squared_distance(a)
+
+
+class VectorTests(MLlibTestCase):
+
+    def _test_serialize(self, v):
+        self.assertEqual(v, ser.loads(ser.dumps(v)))
+        jvec = self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.loads(bytearray(ser.dumps(v)))
+        nv = ser.loads(bytes(self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.dumps(jvec)))
+        self.assertEqual(v, nv)
+        vs = [v] * 100
+        jvecs = self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.loads(bytearray(ser.dumps(vs)))
+        nvs = ser.loads(bytes(self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.dumps(jvecs)))
+        self.assertEqual(vs, nvs)
+
+    def test_serialize(self):
+        self._test_serialize(DenseVector(range(10)))
+        self._test_serialize(DenseVector(array([1., 2., 3., 4.])))
+        self._test_serialize(DenseVector(pyarray.array('d', range(10))))
+        self._test_serialize(SparseVector(4, {1: 1, 3: 2}))
+        self._test_serialize(SparseVector(3, {}))
+        self._test_serialize(DenseMatrix(2, 3, range(6)))
+        sm1 = SparseMatrix(
+            3, 4, [0, 2, 2, 4, 4], [1, 2, 1, 2], [1.0, 2.0, 4.0, 5.0])
+        self._test_serialize(sm1)
+
+    def test_dot(self):
+        sv = SparseVector(4, {1: 1, 3: 2})
+        dv = DenseVector(array([1., 2., 3., 4.]))
+        lst = DenseVector([1, 2, 3, 4])
+        mat = array([[1., 2., 3., 4.],
+                     [1., 2., 3., 4.],
+                     [1., 2., 3., 4.],
+                     [1., 2., 3., 4.]])
+        arr = pyarray.array('d', [0, 1, 2, 3])
+        self.assertEqual(10.0, sv.dot(dv))
+        self.assertTrue(array_equal(array([3., 6., 9., 12.]), sv.dot(mat)))
+        self.assertEqual(30.0, dv.dot(dv))
+        self.assertTrue(array_equal(array([10., 20., 30., 40.]), dv.dot(mat)))
+        self.assertEqual(30.0, lst.dot(dv))
+        self.assertTrue(array_equal(array([10., 20., 30., 40.]), lst.dot(mat)))
+        self.assertEqual(7.0, sv.dot(arr))
+
+    def test_squared_distance(self):
+        sv = SparseVector(4, {1: 1, 3: 2})
+        dv = DenseVector(array([1., 2., 3., 4.]))
+        lst = DenseVector([4, 3, 2, 1])
+        lst1 = [4, 3, 2, 1]
+        arr = pyarray.array('d', [0, 2, 1, 3])
+        narr = array([0, 2, 1, 3])
+        self.assertEqual(15.0, _squared_distance(sv, dv))
+        self.assertEqual(25.0, _squared_distance(sv, lst))
+        self.assertEqual(20.0, _squared_distan
[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23056#discussion_r234080249

--- Diff: python/pyspark/testing/mllibutils.py ---
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+
+if sys.version_info[:2] <= (2, 6):
+    try:
+        import unittest2 as unittest
+    except ImportError:
+        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
+        sys.exit(1)
+else:
+    import unittest
--- End diff --

@BryanCutler, actually I removed this because we dropped 2.6 support while we were at it. I'm pretty sure we can just import unittest.
[GitHub] spark pull request #23046: [SPARK-23207][SQL][FOLLOW-UP] Use `SQLConf.get.en...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23046#discussion_r234073703

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec {
     }
     // The comparator for comparing row hashcode, which should always be Integer.
     val prefixComparator = PrefixComparators.LONG
-    val canUseRadixSort = SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)
+    val canUseRadixSort = SQLConf.get.enableRadixSort
--- End diff --

Yes .. I don't mind it, but I was just thinking that we don't necessarily need to backport this to all the branches if there's any concern. I will leave it to you guys as well.
[GitHub] spark issue #23055: [SPARK-26080][SQL] Disable 'spark.executor.pyspark.memor...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055

cc @rdblue, @vanzin and @haydenjeune
[GitHub] spark pull request #23055: [SPARK-26080][SQL] Disable 'spark.executor.pyspar...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23055

[SPARK-26080][SQL] Disable 'spark.executor.pyspark.memory' always on Windows

## What changes were proposed in this pull request?

The `resource` package is a Unix-specific package. See https://docs.python.org/2/library/resource.html and https://docs.python.org/3/library/resource.html.

Note that we document Windows support:

> Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS).

This should be backported into branch-2.4 to restore Windows support in Spark 2.4.1.

## How was this patch tested?

Manually mocking the changed logics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-26080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23055.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23055

commit 2d3315a7dab429abc4d9ef5ed7f8f5484e8421f1
Author: hyukjinkwon
Date: 2018-11-16T01:46:31Z

    Disable 'spark.executor.pyspark.memory' on Windows always
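To illustrate the `resource` limitation above, here is a minimal sketch of my own (assuming the worker applies the memory cap via `resource.setrlimit`; this is not the PR's code): any use of the module has to be guarded, because the import itself fails on Windows.

```python
# Illustrative sketch only: 'resource' is Unix-only, so importing it
# raises ImportError on Windows and the cap cannot be applied there.
try:
    import resource
    has_resource = True
except ImportError:
    has_resource = False

def maybe_set_memory_limit(limit_mb):
    # Apply the address-space cap only where 'resource' exists.
    if has_resource and limit_mb > 0:
        limit_bytes = limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```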
[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

Thank you @BryanCutler.
[GitHub] spark pull request #23046: [SPARK-23207][SQL][FOLLOW-UP] Use `SQLConf.get.en...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23046#discussion_r234063905

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec {
     }
     // The comparator for comparing row hashcode, which should always be Integer.
     val prefixComparator = PrefixComparators.LONG
-    val canUseRadixSort = SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)
+    val canUseRadixSort = SQLConf.get.enableRadixSort
--- End diff --

@ueshin, BTW, for clarification: it does read the configuration but does not respect it when the configuration is set on the session, right? I think we don't need to backport this to all the other branches.
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r234062564

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -174,13 +174,18 @@ private[csv] class CsvOutputWriter(
     context: TaskAttemptContext,
     params: CSVOptions) extends OutputWriter with Logging {
 
-  private val charset = Charset.forName(params.charset)
+  private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-
-  private val gen = new UnivocityGenerator(dataSchema, writer, params)
+  override def write(row: InternalRow): Unit = {
+    val gen = univocityGenerator.getOrElse {
--- End diff --

Also, one thing we should not forget is that CSV _could_ have headers even if there are no records.
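A minimal sketch of that header corner case (the output path is hypothetical and this is my own example, not code from the PR): with `header` enabled, even an empty DataFrame is expected to produce a header-only file, so skipping file creation for empty partitions would silently drop the header.

```python
# Hypothetical example: an empty DataFrame can still carry a CSV header row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
empty_df = spark.createDataFrame([], "a INT, b STRING")
# With header=True the output file should contain the line "a,b" and no data rows.
empty_df.write.option("header", True).csv("/tmp/empty-csv-example")
```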
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052

@MaxGekk, actually this is kind of an important behaviour change. It basically means we're unable to read the empty files back. Similar changes were proposed in Parquet a few years ago (by me) and reverted. We should investigate and match the behaviours across datasources first. IIRC, ORC does not create files (if that hasn't changed since I last checked long ago) but Parquet does.
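To make the round-trip concern concrete, a minimal sketch (my own illustration with a hypothetical path): if no file at all is written for an empty result, reading the output directory back cannot recover the schema.

```python
# Hypothetical round-trip: without at least one (schema-bearing) output file,
# the read below would fail or lose the schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).filter("id < 0")   # empty result with a known schema
df.write.parquet("/tmp/empty-parquet-example")
spark.read.parquet("/tmp/empty-parquet-example").printSchema()
```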
[GitHub] spark issue #23047: [BACKPORT][SPARK-25883][SQL][MINOR] Override method `pre...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23047

Merged to branch-2.4.
[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

Also, @BryanCutler, I think we can talk about the locations of `testing/...util.py` later, when we finish splitting the tests. Moving the utils later would probably cause fewer conflicts, and it should be fine to discuss that separately if it's a worry, and change it then.
[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

@BryanCutler, should be ready to work on ML and MLlib as well.
[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

Merged to master.
[GitHub] spark pull request #23034: [SPARK-26035][PYTHON] Break large streaming/tests...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23034#discussion_r233829511

--- Diff: python/pyspark/testing/streamingutils.py ---
@@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import glob
+import os
+import tempfile
+import time
+import unittest
+
+from pyspark import SparkConf, SparkContext, RDD
+from pyspark.streaming import StreamingContext
+
+
+def search_kinesis_asl_assembly_jar():
+    kinesis_asl_assembly_dir = os.path.join(
+        os.environ["SPARK_HOME"], "external/kinesis-asl-assembly")
+
+    # We should ignore the following jars
+    ignored_jar_suffixes = ("javadoc.jar", "sources.jar", "test-sources.jar", "tests.jar")
+
+    # Search jar in the project dir using the jar name_prefix for both sbt build and maven
+    # build because the artifact jars are in different directories.
+    name_prefix = "spark-streaming-kinesis-asl-assembly"
+    sbt_build = glob.glob(os.path.join(
+        kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
+    maven_build = glob.glob(os.path.join(
+        kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
+    jar_paths = sbt_build + maven_build
+    jars = [jar for jar in jar_paths if not jar.endswith(ignored_jar_suffixes)]
+
+    if not jars:
+        return None
+    elif len(jars) > 1:
+        raise Exception(("Found multiple Spark Streaming Kinesis ASL assembly JARs: %s; please "
+                         "remove all but one") % (", ".join(jars)))
+    else:
+        return jars[0]
+
+
+# Must be same as the variable and condition defined in KinesisTestUtils.scala and modules.py
+kinesis_test_environ_var = "ENABLE_KINESIS_TESTS"
+should_skip_kinesis_tests = not os.environ.get(kinesis_test_environ_var) == '1'
+kinesis_asl_assembly_jar = search_kinesis_asl_assembly_jar()
+
+if should_skip_kinesis_tests:
--- End diff --

I simplified this logic and tested each if-else branch.
[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

Will go and merge this tomorrow if there are no outstanding issues. cc @zsxwing and @tdas.
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012

Merged to master. Thanks @felixcheung.
[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

retest this please
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

Merged to master.
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

I am merging this for the same reason as #23021. Let me know if there's any concern even after this gets merged.
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

@BryanCutler, it looks like we should add `pyspark.ml.tests` at https://github.com/apache/spark/blob/master/python/run-tests.py#L252-L253 so that the unittests run before the doctests (because, arguably, unittests take longer than doctests). I think this was missed when `ml/tests.py` was added. For instance, the latest run above took a few minutes longer than usual because the ML tests ran last.
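A rough sketch of the scheduling idea (all names below are made up for illustration; this is not the actual `run-tests.py` code): keep the heavy unittest modules in a priority list so they are dispatched before the doctest modules.

```python
# Hypothetical sketch: schedule long-running unittest modules first so
# the parallel test run is not tail-blocked by them.
priority_modules = ['pyspark.tests', 'pyspark.sql.tests', 'pyspark.ml.tests']

def sort_key(module_name):
    # Priority modules sort first (key 0), everything else after (key 1).
    return (0 if module_name in priority_modules else 1, module_name)

modules = ['pyspark.sql.functions', 'pyspark.ml.tests', 'pyspark.rdd']
for name in sorted(modules, key=sort_key):
    print(name)  # pyspark.ml.tests is dispatched before the doctest modules
```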
[GitHub] spark pull request #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20788#discussion_r233678942

--- Diff: python/pyspark/sql/tests/test_dataframe.py ---
@@ -375,6 +375,19 @@ def test_generic_hints(self):
         plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan()
         self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
 
+    # add tests for SPARK-23647 (test more types for hint)
+    def test_extended_hint_types(self):
+        from pyspark.sql import DataFrame
+
+        df = self.spark.range(10e10).toDF("id")
+        such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
--- End diff --

@DylanGuedes, what do we get if a `dict` is given? It looks untested. If that produces a weird result, we can disallow it here for now.
[GitHub] spark issue #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hint in p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20788

Thanks. @DylanGuedes.
[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23014

Merged to master.
[GitHub] spark issue #23039: [SPARK-26066][SQL] Moving truncatedString to sql/catalys...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23039

@MaxGekk, I think the main purpose of this PR is rather to introduce `spark.sql.debug.maxToStringFields` .. let's fix the PR description and title to reflect that.
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012

Yup, will address the other comments and update the PR accordingly.
[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

retest this please
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

Yup will do.
[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23034

I haven't tested the kinesis logic yet. I will check it via Jenkins. Line counts:

```
751 ./test_dstream.py
 89 ./test_kinesis.py
158 ./test_listener.py
184 ./test_context.py
```
[GitHub] spark pull request #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23034

[WIP][SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files

## What changes were proposed in this pull request?

This PR continues to break down a big file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/streaming/tests.py` into ...:

```
pyspark
├── __init__.py
...
├── streaming
│   ├── __init__.py
...
│   └── tests
│       ├── __init__.py
│       ├── test_context.py
│       ├── test_dstream.py
│       ├── test_kinesis.py
│       └── test_listener.py
...
└── testing
...
    └── streamingutils.py
...
```

## How was this patch tested?

Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run. Each test (not officially) can be run via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-26035

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23034.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23034

commit 6c02003b947a97944f6a48e57be5b7c98dbda896
Author: hyukjinkwon
Date: 2018-11-14T15:15:56Z

    [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
[GitHub] spark issue #21914: [SPARK-24967][SQL] Avro: Use internal.Logging instead fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21914

Please ask that on the mailing list.
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

Rough line distributions look like this:

```
237 ./test_serializers.py
739 ./test_rdd.py
499 ./test_readwrite.py
 69 ./test_join.py
161 ./test_taskcontext.py
 43 ./test_conf.py
122 ./test_broadcast.py
 80 ./test_daemon.py
 86 ./test_util.py
157 ./test_worker.py
112 ./test_profiler.py
181 ./test_shuffle.py
258 ./test_context.py
248 ./test_appsubmit.py
```
[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23033

cc'ing @BryanCutler and @squito.
[GitHub] spark pull request #23033: [SPARK-26036][PYTHON] Break large tests.py files ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23033

[SPARK-26036][PYTHON] Break large tests.py files into smaller files

## What changes were proposed in this pull request?

This PR continues to break down a big file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/tests.py` into ...:

```
pyspark
...
├── testing
...
│   └── utils.py
└── tests
    ├── __init__.py
    ├── test_appsubmit.py
    ├── test_broadcast.py
    ├── test_conf.py
    ├── test_context.py
    ├── test_daemon.py
    ├── test_join.py
    ├── test_profiler.py
    ├── test_rdd.py
    ├── test_readwrite.py
    ├── test_serializers.py
    ├── test_shuffle.py
    ├── test_taskcontext.py
    ├── test_util.py
    └── test_worker.py
...
```

## How was this patch tested?

Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run. Each test (not officially) can be run via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-26036

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23033.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23033

commit 08cb59eeda3fd3b1042013f72f6fc45ea1146bd1
Author: hyukjinkwon
Date: 2018-11-14T10:16:13Z

    Break large tests.py files into smaller files
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

Merged to master.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

I am merging this in - maybe I am rushing it, but please allow me to go ahead since it's going to block other PySpark PRs. In the worst case, I am willing to revert and propose this again if there's something obviously wrong. Please feel free to leave comments even after this is merged. I would appreciate it.
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012

Ah .. right, makes sense to me. Thanks @shaneknapp. +1
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

adding @holdenk, @ueshin and @icexelloss as well.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

adding @icexelloss as well.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

> Did you test on python3 as well?

Of course!
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r233292436

--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       x
     }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+    shouldUseArrow <- tryCatch({
+      stopifnot(length(data) > 0)
+      dataHead <- head(data, 1)
+      # Currenty Arrow optimization does not support POSIXct and raw for now.
+      # Also, it does not support explicit float type set by users. It leads to
+      # incorrect conversion. We will fall back to the path without Arrow optimization.
+      if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
+        stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
+      }
+      if (any(sapply(dataHead, is.raw))) {
+        stop("Arrow optimization with R DataFrame does not support raw type yet.")
+      }
+      if (inherits(schema, "structType")) {
+        if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
+          stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
--- End diff --

I think it's a bug because it always produces a corrupt value when I try to use `number` as an explicit float type.
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012

@shaneknapp, do you roughly know how difficult it is (and do you have some time shortly) to upgrade R from 3.1 to 3.4? I am asking because I had some difficulties when I tried to manually upgrade from a certain low version to another non-latest version. If it's expected to take a while, let's go the deprecation route. If it's expected to be less difficult, let's drop support outright. Does this sound okay to you @felixcheung?
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r233290797

--- Diff: docs/index.md ---
@@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy
 locally on one machine --- all you need is to have `java` installed on your system `PATH`,
 or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}}
+Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0.
--- End diff --

At least the Python version was noted here before.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

> Could you add some descriptions to run a single test file or a single test case if exists?

Done!
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

Yup!
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22994

I haven't taken a super close look, but the idea itself looks okay. Is it urgent? If yes, yup, I don't object to going ahead right away. Otherwise, it might be good to leave this open for a few days for review comments .. Let me leave some cc's for @srowen, @felixcheung, @holdenk ..
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

I am going to push after testing and double checking. The line counts would look like this:

```
 54 ./test_utils.py
199 ./test_catalog.py
503 ./test_grouped_agg_pandas_udf.py
 45 ./test_group.py
320 ./test_session.py
153 ./test_readwriter.py
806 ./test_scalar_pandas_udf.py
216 ./test_pandas_udf.py
566 ./test_streaming.py
 55 ./test_conf.py
 16 ./__init__.py
530 ./test_grouped_map_pandas_udf.py
157 ./test_column.py
654 ./test_udf.py
262 ./test_window_pandas_udf.py
278 ./test_functions.py
263 ./test_context.py
138 ./test_serde.py
170 ./test_datasources.py
399 ./test_arrow.py
 96 ./test_appsubmit.py
944 ./test_types.py
737 ./test_dataframe.py
```
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

> I'd break the pandas udf one into smaller pieces too, as you suggested. We should also investigate why the runtime didn't improve ...

One suspicion from my investigation is that it requires stopping and starting the context for each test, which is costly. I also expected it to slightly decrease the time, but actually it looks like it slightly increased the time (I guess 2 ~ 3 mins in total? - shouldn't be a big deal).
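For reference, a minimal sketch of the usual amortization pattern (simplified class of my own; not the actual PySpark testing util): a class-level fixture keeps one session alive across the tests in a class, so only class boundaries pay the restart cost.

```python
import unittest
from pyspark.sql import SparkSession

class ReusedSparkSessionTestCase(unittest.TestCase):
    """Share one SparkSession per test class to avoid per-test restart cost."""

    @classmethod
    def setUpClass(cls):
        # Started once per class, not once per test method.
        cls.spark = SparkSession.builder.master("local[4]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()
```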
[GitHub] spark pull request #23021: [SPARK-26032][PYTHON] Break large sql/tests.py fi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23021#discussion_r233269827

--- Diff: python/pyspark/testing/sqlutils.py ---
@@ -0,0 +1,268 @@
+#
--- End diff --

Yea, similar thought. One thing, though: the testing utils themselves have some dependencies on each other (for instance, `from pyspark.tests import ReusedPySparkTestCase`). So I thought it's okay to keep them separate. Another benefit of doing so is that we can simply refer to the example (NumPy) as is. As you already know, NumPy has been around for more than 15 years and should be mature enough. It should be okay to trust their approach.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

Yup, will break the pandas one into smaller ones as well.
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22962

Also, please fix the test. The test doesn't really look clear. I actually didn't quite like the test as written here.
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r233131375

--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Also, next time please fully describe what's going on in the PR description. Even I was confused and misread that `__new__` is actually being inherited.
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r233130494

--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Also, we should remove `__init__` too. That's what the Python interpreter will implicitly insert for both.
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r233130221

--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Can we get rid of the override altogether? It's never a good idea to override them unless it's absolutely required.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

Elapsed time looks virtually the same. All tests look to be running fine. The last commit should show the skipped tests fine as well. Should be ready for a look.
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23004#discussion_r233012392

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
         if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
       list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
         l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+      statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+        l <= statsFor(a).upperBound.substr(0, Length(l))
+    case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --

BTW, `a.startsWith(b)` and `b.startsWith(a)` are not the same, so why are they handled the same here?
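For intuition, a small illustration of the pruning condition in the first `StartsWith` case above (plain Python, my own example): if every value in a cached batch lies between `lowerBound` and `upperBound`, the batch can only contain strings starting with prefix `l` when `l` falls between the same-length prefixes of the two bounds.

```python
# Illustration of the stats-based pruning condition for startsWith.
lower, upper = "apple", "cherry"   # per-batch min/max of the column
prefix = "ba"                      # literal in col.startsWith("ba")

# Batch is kept only if the prefix is between the truncated bounds.
may_contain_match = lower[:len(prefix)] <= prefix <= upper[:len(prefix)]
print(may_contain_match)  # True: "ap" <= "ba" <= "ch"

prefix2 = "zz"
print(lower[:len(prefix2)] <= prefix2 <= upper[:len(prefix2)])  # False: batch pruned
```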
[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r233009506

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

@MaxGekk, it looks like we should do the same thing in the schema inference path as well.
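A tiny Python analogue of the legacy parsing shown in the diff (illustration only; Spark's actual parser is the Scala code above): strip the grouping separators before handing the string to the decimal parser.

```python
# Sketch of the legacy behaviour: remove "," group separators, then parse.
from decimal import Decimal

def parse_legacy_decimal(s):
    return Decimal(s.replace(",", ""))

print(parse_legacy_decimal("1,234,567.89"))  # Decimal('1234567.89')
```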
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588

To all: how about we restart the fix @wangyum tried before? If we generally agree on the direction itself - upgrading Hive to 2.3 (or 3) - I would like to encourage him to continue #20659.
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588

The test failure itself doesn't look caused by this change. The tests will fail anyway with a different error message. If the goal is really just to check whether the tests pass, you should use the `com.github.hyukjinkwon` artifact instead of `org.spark-project.hive` in Spark, which I released under my personal domain to test this, as I did above. We should fix the Hive issue we're discussing right now first, so that we can safely merge this PR and verify that the Hadoop 3 profile works fine.
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r232991503

--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

BTW, I think we should just use `BarrierTaskContext()`. Let's not make it complicated next time.
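A minimal sketch of the suggested simplification (class skeletons reduced for illustration; not the full PySpark code): construct the subclass directly rather than going through `object.__new__`.

```python
class TaskContext(object):
    _taskContext = None

class BarrierTaskContext(TaskContext):
    @classmethod
    def _getOrCreate(cls):
        # Replace any plain TaskContext (or None) with a real BarrierTaskContext;
        # plain construction keeps normal __init__ semantics intact.
        if not isinstance(cls._taskContext, BarrierTaskContext):
            cls._taskContext = BarrierTaskContext()
        return cls._taskContext
```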
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22962

The main code change LGTM too in any event
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r232990319

--- Diff: python/pyspark/tests.py ---
@@ -618,10 +618,13 @@ def test_barrier_with_python_worker_reuse(self):
         """
         Verify that BarrierTaskContext.barrier() with reused python worker.
         """
+        self.sc._conf.set("spark.python.work.reuse", "true")
--- End diff --

Yup, sorry for the late response.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

For your information, here are the line counts for each file:

```
  52 ./test_utils.py
 197 ./test_catalog.py
  43 ./test_group.py
 318 ./test_session.py
 151 ./test_readwriter.py
2180 ./test_pandas_udf.py
 564 ./test_streaming.py
  53 ./test_conf.py
 155 ./test_column.py
 652 ./test_udf.py
 276 ./test_functions.py
 261 ./test_context.py
 136 ./test_serde.py
 168 ./test_datasources.py
 397 ./test_arrow.py
  94 ./test_appsubmit.py
 942 ./test_types.py
 735 ./test_dataframe.py
```

It's rather evenly distributed, except for `./test_pandas_udf.py`. Maybe we should make `test_scalar_pandas_udf.py`, etc., but I would like to avoid that for now.
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

FWIW, I at least double checked whether any tests are missing, and whether they are actually being run (via coverage).
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

adding @rxin (derived from mailing list)
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23021

@BryanCutler and @squito, here is the official first attempt to break `pyspark/sql/tests.py` into multiple small files. If there are no outstanding issues (for instance, if we agree on the high-level approach), I would like to push this quickly, as it is very likely to run into conflicts.
[GitHub] spark pull request #23021: [SPARK-26032][PYTHON] Break large sql/tests.py fi...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23021

[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files

## What changes were proposed in this pull request?

This is the official first attempt to break the huge single `tests.py` file - I did it locally a few times before and gave up for various reasons. At this point it really makes the unittests super hard to read and difficult to check. It even bothers me to scroll through the big file. It's a single 7,000-line file!

This is not only a readability issue. Since one big test file takes most of the test time, the tests don't run fully in parallel, which increases the total test time.

We could pick one project as an example and follow it. From my investigation, the proposed style is closest to NumPy's structure and looks easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/sql/tests.py` into ...:

```bash
pyspark
...
├── sql
...
│   └── tests  # Includes all tests broken down from 'pyspark/sql/tests.py'.
│       │      # Each matches a module in 'pyspark/sql'. Additionally, some logical groups can
│       │      # be added. For instance, 'test_arrow.py', 'test_datasources.py', 'test_serde.py'
│       │      # and 'test_submit.py' were added.
│       ├── __init__.py
│       ├── test_arrow.py
│       ├── test_catalog.py
│       ├── test_column.py
│       ├── test_conf.py
│       ├── test_context.py
│       ├── test_dataframe.py
│       ├── test_datasources.py
│       ├── test_functions.py
│       ├── test_group.py
│       ├── test_readwriter.py
│       ├── test_serde.py
│       ├── test_session.py
│       ├── test_streaming.py
│       ├── test_submit.py
│       ├── test_types.py
│       ├── test_udf.py
│       └── test_utils.py
...
└── testing  # Includes testing utils that can be used in unittests.
    ├── __init__.py
    └── sqlutils.py
...
```

## How was this patch tested?

Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25344

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23021.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23021

commit dee09b73757bbad73cbe274d6a9873d8cc143423
Author: hyukjinkwon
Date: 2018-11-13T08:41:57Z

    Break larga sql/tests.py files into smaller files
[GitHub] spark issue #23020: [MINOR][BUILD] Remove *.crc from .gitignore
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23020 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232895848

--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                             numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
+  shouldUseArrow <- FALSE
+  firstRow <- NULL
   if (is.data.frame(data)) {
-      # Convert data into a list of rows. Each row is a list.
-
-      # get the names of columns, they will be put into RDD
-      if (is.null(schema)) {
-        schema <- names(data)
-      }
+    # get the names of columns, they will be put into RDD
+    if (is.null(schema)) {
+      schema <- names(data)
+    }

-      # get rid of factor type
-      cleanCols <- function(x) {
-        if (is.factor(x)) {
-          as.character(x)
-        } else {
-          x
-        }
+    # get rid of factor type
+    cleanCols <- function(x) {
+      if (is.factor(x)) {
+        as.character(x)
+      } else {
+        x
       }
+    }
+    data[] <- lapply(data, cleanCols)
+
+    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+    if (arrowEnabled) {
+      shouldUseArrow <- tryCatch({
--- End diff --

Yup, correct. Let me address the other comments as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
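For reference, the conf checked in this R diff mirrors the Arrow conversion path PySpark already has for pandas DataFrames. A minimal PySpark sketch, assuming a running `spark` session and the Spark 2.4-era conf key:

```python
import pandas as pd

# Enable the Arrow-based conversion path (Spark 2.4-era conf key).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)  # goes through Arrow when the column types are supported
df.show()
```

When the conf is enabled and all column types are supported, the conversion goes through Arrow; otherwise Spark can fall back to the non-Arrow path.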
[GitHub] spark issue #23006: [SPARK-26007][SQL] DataFrameReader.csv() respects to spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23006 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23014: [MINOR][SQL] Add disable bucketedRead workaround ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23014#discussion_r232893546

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -101,10 +101,11 @@ private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
     String message = "Cannot reserve additional contiguous bytes in the vectorized reader (" +
       (requiredCapacity >= 0 ? "requested " + requiredCapacity + " bytes" : "integer overflow") +
       "). As a workaround, you can reduce the vectorized reader batch size, or disable the " +
-      "vectorized reader. For parquet file format, refer to " +
+      "vectorized reader, or disable " + SQLConf.BUCKETING_ENABLED().key() + " if you read " +
+      "from bucket table. For Parquet file format, refer to " +
       SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() + " (default " +
       SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().defaultValueString() +
-      ") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for orc file format, " +
+      ") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for Orc file format, " +
--- End diff --

`Orc` is `ORC` BTW :-). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23014

> The reason is that each bucket file is too big

Can you elaborate, please? Is it because we don't chunk each file into multiple splits when we read a bucketed table? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23014: [MINOR][SQL] Add disable bucketedRead workaround ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23014#discussion_r232885260

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -101,7 +101,8 @@ private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
     String message = "Cannot reserve additional contiguous bytes in the vectorized reader (" +
       (requiredCapacity >= 0 ? "requested " + requiredCapacity + " bytes" : "integer overflow") +
       "). As a workaround, you can reduce the vectorized reader batch size, or disable the " +
-      "vectorized reader. For parquet file format, refer to " +
+      "vectorized reader, or disable " + SQLConf.BUCKETING_ENABLED().key() + " if you read " +
+      "from bucket table. For parquet file format, refer to " +
--- End diff --

parquet -> Parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
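The workaround named in the new message text can be applied from any client that can set SQL confs; `SQLConf.BUCKETING_ENABLED` resolves to the key used below. A hedged sketch, assuming a running `spark` session and a hypothetical table name:

```python
# Disable bucketed reads so one huge bucket file is no longer pinned to a single task.
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")

df = spark.table("some_bucketed_table")  # hypothetical table name
df.count()
```

With bucketing disabled, the scan falls back to a plain file read, which can split large files of splittable formats across multiple tasks.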
[GitHub] spark issue #23018: [SPARK-26023][SQL] Dumping truncated plans and generated...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23018 Looks fine to me. Adding @cloud-fan and @hvanhovell. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23018: [SPARK-26023][SQL] Dumping truncated plans and ge...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23018#discussion_r232883084

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
@@ -469,7 +471,21 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   def treeString: String = treeString(verbose = true)

   def treeString(verbose: Boolean, addSuffix: Boolean = false): String = {
-    generateTreeString(0, Nil, new StringBuilder, verbose = verbose, addSuffix = addSuffix).toString
+    val writer = new StringBuilderWriter()
+    try {
+      treeString(writer, verbose, addSuffix, None)
+      writer.toString
+    } finally {
+      writer.close()
+    }
+  }
+
+  def treeString(
+      writer: Writer,
+      verbose: Boolean,
+      addSuffix: Boolean,
+      maxFields: Option[Int]): Unit = {
+    generateTreeString(0, Nil, writer, verbose, "", addSuffix)
--- End diff --

If #22879 is merged first, we should add that function here. If this one is merged first, that PR had better add the function. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
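For context on where these tree strings surface to end users, a small PySpark illustration (the Writer/maxFields plumbing in the diff itself is Scala-internal; `spark` is an assumed running session):

```python
df = spark.range(100).selectExpr("id % 10 AS bucket").groupBy("bucket").count()
df.explain(True)  # prints the parsed/analyzed/optimized/physical plan tree strings
```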
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012 In this way, we could postpone the R upgrade in Jenkins until after the Spark 3.0.0 release, and could still test the deprecated R version 3.1. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012 Nice, thanks! BTW Felix, are you worried that if we happen to upgrade the R version in Jenkins to 3.4, we could break support for the lower, deprecated R versions in Spark 3.0? If so, let me put the version check into both places, `generic.R` and `shell.R`. In this way, both shell and submit still show the warning, but the tests will pass. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Thank you all!! ---
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008 BTW, let's test them end-to-end. For instance, `spark.range(1).rdd.map(lambda blabla).count()`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
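A sketch of the kind of end-to-end check meant here, assuming a local session; the class and app names are illustrative:

```python
from collections import namedtuple

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("e2e-namedtuple").getOrCreate()
Point = namedtuple("Point", ["x", "y"])

# The lambda closes over the namedtuple class, so the class has to survive
# pickling on the driver and unpickling on the executors.
result = spark.range(3).rdd.map(lambda row: Point(row.id, row.id * 2).y).collect()
print(result)  # [0, 2, 4]
```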
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008 If the perf diff is big, let's not make the change, but document that we can use `CloudPickleSerializer()` to avoid the breaking change. If the perf diff is rather trivial, let's check whether we can keep this change. I will help check the perf in that case as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008 Nope, it has to be done manually. It would be great to have it, FWIW. I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now - that will at the very least give some numbers to compare. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
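One shape such a manual measurement could take, as a hedged sketch (setup, sizes, and names are illustrative, not from the PR):

```python
import time
from collections import namedtuple

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

Point = namedtuple("Point", ["x", "y"])


def measure(serializer=None):
    # serializer=None falls back to PySpark's default pickle-based serializer.
    kwargs = {"serializer": serializer} if serializer is not None else {}
    sc = SparkContext("local[4]", "namedtuple-perf", **kwargs)
    try:
        start = time.time()
        # collect() forces the Point instances themselves through data
        # serialization, which is the path the removed patch used to cover.
        sc.parallelize(range(200000)).map(lambda i: Point(i, i)).collect()
        return time.time() - start
    finally:
        sc.stop()


# With the patch removed, the default-serializer run may fail for namedtuples
# defined in __main__ (the breaking change under discussion); the
# CloudPickleSerializer run is the documented workaround to time against it.
print("cloudpickle: %.2fs" % measure(CloudPickleSerializer()))
```

Running the same job with the default serializer, in a configuration where it still works, would give the comparison baseline.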
[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23011 Merged to master. Thanks, @srowen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232742917

--- Diff: R/pkg/R/sparkR.R ---
@@ -283,6 +283,10 @@ sparkR.session <- function(
   enableHiveSupport = TRUE,
   ...) {

+  if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) {
+    warning("R prior to version 3.4 is deprecated as of Spark 3.0.")
+  }
--- End diff --

@felixcheung, I updated the condition to check the major version as well. BTW, if we add it in `.First` in general.R, it looks like it's not going to print the warning in the SparkR shell. When we deprecated old Python versions in PySpark, we did it in a similar place (https://github.com/apache/spark/pull/15733/files#diff-01e513a19a2a6f31ecaabd3dd6ac12d9R192). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
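The referenced PySpark precedent follows the same shape: a startup version check that warns rather than fails. A minimal sketch of the pattern, not the actual Spark source:

```python
import sys
import warnings

# Warn once at startup when the interpreter is below the supported floor,
# mirroring the R-side compareVersion() check above.
if sys.version_info < (3, 4):
    warnings.warn("Python prior to version 3.4 is deprecated as of Spark 3.0.")
```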
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012 Yea, I will take a look and address it. But about documenting it as unsupported: if we are going to explicitly say it's unsupported and dropped, we should, for instance, remove the compatibility change (https://github.com/apache/spark/blob/master/R/pkg/src-native/string_hash_code.c), and I assume previous versions then won't work. A deprecation step might be more conservative and consistent with the dropping steps of the other language APIs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22429 Oops, I rushed to read. Yea, it still sounds related but orthogonal. Let's move it to the mailing list; that should be the best place to discuss it further. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22429 @boy-uber, for Structured Streaming, let's do it outside of this PR. I think the actual change in this PR can be small (1.). We can change this API for Structured Streaming later if needed, since this is just an internal private API. Change (2.) can be shared, I guess, even if we go for the Structured Streaming work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
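As an aside for PySpark readers, the plan dump discussed here can be approximated through the internal Py4J handle; a hedged sketch (`_jdf` is a private API, `spark` is an assumed running session, and the output path is illustrative):

```python
df = spark.range(10).selectExpr("id * 2 AS doubled")

# Render the full query execution (parsed/analyzed/optimized/physical plans)
# and write it to a local file on the driver.
plan_text = df._jdf.queryExecution().toString()
with open("/tmp/query_execution.txt", "w") as f:
    f.write(plan_text)
```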