[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22772 Minor fixes are minor .. no need to rush to get this into 2.4. Let's take a look a few more times before going ahead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22770: [SPARK-25771][PYSPARK]Fix improper synchronization in Py...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22770 Please fill `How was this patch tested?` as well in the PR description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22770: [SPARK-25771][PYSPARK]Fix improper synchronizatio...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22770#discussion_r226515494 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala --- @@ -31,15 +32,15 @@ import org.apache.spark.security.SocketAuthHelper import org.apache.spark.util.{RedirectThread, Utils} private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String, String]) - extends Logging { + extends Logging { self => import PythonWorkerFactory._ // Because forking processes from Java is expensive, we prefer to launch a single Python daemon, // pyspark/daemon.py (by default) and tell it to fork new workers for our tasks. This daemon // currently only works on UNIX-based systems now because it uses signals for child management, // so we can also fall back to launching workers, pyspark/worker.py (by default) directly. - val useDaemon = { + private val useDaemon = { --- End diff -- `daemonModule` and `workerModule` too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22770: [SPARK-25771][PYSPARK]Fix improper synchronizatio...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22770#discussion_r226515461 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala --- @@ -31,15 +32,15 @@ import org.apache.spark.security.SocketAuthHelper import org.apache.spark.util.{RedirectThread, Utils} private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String, String]) - extends Logging { + extends Logging { self => import PythonWorkerFactory._ // Because forking processes from Java is expensive, we prefer to launch a single Python daemon, // pyspark/daemon.py (by default) and tell it to fork new workers for our tasks. This daemon // currently only works on UNIX-based systems now because it uses signals for child management, // so we can also fall back to launching workers, pyspark/worker.py (by default) directly. - val useDaemon = { + private val useDaemon = { --- End diff -- Why are we fixing this? It looks not directly related to the fix, and this is already a private class. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22772 I think we'd better do some proofreading before opening multiple minor follow-ups. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22762#discussion_r226377181 --- Diff: python/pyspark/sql/tests.py --- @@ -225,6 +225,63 @@ def sql_conf(self, pairs): else: self.spark.conf.set(key, old_value) +@contextmanager +def database(self, *databases): +""" +A convenient context manager to test with some specific databases. This drops the given +databases if exist and sets current database to "default" when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +try: +yield +finally: +for db in databases: +self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db) +self.spark.catalog.setCurrentDatabase("default") + +@contextmanager +def table(self, *tables): +""" +A convenient context manager to test with some specific tables. This drops the given tables +if exist when it exits. --- End diff -- wait .. not a big deal but typo: `if exist when it exits.`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmana...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22762 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22762#discussion_r226267706 --- Diff: python/pyspark/sql/tests.py --- @@ -225,6 +225,55 @@ def sql_conf(self, pairs): else: self.spark.conf.set(key, old_value) +@contextmanager +def database(self, *databases): +""" +A convenient context manager to test with some specific databases. This drops the given +databases if exist and sets current database to "default" when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +if len(databases) == 1 and isinstance(databases[0], (list, set)): --- End diff -- like .. an assert by this condition `all(map(lambda d: isinstance(d, str), databases))` should be good enough. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22762#discussion_r226264798 --- Diff: python/pyspark/sql/tests.py --- @@ -225,6 +225,55 @@ def sql_conf(self, pairs): else: self.spark.conf.set(key, old_value) +@contextmanager +def database(self, *databases): +""" +A convenient context manager to test with some specific databases. This drops the given +databases if exist and sets current database to "default" when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +if len(databases) == 1 and isinstance(databases[0], (list, set)): --- End diff -- hey @ueshin, how about we just allow `database(a, b, ...)` case only for now? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
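For illustration, a minimal sketch of the helper under discussion, restricted to the `database(a, b, ...)` varargs case with the suggested type assert; the names here are hypothetical and this is not the exact code merged into `python/pyspark/sql/tests.py`.

```python
from contextlib import contextmanager


class SQLTestUtilsSketch(object):
    """Hypothetical mixin; assumes the test case exposes a `spark` session attribute."""

    @contextmanager
    def database(self, *databases):
        """Drop the given databases and reset the current database when the block exits."""
        assert hasattr(self, "spark"), "it should have a 'spark' attribute holding a SparkSession"
        assert all(isinstance(db, str) for db in databases), "database names should be strings"
        try:
            yield
        finally:
            for db in databases:
                self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db)
            self.spark.catalog.setCurrentDatabase("default")
```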
[GitHub] spark issue #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmana...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22762 I like it! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22171 Hm, actually I thought this makes sense tho. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22295 Looks close to go. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226184011 --- Diff: python/pyspark/sql/functions.py --- @@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}): return Column(jc) +@since(3.0) +def _getActiveSession(): --- End diff -- I mean the function itself .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22761: [MINOR][DOC] Spacing items in migration guide for readab...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22761 Of course! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3034: [WIP] ZEPPELIN-3552. Support Scala 2.12 of SparkInterp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3034 > Spark 2.4 will officially support Scala 2.12, so it will be great if Zeppelin will support it together with Spark. And also, there are some libs that are Scala 2.12 only The default distribution will still be with Scala 2.11 for Spark 2.4 if I am not mistaken. It is nice to support it, but Spark 2.4 with 2.11 should be supported first as a higher priority. I can work on 2.4.0 with 2.12 support further after this one is merged. ---
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon closed the pull request at: https://github.com/apache/zeppelin/pull/3206 ---
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
GitHub user HyukjinKwon reopened a pull request: https://github.com/apache/zeppelin/pull/3206 [WIP][ZEPPELIN-3810] Support Spark 2.4 ### What is this PR for? Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see SPARK-24418). There are two problems with this upgrade on the Zeppelin side: 1. Some methods that were used privately via reflection, for instance `loopPostInit`, became inaccessible. See: - https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala - https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala To work around this, I manually ported `loopPostInit` from 2.11.8 to retain the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 2.11.12 are used inside the new `loopPostInit` via reflection. 2. Upgrading from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. Otherwise, we will hit: ``` Caused by: java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139) ``` To work around this, I upgraded jline from `2.12.1` to `2.14.3`. ### What type of PR is it? [Improvement] ### Todos * [ ] - Wait until Spark 2.4.0 is officially released. ### What is the Jira issue? * https://issues.apache.org/jira/browse/ZEPPELIN-3810 ### How should this be tested? Verified manually against Spark 2.4.0 RC3 ### Questions: * Do the license files need an update? Yes * Are there breaking changes for older versions? No * Does this need documentation? No You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zeppelin/pull/3206.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3206 commit e2d224aadf0e54533837a9a89f8e9d3586aee3a9 Author: hyukjinkwon Date: 2018-10-17T14:41:29Z Support Spark 2.4 ---
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226166020 --- Diff: python/pyspark/sql/tests.py --- @@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self): spark.stop() +class SparkSessionTests2(unittest.TestCase): + +def test_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +activeSession = SparkSession.getActiveSession() +df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name']) +self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')]) +finally: +spark.stop() + +def test_get_active_session_when_no_active_session(self): +active = SparkSession.getActiveSession() +self.assertEqual(active, None) +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +active = SparkSession.getActiveSession() +self.assertEqual(active, spark) +spark.stop() +active = SparkSession.getActiveSession() +self.assertEqual(active, None) + +def test_SparkSession(self): +spark = SparkSession.builder \ +.master("local") \ +.config("some-config", "v2") \ +.getOrCreate() +try: +self.assertEqual(spark.conf.get("some-config"), "v2") +self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2") +self.assertEqual(spark.version, spark.sparkContext.version) +spark.sql("CREATE DATABASE test_db") +spark.catalog.setCurrentDatabase("test_db") +self.assertEqual(spark.catalog.currentDatabase(), "test_db") +spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet") +self.assertEqual(spark.table("table1").columns, ['name', 'age']) +self.assertEqual(spark.range(3).count(), 3) +finally: +spark.stop() + +def test_global_default_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +self.assertEqual(SparkSession.builder.getOrCreate(), spark) +finally: +spark.stop() + +def test_default_and_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +activeSession = spark._jvm.SparkSession.getActiveSession() +defaultSession = spark._jvm.SparkSession.getDefaultSession() +try: +self.assertEqual(activeSession, defaultSession) +finally: +spark.stop() + +def test_config_option_propagated_to_existing_SparkSession(self): +session1 = SparkSession.builder \ +.master("local") \ +.config("spark-config1", "a") \ +.getOrCreate() +self.assertEqual(session1.conf.get("spark-config1"), "a") +session2 = SparkSession.builder \ +.config("spark-config1", "b") \ +.getOrCreate() +try: +self.assertEqual(session1, session2) +self.assertEqual(session1.conf.get("spark-config1"), "b") +finally: +session1.stop() + +def test_new_session(self): +session = SparkSession.builder \ +.master("local") \ +.getOrCreate() +newSession = session.newSession() +try: +self.assertNotEqual(session, newSession) +finally: +session.stop() +newSession.stop() + +def test_create_new_session_if_old_session_stopped(self): +session = SparkSession.builder \ +.master("local") \ +.getOrCreate() +session.stop() +newSession = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +self.assertNotEqual(session, newSession) +finally: +newSession.stop() + +def test_active_session_with_None_and_not_None_context(self): +from pyspark.context import SparkContext +from pyspark.conf impo
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226166057 --- Diff: python/pyspark/sql/tests.py --- @@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self): spark.stop() +class SparkSessionTests2(unittest.TestCase): + +def test_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +activeSession = SparkSession.getActiveSession() +df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name']) +self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')]) +finally: +spark.stop() + +def test_get_active_session_when_no_active_session(self): +active = SparkSession.getActiveSession() +self.assertEqual(active, None) +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +active = SparkSession.getActiveSession() +self.assertEqual(active, spark) +spark.stop() +active = SparkSession.getActiveSession() +self.assertEqual(active, None) + +def test_SparkSession(self): +spark = SparkSession.builder \ +.master("local") \ +.config("some-config", "v2") \ +.getOrCreate() +try: +self.assertEqual(spark.conf.get("some-config"), "v2") +self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2") +self.assertEqual(spark.version, spark.sparkContext.version) +spark.sql("CREATE DATABASE test_db") +spark.catalog.setCurrentDatabase("test_db") +self.assertEqual(spark.catalog.currentDatabase(), "test_db") +spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet") +self.assertEqual(spark.table("table1").columns, ['name', 'age']) +self.assertEqual(spark.range(3).count(), 3) +finally: +spark.stop() + +def test_global_default_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +self.assertEqual(SparkSession.builder.getOrCreate(), spark) +finally: +spark.stop() + +def test_default_and_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +activeSession = spark._jvm.SparkSession.getActiveSession() +defaultSession = spark._jvm.SparkSession.getDefaultSession() +try: +self.assertEqual(activeSession, defaultSession) +finally: +spark.stop() + +def test_config_option_propagated_to_existing_SparkSession(self): --- End diff -- Let's just above `SparkSession` -> `spark_session` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226165866 --- Diff: python/pyspark/sql/functions.py --- @@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}): return Column(jc) +@since(3.0) +def _getActiveSession(): --- End diff -- eh.. why is it in functions.py? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
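A usage sketch of the API under review, assuming it ends up exposed as `SparkSession.getActiveSession()` on the Python side rather than as a private helper in `functions.py`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
try:
    # Expected to return the session created above while it is active.
    active = SparkSession.getActiveSession()
    df = active.createDataFrame([(1, 'Alice')], ['age', 'name'])
    print(df.collect())  # [Row(age=1, name='Alice')]
finally:
    spark.stop()
```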
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21990 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22675: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22675 Looks cool otherwise! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226165068 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- Shall we consistently format code-like terms such as `StructType` as code, i.e. `` `StructType` ``? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226164867 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML. --- End diff -- really personal preference tho .. `like` -> `such as` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226164813 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources --- End diff -- Should it be `Datasource` or `Data sources`? I am saying this because there seems to be a mismatch with the menu above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226164476 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- `.` -> `,`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226164264 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. +The schema of the `image` column is: + - origin: String (represents the file path of the image) --- End diff -- I would use SQL types consistently, for instance, StringType, IntegerType --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226163959 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- Where do we describe each field? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226163638 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. + +{% highlight scala %} +scala> spark.read.format("image").load("data/mllib/images/origin") +res1: org.apache.spark.sql.DataFrame = [image: struct] +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. + +{% highlight java %} +Dataset imagesDF = spark.read().format("image").load("data/mllib/images/origin"); +{% endhighlight %} + + + --- End diff -- Shall we add an example for R as well then? It wouldn't be too difficult to add the equivalent examples. Also, I don't think we will add the equivalent examples in different languages at different pages. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
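For reference, a short PySpark sketch of the image data source the documentation above describes; it assumes the `image` format and the `data/mllib/images/origin` sample directory shipped in the Spark source tree.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Each image file becomes one row with a single struct column named "image"
# (fields: origin, height, width, nChannels, mode, data).
df = spark.read.format("image").load("data/mllib/images/origin")
df.printSchema()
df.select("image.origin", "image.width", "image.height").show(truncate=False)

spark.stop()
```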
[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r226149299 --- Diff: sql/core/src/test/resources/test-data/cars-crlf.csv --- @@ -0,0 +1,7 @@ + +year,make,model,comment,blank +"2012","Tesla","S","No comment", + +1997,Ford,E350,"Go get one now they are going fast", +2015,Chevy,Volt + --- End diff -- just for clarification, this file has newline variant combinations, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
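To make the newline question concrete, a small self-contained sketch that writes a CSV mixing `\r\n` and `\n` line endings and reads it back; how the varied endings are treated is exactly what the auto-detection change in this PR is about, so the observed result depends on the patch.

```python
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "cars-crlf.csv")
with open(path, "wb") as f:
    # Mixed newline variants on purpose, like the test resource above.
    f.write(b'year,make,model\r\n"2012","Tesla","S"\n1997,Ford,E350\r\n')

df = spark.read.option("header", "true").csv(path)
df.show()

spark.stop()
```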
[GitHub] spark pull request #22761: [MINOR][DOC] Spacing items in migration guide for...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22761 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already have spaces between them in the migration guide. This PR suggests formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 9 49 12 am](https://user-images.githubusercontent.com/6477701/47126706-4c43da80-d2bc-11e8-966f-ad804ab23174.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark minor-migration-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22761.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22761 commit e1d6a0a9985d0351e4dcc2f868141a079694432d Author: hyukjinkwon Date: 2018-10-18T01:56:16Z Format migration guide for readability --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r226145444 --- Diff: docs/sql-programming-guide.md --- @@ -1978,6 +1978,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string. - Since Spark 2.4, The LOAD DATA command supports wildcard `?` and `*`, which match any one character, and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Special Characters like `space` also now work in paths. Example: `LOAD DATA INPATH '/tmp/folder name/'`. - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row. To restore the previous behavior, set `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`. + - Since Spark 2.4, use of the method `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` should not expect any automatic null handling of the input parameters, thus a null input of a Scala primitive type will be converted to the type's corresponding default value in the UDF. All other UDF declaration and registration methods remain the same behavior as before. --- End diff -- Does this target to backport to 2.4? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r226145051 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala --- @@ -393,4 +393,30 @@ class UDFSuite extends QueryTest with SharedSQLContext { checkAnswer(df, Seq(Row("12"), Row("24"), Row("3null"), Row(null))) } } + + test("SPARK-25044 Verify null input handling for primitive types - with udf()") { +val udf1 = udf({(x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0)}) --- End diff -- nit `udf((x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r226145150 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala --- @@ -393,4 +393,30 @@ class UDFSuite extends QueryTest with SharedSQLContext { checkAnswer(df, Seq(Row("12"), Row("24"), Row("3null"), Row(null))) } } + + test("SPARK-25044 Verify null input handling for primitive types - with udf()") { +val udf1 = udf({(x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0)}) +val df = spark.range(0, 3).toDF("a") + .withColumn("b", udf1($"a", lit(null))) + .withColumn("c", udf1(lit(null), $"a")) + +checkAnswer( + df, + Seq( +Row(0, 1, null), +Row(1, 3, null), +Row(2, 5, null))) + } + + test("SPARK-25044 Verify null input handling for primitive types - with udf.register") { +withTable("t") { + sql("create table t(a varchar(10), b int, c varchar(10)) using parquet") --- End diff -- nit: upper case for keywords --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r226144670 --- Diff: docs/sql-programming-guide.md --- @@ -1951,7 +1951,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. -Users can use explict cast +Users can use explicit cast --- End diff -- There's the same word mistake right above `explict cast` btw. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon closed the pull request at: https://github.com/apache/zeppelin/pull/3206 ---
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
GitHub user HyukjinKwon reopened a pull request: https://github.com/apache/zeppelin/pull/3206 [WIP][ZEPPELIN-3810] Support Spark 2.4 ### What is this PR for? Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see SPARK-24418). There are two problems with this upgrade on the Zeppelin side: 1. Some methods that were used privately via reflection, for instance `loopPostInit`, became inaccessible. See: - https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala - https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala To work around this, I manually ported `loopPostInit` from 2.11.8 to retain the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 2.11.12 are used inside the new `loopPostInit` via reflection. 2. Upgrading from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. Otherwise, we will hit: ``` Caused by: java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139) ``` To work around this, I upgraded jline from `2.12.1` to `2.14.3`. ### What type of PR is it? [Improvement] ### Todos * [ ] - Wait until Spark 2.4.0 is officially released. ### What is the Jira issue? * https://issues.apache.org/jira/browse/ZEPPELIN-3810 ### How should this be tested? Verified manually against Spark 2.4.0 RC3 ### Questions: * Do the license files need an update? Yes * Are there breaking changes for older versions? No * Does this need documentation? No You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zeppelin/pull/3206.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3206 commit e2d224aadf0e54533837a9a89f8e9d3586aee3a9 Author: hyukjinkwon Date: 2018-10-17T14:41:29Z Support Spark 2.4 ---
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21990 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 oops. I haven't. Will check that too while I am here. BTW, my understanding is that we need this one as well since Spark can still be compiled against Scala 2.11.x; am I on the right track? ---
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 This is a WIP. We should wait for Spark 2.4.0. cc @zjffdu and @felixcheung ---
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/zeppelin/pull/3206 [WIP][ZEPPELIN-3810] Support Spark 2.4 ### What is this PR for? Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see SPARK-24418). There are two problems with this upgrade on the Zeppelin side: 1. Some methods that were used privately via reflection, for instance `loopPostInit`, became inaccessible. See: - https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala - https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala To work around this, I manually ported `loopPostInit` from 2.11.8 to retain the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 2.11.12 are used inside the new `loopPostInit` via reflection. 2. Upgrading from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. Otherwise, we will hit: ``` Caused by: java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139) ``` To work around this, I upgraded jline from `2.12.1` to `2.14.3`. ### What type of PR is it? [Improvement] ### Todos * [ ] - Wait until Spark 2.4.0 is officially released. ### What is the Jira issue? * https://issues.apache.org/jira/browse/ZEPPELIN-3810 ### How should this be tested? Verified manually against Spark 2.4.0 RC3 ### Questions: * Do the license files need an update? Yes * Are there breaking changes for older versions? No * Does this need documentation? No You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zeppelin/pull/3206.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3206 commit e2d224aadf0e54533837a9a89f8e9d3586aee3a9 Author: hyukjinkwon Date: 2018-10-17T14:41:29Z Support Spark 2.4 ---
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20503 Looks good otherwise. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20503#discussion_r225846867 --- Diff: python/pyspark/sql/types.py --- @@ -1581,7 +1581,7 @@ def __repr__(self): return "Row(%s)" % ", ".join("%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self))) else: -return "" % ", ".join(self) +return "" % ", ".join("%s" % (fields) for fields in self) --- End diff -- `"%s" % (fields) for fields in self` -> `"%s" % field for field in self` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20503#discussion_r225846669 --- Diff: python/pyspark/sql/types.py --- @@ -1581,7 +1581,7 @@ def __repr__(self): return "Row(%s)" % ", ".join("%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self))) else: -return "" % ", ".join(self) +return "" % ", ".join("%s" % (fields) for fields in self) --- End diff -- nit fields => field --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20503#discussion_r225846584 --- Diff: python/pyspark/sql/types.py --- @@ -1581,7 +1581,7 @@ def __repr__(self): return "Row(%s)" % ", ".join("%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self))) else: -return "" % ", ".join(self) +return "" % ", ".join("%s" % (fields) for fields in self) --- End diff -- nit `"%s" % (fields) for fields in self` -> `"%s" % fields for fields in self` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20503#discussion_r225846381 --- Diff: python/pyspark/sql/tests.py --- @@ -234,6 +234,10 @@ def test_empty_row(self): row = Row() self.assertEqual(len(row), 0) +def test_row_without_column_name(self): +row = Row("Alice", 11) +self.assertEqual(row.__repr__(), "") --- End diff -- I would test non-ascii compatible characters as well --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20503#discussion_r225844655 --- Diff: python/pyspark/sql/tests.py --- @@ -234,6 +234,10 @@ def test_empty_row(self): row = Row() self.assertEqual(len(row), 0) +def test_row_without_column_name(self): +row = Row("Alice", 11) --- End diff -- Can we add a doctest for this usage (Row as objects not as a namedtuple class), and documentation in `Row` at `types.py`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
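A small sketch of the behaviour being reviewed, assuming the patched `__repr__` formats each unnamed value with `%s` so that non-string values no longer break the join:

```python
from pyspark.sql import Row

# Row used as a namedtuple-like value with field names.
named = Row(age=11, name='Alice')
print(repr(named))    # Row(age=11, name='Alice')

# Row used as a plain object without field names. Before the fix,
# repr() raised TypeError because ", ".join(self) received a non-string (11);
# with the fix it should render something like <Row(Alice, 11)>
# (exact quoting may differ by version).
unnamed = Row("Alice", 11)
print(repr(unnamed))
```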
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r225820193 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { + case (cls: Class[_], struct: StructType) => --- End diff -- Hm, how about we fix them together while we are here? I also checked another difference, which is beans without a getter and/or setter, but I think this is something we should fix in 3.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r225819586 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { --- End diff -- Yea .. was just thinking of moving this func there .. it looks ugly that this file is getting long. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225819012 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- Actually either way looks okay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225818067 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- Eh .. I think it's okay to have a function that returns the updated extensions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22694: [SQL][CATALYST][MINOR] update some error comments
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22694 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22503 @justinuang, okay. Mind rebasing this please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22295 @huaxingao, thanks for addressing comments. Would you mind rebasing it and resolving the conflicts? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 @rxin and @gatorsmile, WDYT? I already had to argue about Hadoop 3 support here and there (for instance see [SPARK-18112|https://issues.apache.org/jira/browse/SPARK-18112] and [SPARK-18673|https://issues.apache.org/jira/browse/SPARK-18673]), and explain what's going on. Ideally, it looks like we should go ahead with option 2 (https://github.com/apache/spark/pull/21588#issuecomment-429272279) if I am not mistaken. If there are some more concerns we should address before going ahead, I am definitely willing to help investigate as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 Woah .. let me resolve the conflicts tonight. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 Thanks all! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy issue #122: [LIVY-529][DOCS] Fix supported Python version fro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/122 +1 ---
[GitHub] incubator-livy issue #123: [MINOR] There is some typo in PULL_REQUEST_TEMPLA...
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/123 +1 ---
[GitHub] spark issue #22730: [SPARK-16775][CORE] Remove deprecated accumulator v1 API...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22730 yup, PySpark uses accumulator V2 per https://github.com/apache/spark/commit/90d5754212425d55f992c939a2bc7d9ac6ef92b8 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
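For context, the user-facing accumulator usage in PySpark looks roughly like this (per the commit referenced above, the JVM side backing it is the V2 API):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)  # driver-side accumulator with zero value 0

def count_item(x):
    acc.add(1)  # updates happen on executors and are merged back on the driver

sc.parallelize(range(10)).foreach(count_item)
print(acc.value)  # 10

spark.stop()
```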
[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22597 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22694: [SQL][CATALYST][MINOR] update some error comments
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22694 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22597 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22728: [SPARK-25736][SQL][TEST] add tests to verify the behavio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22728 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22597 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22636: [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22636 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r225393373 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala --- @@ -39,29 +40,29 @@ import org.apache.spark.sql.types.DataType * @param nullable True if the UDF can return null value. * @param udfDeterministic True if the UDF is deterministic. Deterministic UDF returns same result * each time it is invoked with a particular input. - * @param nullableTypes which of the inputTypes are nullable (i.e. not primitive) */ case class ScalaUDF( function: AnyRef, dataType: DataType, children: Seq[Expression], +handleNullForInputs: Seq[Boolean], --- End diff -- Adding `handleNullForInputs` doesn't look like it reduces confusion a lot to me. Since this PR targets only refactoring, mainly to reduce confusion and improve ease of use, this concern should be addressed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22732#discussion_r225393220 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala --- @@ -39,29 +40,29 @@ import org.apache.spark.sql.types.DataType * @param nullable True if the UDF can return null value. * @param udfDeterministic True if the UDF is deterministic. Deterministic UDF returns same result * each time it is invoked with a particular input. - * @param nullableTypes which of the inputTypes are nullable (i.e. not primitive) */ case class ScalaUDF( function: AnyRef, dataType: DataType, children: Seq[Expression], +handleNullForInputs: Seq[Boolean], --- End diff -- Maybe I missed something, but: 1. Why don't we just merge `handleNullForInputs` and `inputTypes`? 2. Why is `handleNullForInputs` required whereas `inputTypes`'s default is `Nil`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
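To make point 1 above concrete, here is a hypothetical sketch of carrying the null-handling flag next to each declared input type instead of in a separate `Seq[Boolean]`. The `UdfInput` container is invented for illustration and is not the actual `ScalaUDF` constructor.
```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Hypothetical container only: one input type plus its null-handling flag.
final case class UdfInput(dataType: DataType, handleNull: Boolean)

object ScalaUdfInputsSketch {
  // A UDF over (Int, String) where only the primitive Int needs null handling;
  // merging the two sequences keeps the flags and the types from drifting apart.
  val inputs: Seq[UdfInput] = Seq(
    UdfInput(IntegerType, handleNull = true),
    UdfInput(StringType, handleNull = false)
  )
}
```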
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 ping @wangyum, if you're willing to make progress on this, please provide some input here and/or in the JIRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r225195159 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.csv._ +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * Converts a CSV input string to a [[StructType]] with the specified schema. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with the given `csvStr` and `schema`.", + examples = """ +Examples: + > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE'); + {"a":1, "b":0.8} + > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/')) + {"time":2015-08-26 00:00:00.0} + """, + since = "3.0.0") +// scalastyle:on line.size.limit +case class CsvToStructs( +schema: StructType, +options: Map[String, String], +child: Expression, +timeZoneId: Option[String] = None) + extends UnaryExpression +with TimeZoneAwareExpression +with CodegenFallback +with ExpectsInputTypes +with NullIntolerant { + + override def nullable: Boolean = child.nullable + + // The CSV input data might be missing certain fields. We force the nullability + // of the user-provided schema to avoid data corruptions. + val nullableSchema: StructType = schema.asNullable + + // Used in `FunctionRegistry` + def this(child: Expression, schema: Expression, options: Map[String, String]) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = options, + child = child, + timeZoneId = None) + + def this(child: Expression, schema: Expression) = this(child, schema, Map.empty[String, String]) + + def this(child: Expression, schema: Expression, options: Expression) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = ExprUtils.convertToMapData(options), + child = child, + timeZoneId = None) + + // This converts parsed rows to the desired output by the given schema. + @transient + lazy val converter = (rows: Iterator[InternalRow]) => { +if (rows.hasNext) { + rows.next() --- End diff -- of course! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r225194172 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- This also particularly happens when the session that created the Hive client is restarted with Hive support enabled multiple times(?). This was reproducible locally as well. I am still digging into the root cause, but so far the issue seems limited to this test only. ---
[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r225191745 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- I am not very sure about that, but that's what's going on here in 2.4.0 - I double-checked locally and on Travis CI. ---
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 Sorry, I missed that comment. I have replied there. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r225190659 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -777,7 +777,6 @@ case class SchemaOfJson( } object JsonExprUtils { - def evalSchemaExpr(exp: Expression): DataType = exp match { --- End diff -- I was following @MaxGekk's opinion (https://github.com/apache/spark/pull/22379/files/93d094f45b02afc0ab2f0650bbde1513823471a2#r224846183). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r225190224 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.csv._ +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * Converts a CSV input string to a [[StructType]] with the specified schema. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with the given `csvStr` and `schema`.", + examples = """ +Examples: + > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE'); + {"a":1, "b":0.8} + > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/')) + {"time":2015-08-26 00:00:00.0} + """, + since = "3.0.0") +// scalastyle:on line.size.limit +case class CsvToStructs( +schema: StructType, +options: Map[String, String], +child: Expression, +timeZoneId: Option[String] = None) + extends UnaryExpression +with TimeZoneAwareExpression +with CodegenFallback +with ExpectsInputTypes +with NullIntolerant { + + override def nullable: Boolean = child.nullable + + // The CSV input data might be missing certain fields. We force the nullability + // of the user-provided schema to avoid data corruptions. + val nullableSchema: StructType = schema.asNullable + + // Used in `FunctionRegistry` + def this(child: Expression, schema: Expression, options: Map[String, String]) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = options, + child = child, + timeZoneId = None) + + def this(child: Expression, schema: Expression) = this(child, schema, Map.empty[String, String]) + + def this(child: Expression, schema: Expression, options: Expression) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = ExprUtils.convertToMapData(options), + child = child, + timeZoneId = None) + + // This converts parsed rows to the desired output by the given schema. + @transient + lazy val converter = (rows: Iterator[InternalRow]) => { +if (rows.hasNext) { + rows.next() --- End diff -- Looks like there can't be more than one row in this CSV code path specifically.
See below:
```scala
val rawParser = new UnivocityParser(actualSchema, actualSchema, parsedOptions)
new FailureSafeParser[String](
  input => Seq(rawParser.parse(input)),
  mode,
  nullableSchema,
  parsedOptions.columnNameOfCorruptRecord,
  parsedOptions.multiLine)
```
Univocity parser:
```scala
def parse(input: String): InternalRow = convert(tokenizer.parseLine(input))
```
and in the `FailureSafeParser`
```scala
class FailureSafeParser[IN](
    rawParser: IN => Seq[InternalRow],
    mode: ParseMode,
    schema: StructType,
    columnNameOfCorruptRecord: String,
    isMultiLine: Boolean) {
```
```scala
def parse(input: IN): Iterator[InternalRow] = {
  try {
    if (skipParsing) {
      Iterator.single(InternalRow.empty)
    } else {
      rawParser.apply(input).toIterator.map(row =>
```
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r225186492 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.csv._ +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * Converts a CSV input string to a [[StructType]] with the specified schema. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with the given `csvStr` and `schema`.", + examples = """ +Examples: + > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE'); + {"a":1, "b":0.8} + > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/')) + {"time":2015-08-26 00:00:00.0} + """, + since = "3.0.0") +// scalastyle:on line.size.limit +case class CsvToStructs( +schema: StructType, +options: Map[String, String], +child: Expression, +timeZoneId: Option[String] = None) + extends UnaryExpression +with TimeZoneAwareExpression +with CodegenFallback +with ExpectsInputTypes +with NullIntolerant { + + override def nullable: Boolean = child.nullable + + // The CSV input data might be missing certain fields. We force the nullability + // of the user-provided schema to avoid data corruptions. + val nullableSchema: StructType = schema.asNullable + + // Used in `FunctionRegistry` + def this(child: Expression, schema: Expression, options: Map[String, String]) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = options, + child = child, + timeZoneId = None) + + def this(child: Expression, schema: Expression) = this(child, schema, Map.empty[String, String]) + + def this(child: Expression, schema: Expression, options: Expression) = +this( + schema = ExprUtils.evalSchemaExpr(schema), + options = ExprUtils.convertToMapData(options), + child = child, + timeZoneId = None) + + // This converts parsed rows to the desired output by the given schema. + @transient + lazy val converter = (rows: Iterator[InternalRow]) => { +if (rows.hasNext) { + rows.next() --- End diff -- Oops, I missed. Let me check --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 I will update after 2.4.0 is officially released. ---
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 cc @vanzin, @mgaido91 (you're already here tho :-) ) and @jerryshao. ---
[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r225185417 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- I am running Travis CI against RC3 (https://travis-ci.org/HyukjinKwon/incubator-livy/builds/441687251) ---
[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r225185258 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- I am not 100% sure why the test fails. The root cause seems to be that multiple Hive clients(?) access Derby and the metastore because the Hive client is created multiple times(?). I spent a whole day but failed to find out why, and just decided to fix the test since the problem looks specific to this test's scope so far. So, I fixed the test `testSparkSQLJob` to use Spark only and `testHiveJob` to use Hive support. This reminds me of the current issue in SparkR where the tests can't start and stop multiple Hive-support sessions (it ended up with one Hive-support session). ---
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 While addressing the review comments, I also reviewed the change at the same time. It looks pretty good to go. For https://github.com/apache/spark/pull/22379#issuecomment-429738031, I guess we can add the string one later on the Scala side .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 I wouldn't either, but there's no way to use `schema_of_csv` otherwise .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 Like ..
```
val csvData = "a,1,2"
from_csv(column, schema = schema_of_csv(lit(csvData)))
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r225030219 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -3854,6 +3854,38 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + /** + * Parses a column containing a CSV string into a `StructType` with the specified schema. + * Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { +CsvToStructs(schema, options, e.expr) + } + + /** + * (Java-specific) Parses a column containing a CSV string into a `StructType` + * with the specified schema. Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column = { --- End diff -- Let me address this one tonight. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
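For reference, a minimal usage sketch of the `StructType` overload quoted in the diff above, assuming a local SparkSession and the API as proposed in this PR:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

object FromCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from_csv-sketch").getOrCreate()
    import spark.implicits._

    val schema = new StructType().add("a", IntegerType).add("b", DoubleType)
    val df = Seq("1,0.8").toDF("csv")

    // Parse the CSV string column into a struct with the given schema.
    df.select(from_csv($"csv", schema, Map.empty[String, String]).as("parsed"))
      .show(false)

    spark.stop()
  }
}
```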
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 2.4 is not yet officially released. I am manually verifying this locally. ---
[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/incubator-livy/pull/121 [WIP][LIVY-518][BUILD] Support Spark 2.4 ## What changes were proposed in this pull request? This PR proposes to support Spark 2.4 in Livy. ## How was this patch tested? Unit tests and manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/incubator-livy spark24-support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-livy/pull/121.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #121 commit 7393630fed6b5f95ef053a1f31bf7d233dd6c6fe Author: hyukjinkwon Date: 2018-10-15T03:08:18Z [LIVY-518][BUILD] Support Spark 2.4 ---
[GitHub] spark pull request #22697: [SPARK-25700][SQL][BRANCH-2.4] Partially revert a...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/22697 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22597#discussion_r225025622 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext { )).get.toString } } + + test("SPARK-25579 ORC PPD should support column names with dot") { +import testImplicits._ + +withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { + withTempDir { dir => +val path = new File(dir, "orc").getCanonicalPath +Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path) +val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2") +checkAnswer(stripSparkFilter(df), Row(1, 2)) --- End diff -- Yup. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22597#discussion_r225024900 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext { )).get.toString } } + + test("SPARK-25579 ORC PPD should support column names with dot") { +import testImplicits._ + +withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { + withTempDir { dir => +val path = new File(dir, "orc").getCanonicalPath +Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path) --- End diff -- How about explicitly repartitioning to make separate output files? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
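A rough sketch of the repartition suggestion above, written as a standalone run rather than the quoted test suite; the local SparkSession and temp path here are assumptions for illustration, not part of `OrcFilterSuite`.
```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object OrcDotColumnPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("orc-ppd-dot").getOrCreate()
    import spark.implicits._

    val path = Files.createTempDirectory("spark-orc-").resolve("orc").toString

    // Repartition before writing so the two rows land in separate ORC files,
    // which makes the per-file filter-pushdown behaviour easier to observe.
    Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2")
      .repartition(2)
      .write.orc(path)

    spark.read.orc(path)
      .where("`col.dot.1` = 1 and `col.dot.2` = 2")
      .show()

    spark.stop()
  }
}
```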
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r225021394 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { + case (cls: Class[_], struct: StructType) => --- End diff -- wait .. can we reuse `JavaTypeInference.serializerFor` and make a projection, rather than reimplementing the whole logic here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r225021732 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { + case (cls: Class[_], struct: StructType) => --- End diff -- https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L131-L132 We should drop the support for getter-only or setter-only properties. Adding @cloud-fan here as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r225016843 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { --- End diff -- BTW, how about we put this method in `CatalystTypeConverters`? It looks like a Catalyst converter for beans. A few Java types like `java.lang.Iterable`, `java.math.BigDecimal` and `java.math.BigInteger` are already handled there. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22527: [SPARK-17952][SQL] Nested Java beans support in createDa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22527 Looks good to me too! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org