[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023439#comment-17023439 ] Mathew Wicks commented on SPARK-28921: -- [~dongjoon], it's simply bad practice not to update all jars that depend on each other, so I never tried updating only one. However, I also recall people on other threads about this issue reporting errors when they updated only one. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
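The "Expected HTTP 101 response" error in the stack trace above is okhttp's websocket-upgrade check failing when the hardened API server answers 403 instead of 101. The following is only a minimal Python sketch of that check (the function and exception names here are illustrative, not okhttp's actual API):

```python
class ProtocolException(Exception):
    """Stand-in for java.net.ProtocolException in this sketch."""
    pass

def check_upgrade_response(status_code, reason):
    """A websocket handshake must be answered with HTTP 101 Switching
    Protocols. Any other status (here: the API server's 403 after the
    CVE-2019-14809 hardening) aborts the executor-pod watch."""
    if status_code != 101:
        raise ProtocolException(
            f"Expected HTTP 101 response but was '{status_code} {reason}'")
    return True
```

With a 403 response, this raises the same message seen in the Spark driver log, which is why the executor-pod watch never starts.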
[jira] [Updated] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25330: -- Fix Version/s: 2.3.2 > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
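A detail worth noticing in the stack trace above: the denied inode path is URL-encoded. Decoding it shows that the Hive scratch directory contains a literal, unsubstituted `${user.name}` property, which suggests why the two proxy users collide under the same directory (a small illustrative snippet, not part of the fix):

```python
from urllib.parse import unquote

# The inode path exactly as it appears in the AccessControlException.
encoded = "/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4"
decoded = unquote(encoded)

# %7B and %7D decode to { and }: the parent directory is literally
# "/tmp/hive-${user.name}" rather than a per-user resolved path.
print(decoded)  # /tmp/hive-${user.name}/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4
```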
[jira] [Assigned] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25330: - Assignee: Yuming Wang > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25330: -- Fix Version/s: 2.4.0 > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25330. --- Resolution: Fixed > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29721: -- Summary: Spark SQL reads unnecessary nested fields after using explode (was: Spark SQL reads unnecessary nested fields from Parquet after using explode) > Spark SQL reads unnecessary nested fields after using explode > - > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Kai Kang >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure are still fetched from the data > source. > We are working on a project to create a Parquet store for a big pre-joined > table between two tables that have a one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue.
> Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct<items:array<struct<itemId:bigint>>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId and itemData > // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>> > read.select(explode($"items.itemId")).explain(true) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
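Conceptually, nested schema pruning keeps only the requested leaf paths of a nested read schema. The following toy Python model (not Spark's implementation; names and the dict-based schema representation are invented for illustration) shows what the pruned vs. unpruned ReadSchemas above correspond to:

```python
def prune_schema(schema, keep_paths):
    """Keep only the requested dotted leaf paths of a nested schema.

    `schema` maps field name -> type: a string for a leaf, a nested
    dict for a struct. This is a toy model of the pruning Spark's
    optimizer computes, not Spark code.
    """
    pruned = {}
    for path in keep_paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        if rest:
            sub = prune_schema(schema[head], [rest])
            pruned.setdefault(head, {}).update(sub)
        else:
            # A bare field name keeps the whole subtree.
            pruned[head] = schema[head]
    return pruned

# Schema of the example table: items is an array of structs.
schema = {"items": {"itemId": "bigint", "itemData": "string"}}
print(prune_schema(schema, ["items.itemId"]))  # {'items': {'itemId': 'bigint'}}
```

The bug is that the explode() query behaves as if `["items"]` had been requested, pulling in itemData even though only itemId is used.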
[jira] [Resolved] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29721. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26978 [https://github.com/apache/spark/pull/26978] > Spark SQL reads unnecessary nested fields from Parquet after using explode > -- > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Kai Kang >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure are still fetched from the data > source. > We are working on a project to create a Parquet store for a big pre-joined > table between two tables that have a one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue.
> Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct<items:array<struct<itemId:bigint>>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId and itemData > // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>> > read.select(explode($"items.itemId")).explain(true) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29721: - Assignee: L. C. Hsieh > Spark SQL reads unnecessary nested fields from Parquet after using explode > -- > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Kai Kang >Assignee: L. C. Hsieh >Priority: Major > > This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure are still fetched from the data > source. > We are working on a project to create a Parquet store for a big pre-joined > table between two tables that have a one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue. > Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct<items:array<struct<itemId:bigint>>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId and itemData > // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>> > read.select(explode($"items.itemId")).explain(true) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30617. - > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog that retrieves metadata from > multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so > that we can run mixed queries between Hive and our online data. > # But because Spark requires that the value of spark.sql.catalogImplementation > be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to > make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
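The request above boils down to relaxing a closed enum check into "built-in name or pluggable class name". The following Python sketch illustrates that hypothetical relaxation; the function name, the tuple return shape, and `com.example.EsCatalog` are all invented for illustration, not Spark behavior:

```python
BUILTIN_CATALOGS = {"in-memory", "hive"}  # the values Spark currently accepts

def resolve_catalog(value):
    """Sketch of the requested relaxation: accept the built-in names as
    before, but treat any other dotted value as a fully qualified
    ExternalCatalog class name to load instead of rejecting it."""
    if value in BUILTIN_CATALOGS:
        return ("builtin", value)
    if "." in value:  # looks like a class name, e.g. com.example.EsCatalog
        return ("custom", value)
    raise ValueError(
        "spark.sql.catalogImplementation must be one of "
        f"{sorted(BUILTIN_CATALOGS)} or a fully qualified class name")
```

With this shape, `resolve_catalog("hive")` keeps today's behavior while a custom class name passes through to class loading.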
[jira] [Resolved] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30617. --- Resolution: Duplicate > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog that retrieves metadata from > multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so > that we can run mixed queries between Hive and our online data. > # But because Spark requires that the value of spark.sql.catalogImplementation > be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to > make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-30617: --- > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog that retrieves metadata from > multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so > that we can run mixed queries between Hive and our online data. > # But because Spark requires that the value of spark.sql.catalogImplementation > be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to > make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weiwenda resolved SPARK-30617. -- Resolution: Pending Closed > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog that retrieves metadata from > multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so > that we can run mixed queries between Hive and our online data. > # But because Spark requires that the value of spark.sql.catalogImplementation > be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to > make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023354#comment-17023354 ] Dongjoon Hyun commented on SPARK-28900: --- Thank you for the update. > Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant for this purpose. > In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step.
> export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > else > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? 
> fi > if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then > if [[ $retcode1 -ne 0 ]]; then > echo "Packaging Spark with Maven failed" > fi > if [[ $retcode2 -ne 0 ]]; then > echo "Testing Spark with Maven failed" > fi > exit 1 > fi > {code} > The PR builder (one of them at least) looks like: > {code} > #!/bin/bash > set -e # fail on any non-zero exit code > set -x > export AMPLAB_JENKINS=1 > export PATH="$PATH:/home/anaconda/envs/py3k/bin" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > echo "fixing target dir permissions" > chmod -R +w target/* || true # stupid hack by sknapp to ensure that the > chmod always exits w/0 and doesn't bork the script > echo "running git clean -fdx" > git clean -fdx > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT
[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file
[ https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023340#comment-17023340 ] Reynold Xin commented on SPARK-29189: - This is great, but how would users know when to set this? Shouldn't we do a slight incremental improvement to just automatically detect the common object stores and disable the locality check? > Add an option to ignore block locations when listing file > - > > Key: SPARK-29189 > URL: https://issues.apache.org/jira/browse/SPARK-29189 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wang, Gang >Assignee: Wang, Gang >Priority: Major > Fix For: 3.0.0 > > > In our PROD env we have a pure Spark cluster (I think this is also pretty > common) where computation is separated from the storage layer. In such a > deploy mode, data locality is never reachable. > There are some configurations in the Spark scheduler to reduce the waiting > time for data locality (e.g. "spark.locality.wait"). The problem is that, in > the file-listing phase, the location information of all the files, with all > the blocks inside each file, is fetched from the distributed file system. > In a PROD environment a table can be so huge that even fetching all this > location information can take tens of seconds. > To improve this scenario, Spark should provide an option to ignore data > locality entirely: all we need in the file-listing phase are the file > locations, without any block location information. > > We ran a benchmark in our PROD env; after ignoring the block locations we > got a huge improvement.
> ||Table Size||Total File Number||Total Block Number||List File Duration (With > Block Location)||List File Duration (Without Block Location)|| > |22.6 T|3|12|16.841s|1.730s| > |28.8 T|42001|148964|10.099s|2.858s| > |3.4 T|2|2|5.833s|4.881s| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
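Reading the benchmark above as speedup factors makes the improvement concrete (durations taken directly from the table):

```python
# (table size, listing duration with block locations, without) from the
# benchmark table above, in seconds.
rows = [
    ("22.6 T", 16.841, 1.730),
    ("28.8 T", 10.099, 2.858),
    ("3.4 T", 5.833, 4.881),
]
for size, with_loc, without_loc in rows:
    # e.g. the largest table lists ~9.7x faster without block locations
    print(f"{size}: {with_loc / without_loc:.1f}x faster")
```

The gain shrinks as the file/block count drops, which supports the point that the cost being avoided is fetching per-block location information.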
[jira] [Created] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps
Bryan Cutler created SPARK-30640: Summary: Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps Key: SPARK-30640 URL: https://issues.apache.org/jira/browse/SPARK-30640 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.4.4 Reporter: Bryan Cutler During conversion of Arrow to Pandas, timestamp columns are modified to localize for the current timezone. If there are no timestamp columns, this can sometimes result in unnecessary copies of the data. See [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
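The fix being described is a copy-only-when-needed pattern: only timestamp columns require rewriting for the session timezone, so every other column should pass through untouched. This toy Python model illustrates the idea (the function, the dict-of-columns representation, and the timezone handling are all invented for illustration, not PySpark's or Arrow's internals):

```python
def localize_columns(columns, tz="America/Los_Angeles"):
    """Toy model of the conversion step: rewrite only timestamp columns
    for the target timezone; pass every other column through as the
    same object, with zero copies."""
    out = {}
    for name, (dtype, values) in columns.items():
        if dtype == "timestamp":
            # Rewriting implies copying just this one column.
            out[name] = (dtype, [f"{v}@{tz}" for v in values])
        else:
            out[name] = (dtype, values)  # same list object, no copy
    return out

cols = {"id": ("long", [1, 2]), "ts": ("timestamp", ["2020-01-26 00:00"])}
res = localize_columns(cols)
assert res["id"][1] is cols["id"][1]  # non-timestamp column was not copied
```

A frame with no timestamp columns at all would pass through this step without any copies, which is the behavior the issue asks for.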
[jira] [Commented] (SPARK-27117) current_date/current_timestamp should be reserved keywords in ansi parser mode
[ https://issues.apache.org/jira/browse/SPARK-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023317#comment-17023317 ] Reynold Xin commented on SPARK-27117: - I changed the title to make it more clear to end users what's happening. > current_date/current_timestamp should be reserved keywords in ansi parser mode > -- > > Key: SPARK-27117 > URL: https://issues.apache.org/jira/browse/SPARK-27117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27117) current_date/current_timestamp should be reserved keywords in ansi parser mode
[ https://issues.apache.org/jira/browse/SPARK-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-27117: Summary: current_date/current_timestamp should be reserved keywords in ansi parser mode (was: current_date/current_timestamp should not refer to columns with ansi parser mode) > current_date/current_timestamp should be reserved keywords in ansi parser mode > -- > > Key: SPARK-27117 > URL: https://issues.apache.org/jira/browse/SPARK-27117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0
[ https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25382: - Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format("image").load(path)` instead. (was: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format(\"image\").load(path)` instead.) > Remove ImageSchema.readImages in 3.0 > > > Key: SPARK-25382 > URL: https://issues.apache.org/jira/browse/SPARK-25382 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > A follow-up task from SPARK-25345. We might need to support sampling > (SPARK-25383) in order to remove readImages. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0
[ https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25382: - Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format(\"image\").load(path)` instead. (was: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed.) > Remove ImageSchema.readImages in 3.0 > > > Key: SPARK-25382 > URL: https://issues.apache.org/jira/browse/SPARK-25382 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > A follow-up task from SPARK-25345. We might need to support sampling > (SPARK-25383) in order to remove readImages. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30639) Upgrade Jersey to 2.30
[ https://issues.apache.org/jira/browse/SPARK-30639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30639: -- Parent: SPARK-29194 Issue Type: Sub-task (was: Improvement) > Upgrade Jersey to 2.30 > -- > > Key: SPARK-30639 > URL: https://issues.apache.org/jira/browse/SPARK-30639 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30639) Upgrade Jersey to 2.30
Dongjoon Hyun created SPARK-30639: - Summary: Upgrade Jersey to 2.30 Key: SPARK-30639 URL: https://issues.apache.org/jira/browse/SPARK-30639 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30638) add resources as parameter to the PluginContext
Thomas Graves created SPARK-30638: - Summary: add resources as parameter to the PluginContext Key: SPARK-30638 URL: https://issues.apache.org/jira/browse/SPARK-30638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves Add the allocated resources and ResourceProfile as parameters to the PluginContext so that any plugin in the driver or executors can use this information, for instance to initialize devices. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30632) to_timestamp() doesn't work with certain timezones
[ https://issues.apache.org/jira/browse/SPARK-30632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023263#comment-17023263 ] Maxim Gekk edited comment on SPARK-30632 at 1/24/20 9:12 PM: - Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. Unfortunately, the class doesn't support time zone IDs like "America/Los_Angeles", see [https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles] . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO. One of the reasons is that this requires changing the calendar system to the Proleptic Gregorian calendar, see https://issues.apache.org/jira/browse/SPARK-26651 was (Author: maxgekk): Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. Unfortunately, the class doesn't support time zone IDs like "America/Los_Angeles", see [https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles] . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO. > to_timestamp() doesn't work with certain timezones > -- > > Key: SPARK-30632 > URL: https://issues.apache.org/jira/browse/SPARK-30632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.4 >Reporter: Anton Daitche >Priority: Major > > It seems that to_timestamp() doesn't work with timezones of the form area/city, e.g. America/Los_Angeles. 
> The code > {code:scala} > val df = Seq( > ("2019-01-24 11:30:00.123", "America/Los_Angeles"), > ("2020-01-01 01:30:00.123", "PST") > ).toDF("ts_str", "tz_name") > val ts_parsed = to_timestamp( > concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z" > ).as("timestamp") > df.select(ts_parsed).show(false) > {code} > prints > {code} > +-------------------+ > |timestamp          | > +-------------------+ > |null               | > |2020-01-01 10:30:00| > +-------------------+ > {code} > So, the datetime string with timezone PST is properly parsed, whereas the one > with America/Los_Angeles is converted to null. According to > [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] > response on GitHub, this code works when run on a recent master version. > See also the discussion in > [this|https://github.com/apache/spark/pull/24195#issue] issue for more > context. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30632) to_timestamp() doesn't work with certain timezones
[ https://issues.apache.org/jira/browse/SPARK-30632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023263#comment-17023263 ] Maxim Gekk commented on SPARK-30632: Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. Unfortunately, the class doesn't support time zone IDs like "America/Los_Angeles", see [https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles] . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO. > to_timestamp() doesn't work with certain timezones > -- > > Key: SPARK-30632 > URL: https://issues.apache.org/jira/browse/SPARK-30632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.4 >Reporter: Anton Daitche >Priority: Major > > It seems that to_timestamp() doesn't work with timezones of the form area/city, e.g. America/Los_Angeles. > The code > {code:scala} > val df = Seq( > ("2019-01-24 11:30:00.123", "America/Los_Angeles"), > ("2020-01-01 01:30:00.123", "PST") > ).toDF("ts_str", "tz_name") > val ts_parsed = to_timestamp( > concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z" > ).as("timestamp") > df.select(ts_parsed).show(false) > {code} > prints > {code} > +-------------------+ > |timestamp          | > +-------------------+ > |null               | > |2020-01-01 10:30:00| > +-------------------+ > {code} > So, the datetime string with timezone PST is properly parsed, whereas the one > with America/Los_Angeles is converted to null. According to > [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] > response on GitHub, this code works when run on a recent master version. > See also the discussion in > [this|https://github.com/apache/spark/pull/24195#issue] issue for more > context. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
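The SimpleDateFormat-vs-DateTimeFormatter behavior described in the comment above can be reproduced without Spark. Below is a minimal JDK-only sketch (the class name and structure are illustrative, not Spark code); the pattern restores the `yyyy` prefix that the issue text lost to Jira formatting:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ZoneParseDemo {
    // Pattern from the Jira reproduction, with the "yyyy" that the issue text lost.
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss.SSS z";

    // Spark 3.0 path: java.time. The 'z' parser accepts textual zone names,
    // region-based zone IDs like "America/Los_Angeles", and offsets.
    static ZoneId parseWithJavaTime(String s) {
        return ZonedDateTime.parse(s, DateTimeFormatter.ofPattern(PATTERN)).getZone();
    }

    // Spark 2.4 path: SimpleDateFormat. Returns whether parsing succeeded;
    // Spark surfaces a parse failure here as a null timestamp.
    static boolean parsesWithSimpleDateFormat(String s) {
        try {
            new SimpleDateFormat(PATTERN).parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String input = "2019-01-24 11:30:00.123 America/Los_Angeles";
        System.out.println("SimpleDateFormat parses: " + parsesWithSimpleDateFormat(input));
        System.out.println("DateTimeFormatter zone:  " + parseWithJavaTime(input));
    }
}
```

Running this on a stock JDK shows the java.time parser recovering the full zone ID while the legacy parser typically rejects it, which matches the null-vs-parsed contrast in the issue's table.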
[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023250#comment-17023250 ] Shane Knapp commented on SPARK-28900: - FYI: i will be OOO next week (mon-thur) with VERY limited availability until friday (when i need to create the branch-3.0 jobs). and now to revisit this issue since it's been filed and address some points initially raised: {code:java} Narrowly, my suggestion is: * Make the master Maven-based builds use dev/run-tests too, so that Pyspark tests are run. It's meant to support this, if AMPLAB_JENKINS_BUILD_TOOL is set to "maven". I'm not sure if we've tested this, then, if it's not used. We may need new Jenkins jobs to make sure it works. * Leave the Spark 2.x builds as-is as 'legacy'. {code} re maven and dev/run-tests: this will be super easy and i can probably get that done really quickly. would dev/run-tests *replace* the mvn test block in the build script config? re 2.x builds: easy. {code:java} Why also test with SBT? Maven is the build of reference and presumably one test job is enough? if it was because the Maven configs weren't running all the tests, and we can fix that, then are the SBT builds superfluous? Maybe keep one to verify SBT builds still work {code} i still am unsure why we have both, but would be more than happy to delete the SBT builds (esp if we have the maven test run dev/run-tests {code:java} Shouldn't the PR builder look more like the other Jenkins builds? maybe it needs to be special, a bit. But should all of them be using run-tests-jenkins? {code} for the most part, dev/run-tests-jenkins exists for pull request builds and posting results to PRs. it also runs extra linting tests etc and acts mostly as a wrapper for dev/run-tests. i'm nearly certain we can leave this as-is. {code:java} Looks like some cruft in the configs that has built up over time. Can we review/delete some? things like Java 7 home, hard-coding a Maven path. 
Perhaps standardizing on the simpler run-tests invocation does this? {code} i've actually been doing a lot of cleanup in the build configs. i have a ways to go but things are MUCH cleaner. > Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant to be for this purpose. > In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. 
> export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmes
[jira] [Resolved] (SPARK-30630) Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30630. --- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 27330 [https://github.com/apache/spark/pull/27330] > Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0 > - > > Key: SPARK-30630 > URL: https://issues.apache.org/jira/browse/SPARK-30630 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5, 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > Currently, GBT has > {code:java} > /** > * Number of trees in ensemble > */ > @Since("2.0.0") > val getNumTrees: Int = trees.length{code} > and > {code:java} > /** Number of trees in ensemble */ > val numTrees: Int = trees.length{code} > I will deprecate numTrees in 2.4.5 and remove it in 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30630) Deprecate numTrees in GBT
[ https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30630: - Assignee: Huaxin Gao > Deprecate numTrees in GBT > - > > Key: SPARK-30630 > URL: https://issues.apache.org/jira/browse/SPARK-30630 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5, 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Currently, GBT has > {code:java} > /** > * Number of trees in ensemble > */ > @Since("2.0.0") > val getNumTrees: Int = trees.length{code} > and > {code:java} > /** Number of trees in ensemble */ > val numTrees: Int = trees.length{code} > I will deprecate numTrees in 2.4.5 and remove it in 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30630) Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30630: -- Summary: Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0 (was: Deprecate numTrees in GBT) > Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0 > - > > Key: SPARK-30630 > URL: https://issues.apache.org/jira/browse/SPARK-30630 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5, 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Currently, GBT has > {code:java} > /** > * Number of trees in ensemble > */ > @Since("2.0.0") > val getNumTrees: Int = trees.length{code} > and > {code:java} > /** Number of trees in ensemble */ > val numTrees: Int = trees.length{code} > I will deprecate numTrees in 2.4.5 and remove it in 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30626) Add SPARK_APPLICATION_ID into driver pod env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30626: -- Affects Version/s: (was: 2.4.4) > Add SPARK_APPLICATION_ID into driver pod env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jiaxin Shan >Assignee: Jiaxin Shan >Priority: Minor > Fix For: 3.0.0 > > > This should be a minor improvement. > The use case is that we want to look up environment variables, create an > application folder, and redirect driver logs to that folder. Executors > already have this variable, and we want to make the same change to the driver. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30626) Add SPARK_APPLICATION_ID into driver pod env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30626: -- Summary: Add SPARK_APPLICATION_ID into driver pod env (was: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env) > Add SPARK_APPLICATION_ID into driver pod env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Assignee: Jiaxin Shan >Priority: Minor > Fix For: 3.0.0 > > > This should be a minor improvement. > The use case is that we want to look up environment variables, create an > application folder, and redirect driver logs to that folder. Executors > already have this variable, and we want to make the same change to the driver. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30626: - Assignee: Jiaxin Shan > [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Assignee: Jiaxin Shan >Priority: Minor > > This should be a minor improvement. > The use case is that we want to look up environment variables, create an > application folder, and redirect driver logs to that folder. Executors > already have this variable, and we want to make the same change to the driver. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30626. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27347 [https://github.com/apache/spark/pull/27347] > [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Assignee: Jiaxin Shan >Priority: Minor > Fix For: 3.0.0 > > > This should be a minor improvement. > The use case is that we want to look up environment variables, create an > application folder, and redirect driver logs to that folder. Executors > already have this variable, and we want to make the same change to the driver. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
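For a plugin or entrypoint script that wants to use the new environment variable, the log-folder lookup the issue describes might be sketched as follows. This is hypothetical: only the `SPARK_APPLICATION_ID` variable name comes from the issue; the `/var/log/spark` base path and the `unknown-app` fallback are invented for illustration:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class AppLogDir {
    // Resolve a per-application log directory from the env var that the fix
    // adds to the driver pod. Base path and fallback name are made up here.
    static Path logDir(String appId) {
        String id = (appId != null && !appId.isEmpty()) ? appId : "unknown-app";
        return Paths.get("/var/log/spark", id);
    }

    public static void main(String[] args) {
        // Inside the driver pod this is populated by Spark once the fix lands;
        // elsewhere it is null and the fallback applies.
        String appId = System.getenv("SPARK_APPLICATION_ID");
        System.out.println("log dir: " + logDir(appId));
    }
}
```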
[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023224#comment-17023224 ] Dongjoon Hyun commented on SPARK-28900: --- Hi, All. Can we restart this before `branch-3.0` cut because we need to duplicate all `master` Jenkins jobs during cutting branch? > Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant to be for this purpose. > In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. 
> export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > else > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? 
> fi > if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then > if [[ $retcode1 -ne 0 ]]; then > echo "Packaging Spark with Maven failed" > fi > if [[ $retcode2 -ne 0 ]]; then > echo "Testing Spark with Maven failed" > fi > exit 1 > fi > {code} > The PR builder (one of them at least) looks like: > {code} > #!/bin/bash > set -e # fail on any non-zero exit code > set -x > export AMPLAB_JENKINS=1 > export PATH="$PATH:/home/anaconda/envs/py3k/bin" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > echo "fixing target dir permissions" > chmod -R +w target/* || true # stupid hack by sknapp to ensure that the > chmod always exits w/0 and doesn't bork the script > echo "running git clean -fdx" > git clean -fdx > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock
[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version that supports JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28704: -- Target Version/s: 3.1.0 > Test backward compatibility on JDK9+ once we have a version that supports JDK9+ > -- > > Key: SPARK-28704 > URL: https://issues.apache.org/jira/browse/SPARK-28704 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or > later because our previous versions do not support JAVA_9 or later. We > should add it back once we have a version that supports JAVA_9 or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version that supports JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28704: -- Parent: (was: SPARK-29194) Issue Type: Test (was: Sub-task) > Test backward compatibility on JDK9+ once we have a version that supports JDK9+ > -- > > Key: SPARK-28704 > URL: https://issues.apache.org/jira/browse/SPARK-28704 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or > later because our previous versions do not support JAVA_9 or later. We > should add it back once we have a version that supports JAVA_9 or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version that supports JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28704: -- Labels: (was: 3.1.0) > Test backward compatibility on JDK9+ once we have a version that supports JDK9+ > -- > > Key: SPARK-28704 > URL: https://issues.apache.org/jira/browse/SPARK-28704 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or > later because our previous versions do not support JAVA_9 or later. We > should add it back once we have a version that supports JAVA_9 or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version that supports JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28704: -- Labels: 3.1.0 (was: ) > Test backward compatibility on JDK9+ once we have a version that supports JDK9+ > -- > > Key: SPARK-28704 > URL: https://issues.apache.org/jira/browse/SPARK-28704 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: 3.1.0 > > We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or > later because our previous versions do not support JAVA_9 or later. We > should add it back once we have a version that supports JAVA_9 or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29924) Document Arrow requirement in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29924: -- Description: At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is required for the Arrow runtime on JDK9+ environments (was: At least, we need to mention `io.netty.tryReflectionSetAccessible=true` is required for Arrow runtime on JDK9+ environment Also, SparkR's minimum arrow became also 0.15.1 due to Arrow source code incompatibility. We need to update R document like sparkr.md) > Document Arrow requirement in JDK9+ > --- > > Key: SPARK-29924 > URL: https://issues.apache.org/jira/browse/SPARK-29924 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is > required for the Arrow runtime on JDK9+ environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29924) Document Arrow requirement in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29924. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27356 [https://github.com/apache/spark/pull/27356] > Document Arrow requirement in JDK9+ > --- > > Key: SPARK-29924 > URL: https://issues.apache.org/jira/browse/SPARK-29924 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is > required for the Arrow runtime on JDK9+ environments > Also, SparkR's minimum Arrow version also became 0.15.1 due to Arrow source code > incompatibility. We need to update the R documentation, such as sparkr.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29924) Document Arrow requirement in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29924: - Assignee: Dongjoon Hyun > Document Arrow requirement in JDK9+ > --- > > Key: SPARK-29924 > URL: https://issues.apache.org/jira/browse/SPARK-29924 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is > required for the Arrow runtime on JDK9+ environments > Also, SparkR's minimum Arrow version also became 0.15.1 due to Arrow source code > incompatibility. We need to update the R documentation, such as sparkr.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
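The documented requirement is a JVM system property, so it has to reach both the driver and the executor JVMs. A minimal sketch of building the corresponding spark-submit arguments (`spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` are standard Spark settings; the helper function name is ours, purely illustrative):

```python
def arrow_jdk9_confs():
    # Flag named in the issue; needed by Netty/Arrow on JDK 9+ to allow
    # reflective access to direct buffers.
    flag = "-Dio.netty.tryReflectionSetAccessible=true"
    return [
        "--conf", "spark.driver.extraJavaOptions={}".format(flag),
        "--conf", "spark.executor.extraJavaOptions={}".format(flag),
    ]

# These strings would be appended to a spark-submit command line.
print(" ".join(arrow_jdk9_confs()))
```

The same values can equally be set in `spark-defaults.conf`; the point is only that the flag must apply on every JVM that runs Arrow code.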
[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023202#comment-17023202 ] Shane Knapp commented on SPARK-30637: - ok, i was able to easily uninstall 1.0.2 and reinstall 2.0.0 on my staging worker w/o issue. which, i have to admit, makes me really nervous. :)
{noformat}
* installing *source* package ‘testthat’ ...
** package ‘testthat’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c init.c -o init.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c reassign.c -o reassign.o
g++ -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c test-catch.cpp -o test-catch.o
g++ -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c test-example.cpp -o test-example.o
g++ -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c test-runner.cpp -o test-runner.o
g++ -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o testthat.so init.o reassign.o test-catch.o test-example.o test-runner.o -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/testthat/libs
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded
* 
DONE (testthat){noformat} > upgrade testthat on jenkins workers to 2.0.0 > > > Key: SPARK-30637 > URL: https://issues.apache.org/jira/browse/SPARK-30637 > Project: Spark > Issue Type: Sub-task > Components: Build, jenkins, R >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > see: https://issues.apache.org/jira/browse/SPARK-23435 > i will investigate upgrading testthat on my staging worker, and if that goes > smoothly we can upgrade it on all jenkins workers. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
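The compatibility concern raised in the linked SPARK-23435 is that testthat 2.0.0 changed its API, so a test harness has to detect the installed version and call a different entry point. A minimal sketch of that version-gating idea (the runner names here are purely illustrative, not SparkR's actual functions):

```python
def pick_test_runner(testthat_version):
    # testthat 2.x changed its runner API, so branch on the major version.
    major = int(testthat_version.split(".")[0])
    return "run_tests_v2" if major >= 2 else "run_tests_v1"

print(pick_test_runner("2.0.0"))  # version the Jenkins workers move to
print(pick_test_runner("1.0.2"))  # version previously installed
```

Since the Jenkins workers share one R environment across Spark branches, both code paths have to keep working; that is why the comment below asks to confirm 2.0.0 against branch-2.4 before upgrading.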
[jira] [Commented] (SPARK-23435) R tests should support latest testthat
[ https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023194#comment-17023194 ] Shane Knapp commented on SPARK-23435: - since we can't have different R environments for different spark branches, we should confirm that testthat 2.0.0 doesn't break the 2.4 branch before the jenkins workers are upgraded. > R tests should support latest testthat > -- > > Key: SPARK-23435 > URL: https://issues.apache.org/jira/browse/SPARK-23435 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0, 3.0.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > > To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was > released in Dec 2017, and its method has been changed. > In order for our tests to keep working, we need to detect that and call a > different method. > Jenkins is running 1.0.1 though, we need to check if it is going to work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-30637: Parent: SPARK-23435 Issue Type: Sub-task (was: Task) > upgrade testthat on jenkins workers to 2.0.0 > > > Key: SPARK-30637 > URL: https://issues.apache.org/jira/browse/SPARK-30637 > Project: Spark > Issue Type: Sub-task > Components: Build, jenkins, R >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > see: https://issues.apache.org/jira/browse/SPARK-23435 > i will investigate upgrading testthat on my staging worker, and if that goes > smoothly we can upgrade it on all jenkins workers. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0
Shane Knapp created SPARK-30637: --- Summary: upgrade testthat on jenkins workers to 2.0.0 Key: SPARK-30637 URL: https://issues.apache.org/jira/browse/SPARK-30637 Project: Spark Issue Type: Task Components: Build, jenkins, R Affects Versions: 3.0.0 Reporter: Shane Knapp Assignee: Shane Knapp see: https://issues.apache.org/jira/browse/SPARK-23435 i will investigate upgrading testthat on my staging worker, and if that goes smoothly we can upgrade it on all jenkins workers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30636) Unable to add packages on spark-packages.org
[ https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30636: Priority: Critical (was: Blocker) > Unable to add packages on spark-packages.org > > > Key: SPARK-30636 > URL: https://issues.apache.org/jira/browse/SPARK-30636 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.4 >Reporter: Xiao Li >Assignee: Burak Yavuz >Priority: Critical > > Unable to add new packages to spark-packages.org. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30636) Unable to add packages on spark-packages.org
[ https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-30636: --- Assignee: Burak Yavuz > Unable to add packages on spark-packages.org > > > Key: SPARK-30636 > URL: https://issues.apache.org/jira/browse/SPARK-30636 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.4 >Reporter: Xiao Li >Assignee: Burak Yavuz >Priority: Blocker > > Unable to add new packages to spark-packages.org. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30636) Unable to add packages on spark-packages.org
[ https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30636: Affects Version/s: (was: 3.0.0) 2.4.4 > Unable to add packages on spark-packages.org > > > Key: SPARK-30636 > URL: https://issues.apache.org/jira/browse/SPARK-30636 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.4 >Reporter: Xiao Li >Priority: Blocker > > Unable to add new packages to spark-packages.org. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30636) Unable to add packages on spark-packages.org
Xiao Li created SPARK-30636: --- Summary: Unable to add packages on spark-packages.org Key: SPARK-30636 URL: https://issues.apache.org/jira/browse/SPARK-30636 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 3.0.0 Reporter: Xiao Li Unable to add new packages to spark-packages.org. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023118#comment-17023118 ] Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 5:10 PM: No, this is fixed. The following is a similar case. User should do the disambiguation. {code} spark-sql> create table T (a int); Time taken: 0.348 seconds spark-sql> select a from T, T; Error in query: Reference 'a' is ambiguous, could be: default.t.a, default.t.a.; line 1 pos 7 {code} was (Author: dongjoon): No, this is fixed. The following is the same case. User should do the disambiguation. {code} spark-sql> create table T (a int); Time taken: 0.348 seconds spark-sql> select a from T, T; Error in query: Reference 'a' is ambiguous, could be: default.t.a, default.t.a.; line 1 pos 7 {code} > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. 
> {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023118#comment-17023118 ] Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 5:10 PM: No, this is fixed. The following is the same case. User should do the disambiguation. {code} spark-sql> create table T (a int); Time taken: 0.348 seconds spark-sql> select a from T, T; Error in query: Reference 'a' is ambiguous, could be: default.t.a, default.t.a.; line 1 pos 7 {code} was (Author: dongjoon): No, this is fixed. The following is the same case. User should do the disambiguation. {code} spark-sql> create table T (a int); Error in query: Table T already exists.; spark-sql> select a from T, T; Error in query: cannot resolve '`a`' given input columns: [default.t.id, default.t.id]; line 1 pos 7; {code} > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. 
> {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023118#comment-17023118 ] Dongjoon Hyun commented on SPARK-30218: --- No, this is fixed. The following is the same case. User should do the disambiguation. {code} spark-sql> create table T (a int); Error in query: Table T already exists.; spark-sql> select a from T, T; Error in query: cannot resolve '`a`' given input columns: [default.t.id, default.t.id]; line 1 pos 7; {code} > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. 
> {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023036#comment-17023036 ] Wenchen Fan commented on SPARK-30612: - I think the example from [~brkyvz] is right. The column name qualifier should only refer to what is specified in the table name. > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023019#comment-17023019 ] Jeff Evans commented on SPARK-19248: I'm not a Spark maintainer, so can't answer definitively. However, I would guess they won't change the default value. This was deliberately added in 2.0 with a default value of false, and usually breaking changes like this are introduced in new major versions (speaking in general terms). > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
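The behaviour difference above is consistent with the SQL string literal `'( |\.)*'` being unescaped by the parser before it reaches the regex engine: the escaped dot `\.` becomes the match-anything metacharacter `.`, so the pattern deletes the entire input. A quick demonstration with Python's `re` (a different engine than Java's, but the two patterns behave the same way for this input):

```python
import re

s = ".. 5."
# Pattern as intended: the dot is escaped, so only spaces and literal
# dots are stripped and the "5" survives.
print(re.sub(r"( |\.)*", "", s))   # -> 5
# Pattern after the backslash is consumed by string unescaping: "."
# now matches any character, so everything is removed.
print(re.sub(r"( |.)*", "", s))    # -> (empty string)
```

This is why the character-class form `'[ \.]*'` kept working: inside a character class a bare `.` is literal anyway, so losing the backslash is harmless there.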
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022969#comment-17022969 ] Rahul Kumar Challapalli commented on SPARK-30218: - [~dongjoon] I am not sure, but I was pointing out what the OP was asking. Since we don't disambiguate the columns in this case, should we keep this issue open? > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. 
> The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022966#comment-17022966 ] jobit mathew commented on SPARK-30635: -- I will work on this > Document PARTITIONED BY Clause of CREATE statement in SQL Reference > > > Key: SPARK-30635 > URL: https://issues.apache.org/jira/browse/SPARK-30635 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference
jobit mathew created SPARK-30635: Summary: Document PARTITIONED BY Clause of CREATE statement in SQL Reference Key: SPARK-30635 URL: https://issues.apache.org/jira/browse/SPARK-30635 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 2.4.4 Reporter: jobit mathew -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)
[ https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yurii Oleynikov updated SPARK-30634: Description: Hi, I have an application that does Arbitrary Stateful Processing in Structured Streaming and uses delta.merge to update a Delta table, and I faced strange behaviour: 1. I've noticed that logs inside the implementation of {{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my application are output twice. 2. While looking for a root cause I also found that the number of state rows reported by Spark doubles. I thought there might be a bug in my code, so I went back to {{JavaStructuredSessionization}} from the Apache Spark examples and changed it a bit. Still got the same result. The problem happens only if I do not persist the batch DataFrame (batchDf.persist()) inside foreachBatch.
{code:java}
StreamingQuery query = sessionUpdates
    .writeStream()
    .outputMode("update")
    .foreachBatch((VoidFunction2<Dataset<SessionUpdate>, Long>) (batchDf, v2) -> {
        // without persisting batchDf first, the following doubles the number of
        // Spark state rows and causes MapGroupsWithStateFunction to log twice
        deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
            .whenNotMatched().insertAll()
            .whenMatched()
            .updateAll()
            .execute();
    })
    .trigger(Trigger.ProcessingTime(1))
    .queryName("ACME")
    .start();
{code}
According to [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the [Apache Spark docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch] there seems to be no need to persist the dataset/dataframe inside {{foreachBatch}}. Sample code from the Apache Spark examples with Delta: [JavaStructuredSessionization with Delta merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java] Appreciate your clarification. was: Hi, I've faced strange behaviour with Delta merge and Arbitrary Stateful Processing in Structured Streaming.
I have an application that does Arbitrary Stateful Processing in Structured Streaming and uses delta.merge to update a Delta table. I've noticed that logs inside the implementation of {{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my application are output twice. While looking for a root cause I also found that the number of state rows reported by Spark doubles. I thought there might be a bug in my code, so I went back to the {{JavaStructuredSessionization}} Apache Spark example and changed it a bit. Still got the same result. The problem happens only if I do not persist the batch DataFrame (batchDf.persist()) inside foreachBatch.
{code:java}
StreamingQuery query = sessionUpdates
    .writeStream()
    .outputMode("update")
    .foreachBatch((VoidFunction2<Dataset<SessionUpdate>, Long>) (batchDf, v2) -> {
        // without persisting batchDf first, the following doubles the number of
        // Spark state rows and causes MapGroupsWithStateFunction to log twice
        deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
            .whenNotMatched().insertAll()
            .whenMatched()
            .updateAll()
            .execute();
    })
    .trigger(Trigger.ProcessingTime(1))
    .queryName("ACME")
    .start();
{code}
According to [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the [Apache Spark docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch] there seems to be no need to persist the dataset/dataframe inside {{foreachBatch}}. Sample code from the Apache Spark examples with Delta: [JavaStructuredSessionization with Delta merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java] Appreciate your clarification. 
> Delta Merge and Arbitrary Stateful Processing in Structured streaming > (foreachBatch) > - > > Key: SPARK-30634 > URL: https://issues.apache.org/jira/browse/SPARK-30634 > Project: Spark > Issue Type: Question > Components: Examples, Spark Core, Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 (scala 2.11.12) > Delta: 0.5.0 > Java(TM) SE Runtime Environment (build 1.8.0_91-b14) > OS: Ubuntu 18.04 LTS > >Reporter: Yurii Oleynikov >Priority: Trivial > Attachments: Capture1.PNG > > > Hi , > I have an application that makes Arbitrary Stateful Processing in Structured > Streaming and used delta.merge to update delta table and faced strange > behaviour: > 1. I've noticed that logs inside implementation
[jira] [Created] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)
Yurii Oleynikov created SPARK-30634: --- Summary: Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch) Key: SPARK-30634 URL: https://issues.apache.org/jira/browse/SPARK-30634 Project: Spark Issue Type: Question Components: Examples, Spark Core, Structured Streaming Affects Versions: 2.4.3 Environment: Spark 2.4.3 (scala 2.11.12) Delta: 0.5.0 Java(TM) SE Runtime Environment (build 1.8.0_91-b14) OS: Ubuntu 18.04 LTS Reporter: Yurii Oleynikov Attachments: Capture1.PNG Hi, I've faced strange behaviour with Delta merge and Arbitrary Stateful Processing in Structured Streaming. I have an application that does Arbitrary Stateful Processing in Structured Streaming and uses delta.merge to update a Delta table. I've noticed that logs inside my implementation of {{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} are output twice. While looking for the root cause I also found that the number of state rows reported by Spark doubles. I thought there might be a bug in my code, so I went back to the {{JavaStructuredSessionization}} example from Apache Spark and changed it a bit. I still got the same result. The problem happens only if I do not call batchDf.persist() inside foreachBatch. 
{code:java}
StreamingQuery query = sessionUpdates
    .writeStream()
    .outputMode("update")
    .foreachBatch((VoidFunction2<Dataset<SessionUpdate>, Long>) (batchDf, v2) -> {
        // the following doubles the number of Spark state rows and causes
        // MapGroupsWithStateFunction to log twice without persisting
        deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
            .whenNotMatched().insertAll()
            .whenMatched()
            .updateAll()
            .execute();
    })
    .trigger(Trigger.ProcessingTime(1))
    .queryName("ACME")
    .start();
{code}
According to [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the [Apache Spark docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch] there seems to be no need to persist the dataset/dataframe inside {{foreachBatch}}. Sample code from the Apache Spark examples with Delta: [JavaStructuredSessionization with Delta merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java] Appreciate your clarification. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
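The doubling described above is consistent with {{foreachBatch}} handing over a lazily evaluated micro-batch: a Delta MERGE reads its source more than once (e.g. a matched-row pass and a not-matched pass), and each read re-runs the upstream stateful function unless the batch is persisted. A rough, Spark-free sketch of that pattern in plain Java — a {{Supplier}} stands in for the un-persisted batch plan, memoization stands in for {{persist()}}, and all names are invented for the illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyBatch {
    // Stand-in for a consumer that triggers two actions over the same
    // micro-batch plan, the way a MERGE scans its source more than once.
    static void consume(Supplier<String> plan) {
        plan.get(); // first action (e.g. matched-row scan)
        plan.get(); // second action (e.g. not-matched scan)
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();

        // Un-persisted plan: its side effects re-run on every action.
        Supplier<String> unpersisted = () -> {
            evaluations.incrementAndGet(); // stands in for the stateful function's log line
            return "batch";
        };
        consume(unpersisted);
        System.out.println(evaluations.get()); // 2

        // "persist()" modelled as materializing once and reusing the result.
        evaluations.set(0);
        String cached = unpersisted.get(); // evaluate exactly once
        consume(() -> cached);             // both actions reuse the cached value
        System.out.println(evaluations.get()); // 1
    }
}
```

This is only an analogy for why {{batchDf.persist()}} stops the double logging; it is not Spark's execution model in miniature.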
[jira] [Updated] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)
[ https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yurii Oleynikov updated SPARK-30634: Attachment: Capture1.PNG > Delta Merge and Arbitrary Stateful Processing in Structured streaming > (foreachBatch) > - > > Key: SPARK-30634 > URL: https://issues.apache.org/jira/browse/SPARK-30634 > Project: Spark > Issue Type: Question > Components: Examples, Spark Core, Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 (scala 2.11.12) > Delta: 0.5.0 > Java(TM) SE Runtime Environment (build 1.8.0_91-b14) > OS: Ubuntu 18.04 LTS > >Reporter: Yurii Oleynikov >Priority: Trivial > Attachments: Capture1.PNG > > > Hi, I've faced strange behaviour with Delta merge and Arbitrary Stateful > Processing in Structured Streaming. > I have an application that does Arbitrary Stateful Processing in Structured > Streaming and uses delta.merge to update a Delta table. > > I've noticed that logs inside my implementation of > {{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} are > output twice. > While looking for the root cause I also found that the number of state rows > reported by Spark doubles. > I thought there might be a bug in my code, so I went back to the > {{JavaStructuredSessionization}} example from Apache Spark and changed it a bit. > I still got the same result. > The problem happens only if I do not call batchDf.persist() inside > foreachBatch. 
> {code:java}
> StreamingQuery query = sessionUpdates
>     .writeStream()
>     .outputMode("update")
>     .foreachBatch((VoidFunction2<Dataset<SessionUpdate>, Long>) (batchDf, v2) -> {
>         // the following doubles the number of Spark state rows and causes
>         // MapGroupsWithStateFunction to log twice without persisting
>         deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
>             .whenNotMatched().insertAll()
>             .whenMatched()
>             .updateAll()
>             .execute();
>     })
>     .trigger(Trigger.ProcessingTime(1))
>     .queryName("ACME")
>     .start();
> {code}
> According to [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the [Apache Spark docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch] there seems to be no need to persist the dataset/dataframe inside {{foreachBatch}}.
> Sample code from the Apache Spark examples with Delta: [JavaStructuredSessionization with Delta merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]
> Appreciate your clarification.
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022900#comment-17022900 ] Tobias Hermann commented on SPARK-30421: [~dongjoon] I'm glad we are aligned now. :) For future reference: The original Pandas example {quote}df.drop(columns=["col1"]).loc[df["col1"] == 1] {quote} accesses the (unnamed) dataframe resulting from the drop call by row index (loc). This would even work (but not be very meaningful) using a totally independent dataframe for the filtering: {quote}df_foo = pd.DataFrame(data={'foo': [0, 1]}) df_bar = pd.DataFrame(data={'bar': ["a", "b"]}) df_bar.loc[df_foo["foo"] == 1] {quote} > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist.
[jira] [Commented] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022869#comment-17022869 ] weiwenda commented on SPARK-30617: -- [~dongjoon] Thanks for your advice. I will fill in the Fix Version / Affects Version fields carefully next time. > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog that retrieves metadata from > multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so > that we can run mixed queries between Hive and our online data. > # But as Spark requires that the value of spark.sql.catalogImplementation must be > one of in-memory/hive, we had to modify SparkSession and rebuild Spark to > make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks!
[jira] [Updated] (SPARK-30633) Codegen fails when xxHash seed is not an integer
[ https://issues.apache.org/jira/browse/SPARK-30633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Cording updated SPARK-30633: Description: If the seed for xxHash is not an integer the generated code does not compile. Steps to reproduce: {code:java} import org.apache.spark.sql.catalyst.expressions.XxHash64 import org.apache.spark.sql.Column val file = "..." val column = col("...") val df = spark.read.csv(file) def xxHash(seed: Long, cols: Column*): Column = new Column( XxHash64(cols.map(_.expr), seed) ) val seed = (Math.pow(2, 32)+1).toLong df.select(xxHash(seed, column)).show() {code} Appending an L to the seed when the datatype is long fixes the issue. was: If the seed for xxHash is not an integer the generated code does not compile. Steps to reproduce: {code:java} import org.apache.spark.sql.catalyst.expressions.XxHash64 import org.apache.spark.sql.Column val file = "..." val column = col("...") val df = spark.read.csv(file) def xxHash(seed: Long, cols: Column*): Column = new Column( XxHash64(cols.map(_.expr), seed) ) val seed = (Math.pow(2, 32)+1).toLong df.select(xxHash(seed, column)).show() {code} Appending an L to the seed when the datatype is long fixes the issue. > Codegen fails when xxHash seed is not an integer > > > Key: SPARK-30633 > URL: https://issues.apache.org/jira/browse/SPARK-30633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Patrick Cording >Priority: Major > > If the seed for xxHash is not an integer the generated code does not compile. > Steps to reproduce: > {code:java} > import org.apache.spark.sql.catalyst.expressions.XxHash64 > import org.apache.spark.sql.Column > val file = "..." 
> val column = col("...") > val df = spark.read.csv(file) > def xxHash(seed: Long, cols: Column*): Column = new Column( >XxHash64(cols.map(_.expr), seed) > ) > val seed = (Math.pow(2, 32)+1).toLong > df.select(xxHash(seed, column)).show() > {code} > Appending an L to the seed when the datatype is long fixes the issue.
[jira] [Created] (SPARK-30633) Codegen fails when xxHash seed is not an integer
Patrick Cording created SPARK-30633: --- Summary: Codegen fails when xxHash seed is not an integer Key: SPARK-30633 URL: https://issues.apache.org/jira/browse/SPARK-30633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Patrick Cording If the seed for xxHash does not fit in an integer, the generated code does not compile. Steps to reproduce:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.XxHash64
import org.apache.spark.sql.Column

val file = "..."
val column = col("...")
val df = spark.read.csv(file)

def xxHash(seed: Long, cols: Column*): Column = new Column(
  XxHash64(cols.map(_.expr), seed)
)

val seed = (Math.pow(2, 32) + 1).toLong
df.select(xxHash(seed, column)).show()
{code}
Appending an L to the seed when the datatype is long fixes the issue.
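For context on the fix the report describes: Java int literals max out at 2^31 - 1, so when generated code prints a long seed such as 2^32 + 1 without an {{L}} suffix, javac rejects it with "integer number too large". A minimal sketch of the suffixing idea — the helper is hypothetical, not Spark's actual codegen:

```java
public class SeedLiteral {
    // Hypothetical helper mirroring the described fix: render a long seed as
    // a Java source literal with an "L" suffix so that generated code still
    // compiles when the value falls outside the int range.
    static String toJavaLiteral(long seed) {
        return seed + "L";
    }

    public static void main(String[] args) {
        long seed = (long) (Math.pow(2, 32) + 1); // 4294967297, too large for int
        System.out.println(toJavaLiteral(seed));  // 4294967297L

        // Without the suffix, codegen would emit something like
        //   long seed = 4294967297;
        // which javac rejects with "error: integer number too large".
    }
}
```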
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022850#comment-17022850 ] Dongjoon Hyun commented on SPARK-30421: --- While rethinking about this, the original column's index might be different because it can be considered a value array without any meaning. Got it, [~tobias_hermann]. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist.
[jira] [Comment Edited] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022850#comment-17022850 ] Dongjoon Hyun edited comment on SPARK-30421 at 1/24/20 10:44 AM: - While rethinking about this, the original column's index might be different because it can be considered as a value array without any meaning. Got it, [~tobias_hermann]. was (Author: dongjoon): While rethinking about this, the original column's index might be different because it can be considered a value array without any meaning. Got it, [~tobias_hermann]. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist.
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022835#comment-17022835 ] Dongjoon Hyun commented on SPARK-30421: --- Nope. Your example is different. I illustrated what I wanted. "Pandas supports filtering with *the original column's index* on the dropped data frame." That's my point. I intentionally didn't declare `df2` or `df2["bar"]`. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist.
[jira] [Created] (SPARK-30632) to_timestamp() doesn't work with certain timezones
Anton Daitche created SPARK-30632: - Summary: to_timestamp() doesn't work with certain timezones Key: SPARK-30632 URL: https://issues.apache.org/jira/browse/SPARK-30632 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4, 2.3.0 Reporter: Anton Daitche It seems that to_timestamp() doesn't work with region-based timezone IDs of the form Area/City, e.g. America/Los_Angeles. The code
{code:scala}
val df = Seq(
  ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
  ("2020-01-01 01:30:00.123", "PST")
).toDF("ts_str", "tz_name")

val ts_parsed = to_timestamp(
  concat_ws(" ", $"ts_str", $"tz_name"),
  "yyyy-MM-dd HH:mm:ss.SSS z"
).as("timestamp")

df.select(ts_parsed).show(false)
{code}
prints
{code}
+-------------------+
|timestamp          |
+-------------------+
|null               |
|2020-01-01 10:30:00|
+-------------------+
{code}
So, the datetime string with timezone PST is properly parsed, whereas the one with America/Los_Angeles is converted to null. According to [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] response on GitHub, this code works when run on the recent master version. See also the discussion in [this|https://github.com/apache/spark/pull/24195#issue] issue for more context.
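For reference, the distinction the report runs into is between zone abbreviations ("PST") and region-based zone IDs ("America/Los_Angeles"). A small java.time sketch of that difference — plain JDK, not Spark's own parser (the linked PR discussion suggests the parsing path changed between the 2.x line and master) — showing that region IDs are first-class {{ZoneId}}s, while "PST" only resolves through a legacy short-ID mapping:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ZoneParsing {
    public static void main(String[] args) {
        // Region-based IDs parse with java.time's zone-id pattern letter "VV".
        DateTimeFormatter fmt =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS VV");
        ZonedDateTime zdt =
                ZonedDateTime.parse("2019-01-24 11:30:00.123 America/Los_Angeles", fmt);
        System.out.println(zdt.getZone()); // America/Los_Angeles

        // "PST" is not a valid ZoneId on its own; java.time only resolves such
        // abbreviations through the legacy short-ID mapping.
        System.out.println(ZoneId.of("PST", ZoneId.SHORT_IDS)); // America/Los_Angeles
    }
}
```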