[GitHub] spark pull request #16016: Branch 2.1
Github user horo90 closed the pull request at: https://github.com/apache/spark/pull/16016 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16016: Branch 2.1
GitHub user horo90 reopened a pull request: https://github.com/apache/spark/pull/16016 Branch 2.1

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16016.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16016

commit 39d2fdb51233ed9b1aaf3adaa3267853f5e58c0f
Author: frreiss
Date: 2016-11-02T06:00:17Z

[SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use checksum files

## What changes were proposed in this pull request? When the metadata logs for various parts of Structured Streaming are stored on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves hidden HDFS-style checksum (CRC) files in the log directory, one file per batch. This PR modifies HDFSMetadataLog so that it detects the use of a filesystem that doesn't use CRC files and removes the CRC files.

## How was this patch tested? Modified an existing test case in HDFSMetadataLogSuite to check whether HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem. Ran the entire regression suite.

Author: frreiss
Closes #15027 from frreiss/fred-17475.
(cherry picked from commit 620da3b4828b3580c7ed7339b2a07938e6be1bb1)
Signed-off-by: Reynold Xin

commit e6509c2459e7ece3c3c6bcd143b8cc71f8f4d5c8
Author: Eric Liang
Date: 2016-11-02T06:15:10Z

[SPARK-18183][SPARK-18184] Fix INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables

There are a couple of issues with the current 2.1 behavior when inserting into Datasource tables with partitions managed by Hive. (1) OVERWRITE TABLE ... PARTITION will actually overwrite the entire table instead of just the specified partition. (2) INSERT|OVERWRITE does not work with partitions that have custom locations. This PR fixes both of these issues for Datasource tables managed by Hive. The behavior for legacy tables or when `manageFilesourcePartitions = false` is unchanged. There is one other issue in that INSERT OVERWRITE with dynamic partitions will overwrite the entire table instead of just the updated partitions, but this behavior is pretty complicated to implement for Datasource tables. We should address that in a future release.

Unit tests.

Author: Eric Liang
Closes #15705 from ericl/sc-4942.
(cherry picked from commit abefe2ec428dc24a4112c623fb6fbe4b2ca60a2b)
Signed-off-by: Reynold Xin

commit 85dd073743946383438aabb9f1281e6075f25cc5
Author: Reynold Xin
Date: 2016-11-02T06:37:03Z

[SPARK-18192] Support all file formats in structured streaming

## What changes were proposed in this pull request? This patch adds support for all file formats in structured streaming sinks. This is actually a very small change thanks to all the previous refactoring done using the new internal commit protocol API.

## How was this patch tested? Updated FileStreamSinkSuite to add test cases for json, text, and parquet.

Author: Reynold Xin
Closes #15711 from rxin/SPARK-18192.
(cherry picked from commit a36653c5b7b2719f8bfddf4ddfc6e1b828ac9af1)
Signed-off-by: Reynold Xin

commit 4c4bf87acf2516a72b59f4e760413f80640dca1e
Author: CodingCat
Date: 2016-11-02T06:39:53Z

[SPARK-18144][SQL] logging StreamingQueryListener$QueryStartedEvent

## What changes were proposed in this pull request? The PR fixes the bug that the QueryStartedEvent is not logged: the postToAll() in the original code actually calls StreamingQueryListenerBus.postToAll(), which has no listener at all. We should post via sparkListenerBus.postToAll(s) and this.postToAll() to trigger local listeners as well as the listeners registered in LiveListenerBus. zsxwing

## How was this patch tested? The following snapshot shows that QueryStartedEvent has been logged correctly
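The double-posting idea behind the SPARK-18144 fix can be sketched as follows. All class and method names here are illustrative stand-ins, not Spark's actual StreamingQueryListenerBus/LiveListenerBus API:

```scala
// Minimal sketch of the listener-bus fix described above (illustrative only).
trait Listener { def onEvent(event: String): Unit }

class ListenerBus {
  private var listeners = List.empty[Listener]
  def addListener(l: Listener): Unit = listeners = listeners :+ l
  def postToAll(event: String): Unit = listeners.foreach(_.onEvent(event))
}

// The buggy code only called this.postToAll(), which reaches just the
// locally registered listeners. The fix posts to the global bus as well,
// so listeners registered there also observe QueryStartedEvent.
class StreamingBus(globalBus: ListenerBus) extends ListenerBus {
  def post(event: String): Unit = {
    globalBus.postToAll(event) // listeners registered in the global bus
    this.postToAll(event)      // locally registered listeners
  }
}
```

With one listener on each bus, posting a single event through `StreamingBus.post` delivers it to both.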
[GitHub] spark pull request #16016: Branch 2.1
Github user horo90 closed the pull request at: https://github.com/apache/spark/pull/16016
[GitHub] spark issue #16003: [SPARK-18482][SQL] make sure Spark can access the table ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16003 How about Spark 2.1 altering the table metadata created by Spark 2.0?
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15975 Merged build finished. Test PASSed.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69180/ Test PASSed.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15975 **[Test build #69180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69180/consoleFull)** for PR 15975 at commit [`1b0caea`](https://github.com/apache/spark/commit/1b0caea20bd233ffda5113c11234d8fd57f6faa3).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16015 **[Test build #69181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69181/consoleFull)** for PR 16015 at commit [`9d965e7`](https://github.com/apache/spark/commit/9d965e74be85dcb1ae75ee102ee63a15c411a4d8).
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16015 Retest this please.
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16015 The only test failure is irrelevant to this PR.

```
[info] - set spark.sql.warehouse.dir *** FAILED *** (5 minutes, 0 seconds)
[info]   Timeout of './bin/spark-submit' '--class' 'org.apache.spark.sql.hive.SetWarehouseLocationTest' '--name' 'SetSparkWarehouseLocationTest' '--master' 'local-cluster[2,1,1024]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--driver-java-options' '-Dderby.system.durability=test' 'file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-27a1c717-99bc-44c6-8af7-710c8440c14d/testJar-1480135147576.jar' See the log4j logs for more detail.
```
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/15975 @gatorsmile NP. Thank you for informing that.
[GitHub] spark issue #16016: Branch 2.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16016 Could you please close this PR?
[GitHub] spark issue #16016: Branch 2.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16016 Can one of the admins verify this patch?
[GitHub] spark pull request #16016: Branch 2.1
GitHub user horo90 opened a pull request: https://github.com/apache/spark/pull/16016 Branch 2.1

![image](https://cloud.githubusercontent.com/assets/678008/19821553/007a7d28-9d2d-11e6-9f13-49851559cdaa.png)
[GitHub] spark issue #15662: [SPARK-18141][SQL] Fix to quote column names in the pred...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15662 @sureshthalamati Could you resolve the conflict? Thanks!
[GitHub] spark pull request #15662: [SPARK-18141][SQL] Fix to quote column names in t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15662#discussion_r89666271

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -172,7 +172,7 @@ class JDBCSuite extends SparkFunSuite
       """.stripMargin.replaceAll("\n", " "))
     conn.prepareStatement(
-      "create table test.emp(name TEXT(32) NOT NULL," +
+      "create table test.emp(\"Name\" TEXT(32) NOT NULL," +
--- End diff --

This is an unnecessary change, right?
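For context on why the test now creates a quoted, mixed-case column: JDBC predicate push-down must quote identifiers so the database preserves their case instead of folding it. A minimal sketch of such quoting (an illustrative helper in the spirit of a JDBC dialect's quoteIdentifier, not Spark's actual code):

```scala
// Wrap a column name in double quotes, doubling any embedded quote,
// so a mixed-case column like "Name" survives the database's case folding.
def quoteIdentifier(colName: String): String =
  "\"" + colName.replace("\"", "\"\"") + "\""
```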
[GitHub] spark issue #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for unidoc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16013 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69179/ Test FAILed.
[GitHub] spark issue #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for unidoc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16013 Merged build finished. Test FAILed.
[GitHub] spark issue #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for unidoc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16013 **[Test build #69179 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69179/consoleFull)** for PR 16013 at commit [`73fcd35`](https://github.com/apache/spark/commit/73fcd355a565c5ea433b1f8ca11e08ee6c3f2a9e).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15975 **[Test build #69180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69180/consoleFull)** for PR 15975 at commit [`1b0caea`](https://github.com/apache/spark/commit/1b0caea20bd233ffda5113c11234d8fd57f6faa3).
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15975 @dongjoon-hyun Will not add test cases for the write path in this PR, because it requires changes to the source code.
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16015 Merged build finished. Test FAILed.
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16015 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69178/ Test FAILed.
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16015 **[Test build #69178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69178/consoleFull)** for PR 16015 at commit [`9d965e7`](https://github.com/apache/spark/commit/9d965e74be85dcb1ae75ee102ee63a15c411a4d8).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class OuterReference(e: NamedExpression)`
[GitHub] spark pull request #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16013#discussion_r89665558

--- Diff: core/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala ---
@@ -155,7 +155,7 @@ class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {
    * to the right except for the last which is closed
    * e.g. for the array
    * [1, 10, 20, 50] the buckets are [1, 10) [10, 20) [20, 50]
-   * e.g 1<=x<10 , 10<=x<20, 20<=x<=50
+   * e.g 1=x10 , 10=x20, 20=x=50
--- End diff --

This originally gives an error as below:

```
[error] .../java/org/apache/spark/rdd/DoubleRDDFunctions.java:73: error: malformed HTML
[error]  * e.g 1<=x<10, 10<=x<20, 20<=x<=50
[error]         ^
[error] .../java/org/apache/spark/rdd/DoubleRDDFunctions.java:73: error: malformed HTML
[error]  * e.g 1<=x<10, 10<=x<20, 20<=x<=50
[error]              ^
[error] .../java/org/apache/spark/rdd/DoubleRDDFunctions.java:73: error: malformed HTML
[error]  * e.g 1<=x<10, 10<=x<20, 20<=x<=50
[error]                    ^
...
```

However, after fixing it as above, this is printed as-is in javadoc (not in scaladoc):

![javadoc output](https://cloud.githubusercontent.com/assets/6477701/20638079/e17d0742-b3de-11e6-820d-d2ac85d09947.png)

It seems we should find another approach to deal with this.
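The malformed-HTML errors arise because javadoc 8 parses doc-comment bodies as HTML, so a bare `<` in `1<=x<10` is read as the start of a tag. One way to make such text HTML-safe is entity escaping, sketched below (an illustrative helper; the PR itself is still weighing approaches):

```scala
// Replace the characters javadoc treats as markup with HTML entities.
// '&' must be escaped first so already-produced entities are not
// double-escaped on a second pass.
def escapeHtml(text: String): String =
  text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```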
[GitHub] spark pull request #16007: [SPARK-18583][SQL] Fix nullability of InputFileNa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16007
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16007 Merging in master/branch-2.1. Thanks.
[GitHub] spark pull request #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16013#discussion_r89665095

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2063,6 +2063,7 @@ class SparkContext(config: SparkConf) extends Logging {
    * @param jobId the job ID to cancel
    * @throws InterruptedException if the cancel message cannot be sent
--- End diff --

It seems fine.

- Scala: ![scaladoc output](https://cloud.githubusercontent.com/assets/6477701/20637897/1a78be2a-b3d9-11e6-939a-47c202a50037.png)
- Java: ![javadoc output](https://cloud.githubusercontent.com/assets/6477701/20637898/1eded54e-b3d9-11e6-90a5-5b9c34ec0831.png)
[GitHub] spark pull request #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16013#discussion_r89664964

--- Diff: core/src/main/scala/org/apache/spark/Accumulator.scala ---
@@ -26,7 +26,7 @@ package org.apache.spark
  *
  * An accumulator is created from an initial value `v` by calling
  * [[SparkContext#accumulator SparkContext.accumulator]].
- * Tasks running on the cluster can then add to it using the [[Accumulable#+= +=]] operator.
+ * Tasks running on the cluster can then add to it using the `+=` operator.
--- End diff --

After this PR it still prints the same.

- Scala: ![scaladoc output](https://cloud.githubusercontent.com/assets/6477701/20637848/2d670926-b3d7-11e6-8665-a9f3852545c2.png)
- Java: ![javadoc output](https://cloud.githubusercontent.com/assets/6477701/20637849/322b675e-b3d7-11e6-925b-a9160f06bbc8.png)
[GitHub] spark pull request #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16013#discussion_r89664921

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -262,8 +262,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
   /**
    * Get a time parameter as seconds; throws a NoSuchElementException if it's not set. If no
    * suffix is provided then seconds are assumed.
-   * @throws NoSuchElementException
+   * @throws java.util.NoSuchElementException
--- End diff --

This is interesting. Using `@throws NoSuchElementException` complains as below:

```
[error] location: class VectorIndexerModel
[error] .../java/org/apache/spark/SparkConf.java:226: error: reference not found
[error]  * @throws NoSuchElementException
[error]    ^
```
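The fully qualified name works because the generated javadoc can always resolve `java.util.NoSuchElementException`, while the bare name depends on what happens to be imported in the generated Java file. A small sketch of the documented pattern (an illustrative method, not SparkConf's actual code):

```scala
/**
 * Returns the value stored under `key`.
 *
 * @throws java.util.NoSuchElementException if `key` is not set
 */
def getOrThrow(settings: Map[String, String], key: String): String =
  settings.getOrElse(key, throw new java.util.NoSuchElementException(key))
```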
[GitHub] spark pull request #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16013#discussion_r89664911 --- Diff: core/src/main/scala/org/apache/spark/Accumulator.scala --- @@ -26,7 +26,7 @@ package org.apache.spark * * An accumulator is created from an initial value `v` by calling * [[SparkContext#accumulator SparkContext.accumulator]]. - * Tasks running on the cluster can then add to it using the [[Accumulable#+= +=]] operator. + * Tasks running on the cluster can then add to it using the `+=` operator. --- End diff -- I just decided to keep the original format rather than trying to make this pretty. The original was as below:
- Scala: https://cloud.githubusercontent.com/assets/6477701/20637823/6f1c8914-b3d6-11e6-83f4-87355205d4c1.png
- Java: https://cloud.githubusercontent.com/assets/6477701/20637824/6f1cfce6-b3d6-11e6-93d7-2bae071f5753.png
[GitHub] spark issue #16013: [WIP][SPARK-3359][DOCS] Make javadoc8 working for unidoc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16013 **[Test build #69179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69179/consoleFull)** for PR 16013 at commit [`73fcd35`](https://github.com/apache/spark/commit/73fcd355a565c5ea433b1f8ca11e08ee6c3f2a9e).
[GitHub] spark issue #15916: [SPARK-18487][SQL] Add completion listener to HashAggreg...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15916 Forgot to say: of course, this example will throw the exception only when running in "test". Other developers could encounter this when they write test code in the future. If we could provide more info in this error message, we could save them time investigating it. What do you think?
[GitHub] spark issue #16015: [SPARK-17251][SQL] Improve `OuterReference` to be `Named...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16015 **[Test build #69178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69178/consoleFull)** for PR 16015 at commit [`9d965e7`](https://github.com/apache/spark/commit/9d965e74be85dcb1ae75ee102ee63a15c411a4d8).
[GitHub] spark issue #16012: [SPARK-17251][SQL] Support `OuterReference` in projectio...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16012 Thank you for the review, @hvanhovell and @nsyca . I agree with you. We need enough time for this. So, the first option for 2.1 is spun off into #16015.
[GitHub] spark pull request #16015: [SPARK-17251][SQL] Improve `OuterReference` to be...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/16015 [SPARK-17251][SQL] Improve `OuterReference` to be `NamedExpression` ## What changes were proposed in this pull request? Currently, `OuterReference` is not a `NamedExpression`, so it raises a `ClassCastException` when it is used in the projection lists of IN correlated subqueries. This PR aims to support that by making `OuterReference` a `NamedExpression`, so that correct error messages are shown.
```scala
scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression
```
## How was this patch tested? Pass the Jenkins tests with new test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-17251-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16015.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16015 commit 9d965e74be85dcb1ae75ee102ee63a15c411a4d8 Author: Dongjoon Hyun Date: 2016-11-26T03:24:29Z [SPARK-17251][SQL] Improve `OuterReference` to be `NamedExpression`
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16007 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69177/ Test PASSed.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16007 Merged build finished. Test PASSed.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16007 **[Test build #69177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69177/consoleFull)** for PR 16007 at commit [`2657d95`](https://github.com/apache/spark/commit/2657d955741299431f708c99584514e999ef90c4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15975 Will update it tonight.
[GitHub] spark issue #15916: [SPARK-18487][SQL] Add completion listener to HashAggreg...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15916 The test case I added in this PR:
```scala
val rng = new scala.util.Random(42)
val data = sparkContext.parallelize(Seq.tabulate(100) { i =>
  Row(Array.fill(10)(rng.nextInt(10)))
})
val schema = StructType(Seq(
  StructField("arr", DataTypes.createArrayType(DataTypes.IntegerType))
))
val df = spark.createDataFrame(data, schema)
val exploded = df.select(struct(col("*")).as("star"), explode(col("arr")).as("a"))
val joined = exploded.join(exploded, "a").drop("a").distinct()
joined.show()
```
would throw an exception like this:
```
[info] - SPARK-18487: Consume all elements for show/take to avoid memory leak *** FAILED *** (1 second, 73 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 179.0 failed 1 times, most recent failure: Lost task 0.0 in stage 179.0 (TID 501, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 33816576 bytes, TID = 501
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:295)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
```
I submitted this PR because @sethah encountered this exception during his test. Other developers might hit this in the future, and if they don't know this part, the error message would lead them to think a memory leak happened. To avoid this and provide more useful info, I'd like to modify this error message too.
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89664100 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -989,7 +989,7 @@ class Analyzer( withPosition(u) { try { outer.resolve(nameParts, resolver) match { -case Some(outerAttr) => OuterReference(outerAttr) +case Some(outerAttr) => OuterReference(outerAttr)() --- End diff -- Another interesting case to consider:
```sql
select ... from t1 where t1.c1 in (select sum(t1.c2) from t2)
```
If we support correlated columns in the SELECT clause, do we build the Aggregate on T2 or T1?
[GitHub] spark issue #15358: [SPARK-17783] [SQL] Hide Credentials in CREATE and DESC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15358 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69176/ Test PASSed.
[GitHub] spark issue #15358: [SPARK-17783] [SQL] Hide Credentials in CREATE and DESC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15358 Merged build finished. Test PASSed.
[GitHub] spark issue #15358: [SPARK-17783] [SQL] Hide Credentials in CREATE and DESC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15358 **[Test build #69176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69176/consoleFull)** for PR 15358 at commit [`45e0ee3`](https://github.com/apache/spark/commit/45e0ee31347752b8a5f5bbf325a536b0aae1a3e7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16014: [SPARK-18590][SPARKR] build R source package when making...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16014 Merged build finished. Test PASSed.
[GitHub] spark issue #16014: [SPARK-18590][SPARKR] build R source package when making...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16014 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69175/ Test PASSed.
[GitHub] spark issue #16014: [SPARK-18590][SPARKR] build R source package when making...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16014 **[Test build #69175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69175/consoleFull)** for PR 16014 at commit [`7977139`](https://github.com/apache/spark/commit/79771392f7a8c7fe4ed90b20aec05e5e65304975). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16007 Thanks - LGTM.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16007 **[Test build #69177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69177/consoleFull)** for PR 16007 at commit [`2657d95`](https://github.com/apache/spark/commit/2657d955741299431f708c99584514e999ef90c4).
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16007 I see, thanks!
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16007 Yes! That's what I meant -- change it to false and add some documentation and a require to force that contract.
[GitHub] spark issue #16014: [SPARK-18590][SPARKR] build R source package when making...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16014 @shivaram
[GitHub] spark issue #15916: [SPARK-18487][SQL] Add completion listener to HashAggreg...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15916 Can you show an example of a leak that would happen in Executor but not in the callback? Thanks.
[GitHub] spark issue #15916: [SPARK-18487][SQL] Add completion listener to HashAggreg...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15916 @rxin BTW, I see you merged #15989 to downgrade the error message level in TaskMemoryManager. I'd like to modify the error message in Executor too, because the current one is a little confusing to developers who don't know this part exactly; they would think a memory leak happened. What do you think? If it's OK with you, I'll submit a PR for it. Thanks.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16007 @rxin Sorry, but can we finally change the nullable value to `false`?
[GitHub] spark issue #15358: [SPARK-17783] [SQL] Hide Credentials in CREATE and DESC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15358 **[Test build #69176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69176/consoleFull)** for PR 15358 at commit [`45e0ee3`](https://github.com/apache/spark/commit/45e0ee31347752b8a5f5bbf325a536b0aae1a3e7).
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15975 @gatorsmile did you update this?
[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15736 I've conducted a lot of performance tests and gathered .hcd files so I can investigate this next week, but it looks like either the first commit is the best for performance, or my current configuration with this benchmark leaves us unable to infer whether our changes really make a difference. Sharing some raw data; the format is: benchmark name, date, time, data size in bytes (the same each run), elapsed time, and throughput (bytes per second).
**With the above suggestions for Partitioned*Buffer**
```
ScalaSparkPagerank 2016-11-25 18:49:23 259928115 49.577 5242917
ScalaSparkPagerank 2016-11-25 18:56:55 259928115 49.946 5204182
ScalaSparkPagerank 2016-11-25 19:00:04 259928115 46.510 5588650
ScalaSparkPagerank 2016-11-25 19:02:23 259928115 49.018 5302707
ScalaSparkPagerank 2016-11-25 19:05:25 259928115 49.270 5275585
```
**Vanilla, no changes at all**
```
ScalaSparkPagerank 2016-11-25 19:08:45 259928115 48.068 5407508
ScalaSparkPagerank 2016-11-25 19:11:20 259928115 47.712 5447856
ScalaSparkPagerank 2016-11-25 19:13:50 259928115 44.517 5838850
ScalaSparkPagerank 2016-11-25 19:16:07 259928115 49.942 5204599
ScalaSparkPagerank 2016-11-25 19:19:08 259928115 48.521 5357023
```
**Original commit**
```
ScalaSparkPagerank 2016-11-25 19:47:59 259928115 45.486 5714464
ScalaSparkPagerank 2016-11-25 19:50:48 259928115 48.507 5358569
ScalaSparkPagerank 2016-11-25 19:53:09 259928115 47.063 5522982
ScalaSparkPagerank 2016-11-25 19:56:58 259928115 46.154 5631757
ScalaSparkPagerank 2016-11-25 20:00:01 259928115 48.935 5311701
```
In Healthcenter I do see that these methods are still great candidates for optimisation as they are all very commonly used. Open to more suggestions; I have exclusive access to lots of hardware, can easily churn out more custom builds, and have lots of profiling software we can use. I'll be committing code for the SizeEstimator soon as that's a good candidate for optimisation here as well.
[GitHub] spark pull request #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15994#discussion_r89662083 --- Diff: project/MimaExcludes.scala --- @@ -529,6 +529,7 @@ object MimaExcludes { ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.evaluation.MulticlassMetrics.this"), ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.evaluation.RegressionMetrics.this"), ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.DataFrameNaFunctions.this"), + ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.DataFrameNaFunctions.fill"), --- End diff -- The thing is they are not backward compatible at bytecode level, so applications will break if they are not rebuilt.
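The bytecode-level incompatibility mentioned above can be made concrete with a toy class (purely illustrative, not Spark's `DataFrameNaFunctions`): Scala compiles default arguments into synthetic public methods named `name$default$n`, so changing a method's signature changes the bytecode-level contract that already-compiled callers link against, even when the source stays compatible:

```scala
// Illustrative only: inspect the synthetic default-argument getter that the
// Scala compiler emits for a method with a default parameter. A signature
// change alters these bytecode-level members, breaking unrebuilt callers.
class ToyNaFunctions {
  def fill(value: Long, cols: Seq[String] = Nil): String =
    s"fill($value, ${cols.mkString(",")})"
}

object BytecodeDemo {
  // Public methods whose names mark them as default-argument getters.
  def defaultGetters: Seq[String] =
    classOf[ToyNaFunctions].getMethods.map(_.getName).filter(_.contains("$default$")).toSeq
}
```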
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16007 Alright I looked more into this -- I think your approach might be better actually. Can you add a require in `InputFileNameHolder.setInputFileName` to verify the input is not null, and then document in InputFileNameHolder to say the returned value should never be null, and empty string if it is unknown? Then we can change the nullable value to false for this expression.
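The contract discussed above (never-null value, empty string for "unknown", a `require` guarding the setter) can be sketched as follows; this is a hypothetical stand-in modeled on the description of `InputFileNameHolder`, not Spark's actual source:

```scala
// Hypothetical sketch of the non-null contract for an input-file-name holder.
// The object name and members are illustrative, not Spark's InputFileNameHolder.
object FileNameHolder {
  private val inputFileName = new InheritableThreadLocal[String] {
    // Unknown file name is represented by the empty string, never null.
    override def initialValue(): String = ""
  }

  /** Returns the current input file name; never null, "" if unknown. */
  def getInputFileName(): String = inputFileName.get()

  /** Sets the input file name; rejects null to enforce the contract. */
  def setInputFileName(file: String): Unit = {
    require(file != null, "input file name cannot be null")
    inputFileName.set(file)
  }

  /** Resets to the "unknown" state. */
  def unsetInputFileName(): Unit = inputFileName.remove()
}
```

With this contract documented, an expression reading from the holder could safely declare itself non-nullable.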
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16007 @rxin And also we should modify the generated code to check whether the value is null or not, shouldn't we?
[GitHub] spark pull request #16008: [SPARK-18585][SQL] Use `ev.isNull = "false"` if p...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16008#discussion_r89661757 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -61,7 +61,6 @@ case class CreateArray(children: Seq[Expression]) extends Expression { ctx.addMutableState("Object[]", values, s"this.$values = null;") ev.copy(code = s""" - final boolean ${ev.isNull} = false; --- End diff -- can you explain how this change improves the code? I'd think it is no-op but maybe it's not the case?
[GitHub] spark pull request #15916: [SPARK-18487][SQL] Add completion listener to Has...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/15916
[GitHub] spark issue #15916: [SPARK-18487][SQL] Add completion listener to HashAggreg...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15916 @rxin Thanks. Appreciate your feedback. I could close this now.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16007 I see, I'll revert this and add the comment. Thanks.
[GitHub] spark pull request #16009: [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scal...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16009#discussion_r89661577 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends Params * * @group param */ - @Since("1.6.0") --- End diff -- why are the `@Since` annotations removed, btw?
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16007 I wouldn't change the default as it might break compatibility. That said, I don't think it is safe to just set this to non-nullable because it is a very implicit assumption, and setting it to be nullable is never "wrong". I'd add some comment explaining why it is nullable (e.g. "It depends on the semantics of the caller for InputFileNameHolder, and there is no guarantee that it won't be null")
[GitHub] spark pull request #16014: [SPARK-18590][SPARKR] build R source package when...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16014#discussion_r89661428 --- Diff: dev/create-release/release-build.sh --- @@ -189,6 +189,9 @@ if [[ "$1" == "package" ]]; then SHA512 $PYTHON_DIST_NAME > \ $PYTHON_DIST_NAME.sha +echo "Copying R source package" +cp spark-$SPARK_VERSION-bin-$NAME/R/SparkR_$SPARK_VERSION.tar.gz . --- End diff -- this is the source package we should release to CRAN
[GitHub] spark pull request #16014: [SPARK-18590][SPARKR] build R source package when...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16014#discussion_r89661414 --- Diff: R/pkg/NAMESPACE --- @@ -3,7 +3,7 @@ importFrom("methods", "setGeneric", "setMethod", "setOldClass") importFrom("methods", "is", "new", "signature", "show") importFrom("stats", "gaussian", "setNames") -importFrom("utils", "download.file", "object.size", "packageVersion", "untar") +importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "untar") --- End diff -- This regressed in a recent commit. check-cran.sh actually flags this in an existing NOTE, but we only check the number of NOTEs (which is still 1), so this went in undetected.
[GitHub] spark pull request #16014: [SPARK-18590][SPARKR] build R source package when...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16014#discussion_r89661364 --- Diff: R/pkg/DESCRIPTION --- @@ -1,28 +1,27 @@ Package: SparkR Type: Package -Title: R Frontend for Apache Spark Version: 2.1.0 -Date: 2016-11-06 --- End diff -- this is removed - I tried but haven't found a way to update this automatically (I guess this could go in the [release-tag](https://github.com/apache/spark/blob/master/dev/create-release/release-tag.sh) script, though). But more importantly, it seems like many (most?) packages do not have this in their DESCRIPTION. In any case, release dates are stamped when releasing to CRAN.
[GitHub] spark issue #16014: [SPARK-18590][SPARKR] build R source package when making...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16014 **[Test build #69175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69175/consoleFull)** for PR 16014 at commit [`7977139`](https://github.com/apache/spark/commit/79771392f7a8c7fe4ed90b20aec05e5e65304975).
[GitHub] spark pull request #16014: [SPARK-18590][SPARKR] build R source package when...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/16014 [SPARK-18590][SPARKR] build R source package when making distribution ## What changes were proposed in this pull request? We should include in the Spark distribution the built source package for SparkR. This will enable help and vignettes when the package is used. This source package is also what we would release to CRAN. ### more details These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh. 1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path. 2. `R CMD build` will build the vignettes. 3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the source package). (Tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R package and run tests.) 4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly, not in step #1). (The output of this step is what we package into the Spark dist and sparkr.zip.) Alternatively, `R CMD build` should already be installing the package in a temp directory, though it might just be finding that location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building vignettes takes a while, though. ## How was this patch tested? Manually, CI.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rdist Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16014.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16014 commit 79771392f7a8c7fe4ed90b20aec05e5e65304975 Author: Felix Cheung Date: 2016-11-25T23:00:25Z build source package in make-distribution, and take that as a part of the distribution
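The four packaging steps described in the PR above can be sketched as a dry-run script. The paths and package-file names here are assumptions, and the script only echoes the commands a release build would run rather than invoking R:

```shell
#!/bin/sh
# Dry-run sketch of the SparkR packaging steps described in the PR.
# SPARK_HOME and the tarball name are hypothetical; nothing is executed.
SPARK_HOME="${SPARK_HOME:-/path/to/spark}"

run() { echo "+ $*"; }  # print the command instead of running it

run R CMD INSTALL "$SPARK_HOME/R/pkg"        # 1. install so vignettes can library(SparkR)
run R CMD build "$SPARK_HOME/R/pkg"          # 2. build the source package (builds vignettes)
run R CMD check --no-tests SparkR_2.1.0.tar.gz  # 3. check the source package
run R CMD INSTALL SparkR_2.1.0.tar.gz        # 4. install from the source package
```

A real release script would drop the `run` wrapper and check each exit status; the `--no-tests` flag mirrors the PR's note that tests are skipped during the distribution build.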
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15998 Merged build finished. Test PASSed.
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15998 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69174/ Test PASSed.
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15998 **[Test build #69174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69174/consoleFull)** for PR 15998 at commit [`4e03c3e`](https://github.com/apache/spark/commit/4e03c3e46d22e5fe1b1fbc01ea57ef15d2723b9b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16007: [SPARK-18583][SQL] Fix nullability of InputFileName.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16007 @rxin The default value is currently `""` (`UTF8String.fromString("")`) if the input file name is not set. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala#L32 Should we change the default value to `null`?
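The contract being discussed in this thread (never-null return, empty string for "unknown", a `require` in the setter) can be sketched as follows; this is a simplified stand-in written for illustration, not Spark's actual `InputFileNameHolder` source:

```scala
// Sketch of the suggested contract, with an assumed object name.
// A thread-local holder whose setter rejects null and whose getter
// returns "" when the input file name is unknown.
object InputFileNameHolderSketch {
  private val inputFileName: ThreadLocal[String] = new ThreadLocal[String] {
    override def initialValue(): String = ""
  }

  /** Never returns null; the empty string means "unknown". */
  def getInputFileName(): String = inputFileName.get()

  def setInputFileName(file: String): Unit = {
    require(file != null, "input file name cannot be null")
    inputFileName.set(file)
  }

  def unsetInputFileName(): Unit = inputFileName.remove()
}
```

Under this contract the expression's output is never null, which is what would justify documenting the holder as never returning null rather than changing its default value.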
[GitHub] spark issue #16012: [SPARK-17251][SQL] Support `OuterReference` in projectio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16012 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69173/ Test PASSed.
[GitHub] spark issue #16012: [SPARK-17251][SQL] Support `OuterReference` in projectio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16012 Merged build finished. Test PASSed.
[GitHub] spark issue #16012: [SPARK-17251][SQL] Support `OuterReference` in projectio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16012 **[Test build #69173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69173/consoleFull)** for PR 16012 at commit [`3de9419`](https://github.com/apache/spark/commit/3de9419a30790020fb4d562625941dbc5e1772d2). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class OuterReference(e: NamedExpression)(`
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89657974 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -989,7 +989,7 @@ class Analyzer( withPosition(u) { try { outer.resolve(nameParts, resolver) match { -case Some(outerAttr) => OuterReference(outerAttr) +case Some(outerAttr) => OuterReference(outerAttr)() --- End diff -- I have not looked at the code changes closely, but I got a general idea of what the originally reported problem is. I second @hvanhovell's suggestion to not support outer references in a SELECT clause of a subquery in 2.1. Just fix the named expression first. An IN subquery might be okay, as it more or less reflects the inner join semantics. A NOT IN subquery is converted to a special case of an anti-join with extra logic for the null value:

```sql
select * from tbl_a where tbl_a.c1 not in (select tbl_a.c2 from tbl_b)
```

Does the LeftAnti join with effectively no join predicate, i.e. `(isnull(tbl_a.c1 = tbl_a.c2) || (tbl_a.c1 = tbl_a.c2))`, work correctly today? And if it returns a correct result, is it by design, not by chance?
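The null handling nsyca is asking about follows from SQL's three-valued semantics for `NOT IN`. The sketch below uses `Option` as a stand-in for nullable column values and is an illustration of the semantics, not Spark's implementation:

```scala
// Three-valued logic for `c1 NOT IN (subquery)`, with Option[Int] standing
// in for nullable values: Some(false) if c1 matches any row, None (unknown)
// if there is no match but a null is involved, Some(true) otherwise.
def notIn(c1: Option[Int], s: Seq[Option[Int]]): Option[Boolean] = {
  // Null-propagating equality: None if either side is null.
  def eq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  val cmps = s.map(eq(c1, _))
  if (cmps.contains(Some(true))) Some(false)  // definite match: NOT IN is false
  else if (cmps.contains(None)) None          // unknown because of nulls
  else Some(true)                             // definitely not present
}
```

A correct anti-join rewrite keeps only the rows for which the result is definitely true; rows where it is false or unknown must both be dropped, which is what the `isnull(...) || (...)` predicate encodes.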
[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15998#discussion_r89656487 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -482,6 +482,19 @@ class InMemoryCatalog( } } + override def listPartitionNames( + db: String, + table: String, + partialSpec: Option[TablePartitionSpec] = None): Seq[String] = synchronized { +val partitionColumnNames = getTable(db, table).partitionColumnNames + +listPartitions(db, table, partialSpec).map { partition => + partitionColumnNames.map { name => +name + "=" + partition.spec(name) --- End diff -- Does this need escaping, as provided by escapePathName?
[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15998#discussion_r89656749 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -922,6 +923,29 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat /** * Returns the partition names from hive metastore for a given table in a database. */ + override def listPartitionNames( + db: String, + table: String, + partialSpec: Option[TablePartitionSpec] = None): Seq[String] = withClient { +val actualPartColNames = getTable(db, table).partitionColumnNames +val clientPartitionNames = + client.getPartitionNames(db, table, partialSpec.map(lowerCasePartitionSpec)) + +if (actualPartColNames.exists(partColName => partColName != partColName.toLowerCase)) { + clientPartitionNames.map { partName => +val partSpec = PartitioningUtils.parsePathFragmentAsSeq(partName) --- End diff -- Is the (un)escaping here correct? It would be nice to have a unit test to verify these edge cases.
[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15998#discussion_r89656509 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala --- @@ -189,6 +189,21 @@ abstract class ExternalCatalog { spec: TablePartitionSpec): Option[CatalogTablePartition] /** + * List the names of all partitions that belong to the specified table, assuming it exists. + * + * A partial partition spec may optionally be provided to filter the partitions returned. + * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'), + * then a partial spec of (a='1') will return the first two only. --- End diff -- nit: newline here
[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15998#discussion_r89656787 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -922,6 +923,29 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat /** * Returns the partition names from hive metastore for a given table in a database. */ + override def listPartitionNames( + db: String, + table: String, + partialSpec: Option[TablePartitionSpec] = None): Seq[String] = withClient { +val actualPartColNames = getTable(db, table).partitionColumnNames +val clientPartitionNames = + client.getPartitionNames(db, table, partialSpec.map(lowerCasePartitionSpec)) + +if (actualPartColNames.exists(partColName => partColName != partColName.toLowerCase)) { + clientPartitionNames.map { partName => +val partSpec = PartitioningUtils.parsePathFragmentAsSeq(partName) +partSpec.map { case (partName, partValue) => + actualPartColNames.find(_.equalsIgnoreCase(partName)).get + "=" + partValue +}.mkString("/") + } +} else { + clientPartitionNames --- End diff -- Consider not having this optimization to avoid two different code paths here.
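The behavior under review in this thread - restoring the catalog's column-name case and building Hive-style `name=value/...` partition names - can be sketched as follows. The `escapePathName` here is a simplified stand-in for the helper mentioned in the review, written for illustration only:

```scala
// Sketch: build a Hive-style partition name from a spec, matching column
// names case-insensitively but emitting the catalog's casing, and escaping
// special characters. escapePathName is a simplified stand-in, not Spark's.
def escapePathName(s: String): String =
  s.flatMap {
    case c if "/=%".contains(c) => f"%%${c.toInt}%02X"  // percent-encode / = %
    case c => c.toString
  }

def partitionName(colNames: Seq[String], spec: Map[String, String]): String =
  colNames.map { name =>
    // look up the value case-insensitively, emit the catalog's casing
    val value = spec.collectFirst { case (k, v) if k.equalsIgnoreCase(name) => v }.get
    escapePathName(name) + "=" + escapePathName(value)
  }.mkString("/")
```

Escaping matters exactly for the edge cases the reviewer raises: a partition value containing `/` or `=` would otherwise corrupt the path fragment when it is parsed back into a spec.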
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89656824 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -989,7 +989,7 @@ class Analyzer( withPosition(u) { try { outer.resolve(nameParts, resolver) match { -case Some(outerAttr) => OuterReference(outerAttr) +case Some(outerAttr) => OuterReference(outerAttr)() --- End diff -- Hmm. Correct. I'll check that again. BTW, what about the predicates? I felt the predicates are handled the same way.
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89656679 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala --- @@ -356,10 +356,17 @@ case class PrettyAttribute( * A place holder used to hold a reference that has been resolved to a field outside of the current * plan. This is used for correlated subqueries. */ -case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable { +case class OuterReference(e: NamedExpression)( + val exprId: ExprId = NamedExpression.newExprId) --- End diff -- Is it okay? I thought it works like 'Alias'. Anyway, no problem. I'll update like that.
[GitHub] spark pull request #15977: [SPARK-18436][SQL] isin causing SQL syntax error ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15977
[GitHub] spark issue #15977: [SPARK-18436][SQL] isin causing SQL syntax error with JD...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15977 LGTM. Merging to master/2.1. Thanks!
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89656204 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -989,7 +989,7 @@ class Analyzer( withPosition(u) { try { outer.resolve(nameParts, resolver) match { -case Some(outerAttr) => OuterReference(outerAttr) +case Some(outerAttr) => OuterReference(outerAttr)() --- End diff -- I am not sure the analyzer change has the desired effect. This just removes the outer reference from the tree, and it won't work if we use the attribute anywhere in the tree. For example:

```sql
select * from tbl_a
where id in (select x
             from (select tbl_b.id, tbl_a.id + 1 as x, tbl_a.id + tbl_b.id as y
                   from tbl_b)
             where y > 0)
```

I think we need to break this down into two steps: 1. Do not support this for now and just fix the named expression. That would be my goal for 2.1. 2. Try to see if we can rewrite the tree in such a way that we can extract the value. That would be my goal for 2.2. I am not sure how well we can make this work. In the end I think we need a dedicated subquery operator. cc @nsyca what do you think.
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89655153 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala --- @@ -356,10 +356,17 @@ case class PrettyAttribute( * A place holder used to hold a reference that has been resolved to a field outside of the current * plan. This is used for correlated subqueries. */ -case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable { +case class OuterReference(e: NamedExpression)( + val exprId: ExprId = NamedExpression.newExprId) --- End diff -- Use the `exprId` of the `NamedExpression`.
[GitHub] spark pull request #16012: [SPARK-17251][SQL] Support `OuterReference` in pr...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16012#discussion_r89655174 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala --- @@ -356,10 +356,17 @@ case class PrettyAttribute( * A place holder used to hold a reference that has been resolved to a field outside of the current * plan. This is used for correlated subqueries. */ -case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable { +case class OuterReference(e: NamedExpression)( + val exprId: ExprId = NamedExpression.newExprId) + extends LeafExpression with NamedExpression with Unevaluable { override def dataType: DataType = e.dataType override def nullable: Boolean = e.nullable override def prettyName: String = "outer" + + override def name: String = e.name + override def qualifier: Option[String] = e.qualifier + override def toAttribute: Attribute = e.toAttribute + override def newInstance(): NamedExpression = OuterReference(e)() --- End diff -- `OuterReference(e.newInstance())()`?
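The review suggestion above - reuse the `exprId` of the wrapped expression instead of allocating a fresh one in a second parameter list - can be sketched with simplified stand-ins for the Catalyst types (these are illustrative definitions, not Spark's actual classes):

```scala
// Simplified stand-ins for Catalyst's ExprId / NamedExpression / Attribute.
case class ExprId(id: Long)
trait NamedExpr { def name: String; def exprId: ExprId }
case class Attr(name: String, exprId: ExprId) extends NamedExpr

// OuterRef delegates to the wrapped expression, so it needs no second
// parameter list and no freshly allocated exprId.
case class OuterRef(e: NamedExpr) extends NamedExpr {
  def name: String = e.name
  def exprId: ExprId = e.exprId
}
```

Delegating also keeps case-class equality intuitive: two `OuterRef`s wrapping the same attribute compare equal, whereas a per-instance fresh id would make otherwise-identical references unequal.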
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15998 **[Test build #69174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69174/consoleFull)** for PR 15998 at commit [`4e03c3e`](https://github.com/apache/spark/commit/4e03c3e46d22e5fe1b1fbc01ea57ef15d2723b9b).
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15998 CC @ericl @cloud-fan
[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15736 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69172/ Test FAILed.
[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15736 Merged build finished. Test FAILed.
[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15736 **[Test build #69172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69172/consoleFull)** for PR 15736 at commit [`53ed170`](https://github.com/apache/spark/commit/53ed1708112fbf66b04fe89502e534ca3270d15c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14136: [SPARK-16282][SQL] Implement percentile SQL funct...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14136#discussion_r89652990 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -0,0 +1,292 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.Countings +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET +import org.apache.spark.util.collection.OpenHashMap + + +/** + * The Percentile aggregate function returns the exact percentile(s) of numeric column `expr` at + * the given percentage(s) with value range in [0.0, 1.0]. + * + * The operator is bound to the slower sort based aggregation path because the number of elements + * and their partial order cannot be determined in advance. 
Therefore we have to store all the + * elements in memory, and too many elements can cause GC pauses and eventually OutOfMemory + * Errors. + * + * @param child child expression that produces a numeric column value with `child.eval(inputRow)` + * @param percentageExpression Expression that represents a single percentage value or an array of + * percentage values. Each percentage value must be in the range + * [0.0, 1.0]. + */ +@ExpressionDescription( + usage = +""" + _FUNC_(col, percentage) - Returns the exact percentile value of numeric column `col` at the + given percentage. The value of percentage must be between 0.0 and 1.0. + + _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array + of numeric column `col` at the given percentage(s). Each value of the percentage array must + be between 0.0 and 1.0. +""") +case class Percentile( + child: Expression, + percentageExpression: Expression, + mutableAggBufferOffset: Int = 0, + inputAggBufferOffset: Int = 0) extends TypedImperativeAggregate[Countings] { + + def this(child: Expression, percentageExpression: Expression) = { +this(child, percentageExpression, 0, 0) + } + + override def prettyName: String = "percentile" + + override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): Percentile = +copy(mutableAggBufferOffset = newMutableAggBufferOffset) + + override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): Percentile = +copy(inputAggBufferOffset = newInputAggBufferOffset) + + // Mark as lazy so that percentageExpression is not evaluated during tree transformation. 
+ private lazy val (returnPercentileArray: Boolean, percentages: Seq[Number]) = +evalPercentages(percentageExpression) + + override def children: Seq[Expression] = child :: percentageExpression :: Nil + + // Returns null for empty inputs + override def nullable: Boolean = true + + override def dataType: DataType = +if (returnPercentileArray) ArrayType(DoubleType) else DoubleType + + override def inputTypes: Seq[AbstractDataType] = +Seq(NumericType, TypeCollection(NumericType, ArrayType)) + + override def checkInputDataTypes(): TypeCheckResult = +TypeUtils.checkForNumericExpr(child.dataType, "function percentile") + + override def createAggregationBuffer(): Countings = { +// Initialize new Countings instance here. +Countings() + } + + private def evalPercentages(expr: Expression): (Boolean, Seq[Number]) = { +val (isArrayType, values) =
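The quoted scaladoc describes the exact-percentile technique: buffer counts for every element, then read off the value at rank percentage * (N - 1) in sorted order. A minimal, self-contained sketch of that computation follows — illustrative only; the names and the linear-interpolation detail are assumptions, not Spark's actual `Percentile` implementation:

```scala
object PercentileSketch {
  // Exact percentile over buffered values: group into (value, count) pairs,
  // sort by value, and locate the element(s) at rank percentage * (N - 1),
  // interpolating linearly when the rank falls between two elements.
  def percentile(values: Seq[Double], percentage: Double): Double = {
    require(values.nonEmpty, "percentile of empty input is undefined (the aggregate returns null)")
    require(percentage >= 0.0 && percentage <= 1.0, "percentage must be in [0.0, 1.0]")

    // Aggregate into a sorted (value, count) sequence, as a count-map buffer would.
    val counts: Seq[(Double, Long)] = values.groupBy(identity).toSeq
      .map { case (v, group) => (v, group.size.toLong) }
      .sortBy(_._1)
    val total = counts.map(_._2).sum
    val rank = percentage * (total - 1) // target position in the sorted order

    // Walk the counts to find the value covering sorted position r.
    def valueAt(r: Long): Double = {
      var seen = 0L
      var i = 0
      while (i < counts.length - 1 && seen + counts(i)._2 <= r) {
        seen += counts(i)._2
        i += 1
      }
      counts(i)._1
    }

    val lo = rank.toLong
    val vLo = valueAt(lo)
    val vHi = valueAt(math.ceil(rank).toLong)
    vLo + (vHi - vLo) * (rank - lo)
  }
}
```

Note how the whole input must sit in the buffer before any rank can be answered — the point the scaladoc makes about memory pressure on large groups.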
[GitHub] spark pull request #14136: [SPARK-16282][SQL] Implement percentile SQL funct...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14136#discussion_r89647985 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
[GitHub] spark pull request #14136: [SPARK-16282][SQL] Implement percentile SQL funct...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14136#discussion_r89646058 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---