[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025

@felixcheung @HyukjinKwon Per this [suggestion](https://github.com/apache/spark/pull/18003#discussion-diff-116853922L57), I'm creating more meaningful examples for the SQL functions. Since these functions can be grouped, we can create a single doc page for each group of functions and construct concrete, useful examples for each group. The benefits are obvious:
- Centralized documentation of related functions. This makes it easier for users to navigate. Right now there are TOO many items in the `see also` section.
- Examples can share the same data. This avoids creating a data frame for each function, which happens when they are documented separately.
- Cleaner structure and far fewer Rd files.

Indeed, this is part of what was discussed in #17161. I have explored this for a few functions to illustrate the idea. Since this is a big effort, I would like to get folks' opinions before extending it to all functions. In this commit, I created docs for some sample functions in three groups:
- 'column_datetime_functions' to document all datetime functions
- 'column_aggregate_functions' to document all aggregate functions
- 'column_math_functions' to document all math functions
- ...

Below is what 'column_datetime_functions.Rd' looks like:
![image](https://cloud.githubusercontent.com/assets/11082368/26189797/426029f0-3b5b-11e7-9175-c63b0e5c0014.png)
![image](https://cloud.githubusercontent.com/assets/11082368/26189810/56630954-3b5b-11e7-9d70-3e74b6d3b032.png)
[GitHub] spark issue #17997: [SPARK-20763][SQL] The function of `month` and `day` retu...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/17997 ok to test
[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/18025

[WIP][SparkR] Update doc and examples for sql functions

## What changes were proposed in this pull request?

Create better examples for SQL functions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark sparkRDoc4

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18025.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18025

commit 5c8cd1e5da896d78ea3cb4fcf5e046d22090dc2a
Author: Wayne Zhang
Date: 2017-05-18T06:32:42Z

sql function examples prototype
[GitHub] spark pull request #18020: [SPARK-20700][SQL] InferFiltersFromConstraints st...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18020
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18020 Thanks! Merging to master/2.2.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16989 **[Test build #77039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77039/testReport)** for PR 16989 at commit [`4ece142`](https://github.com/apache/spark/commit/4ece142d2a3c4b46a712539e3aa7f7ee0d4e6b5b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18011 **[Test build #77040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77040/testReport)** for PR 18011 at commit [`dd3bf01`](https://github.com/apache/spark/commit/dd3bf0113cbf66ebf784f68d7f602c39f4a46b8b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/16989

I think that the current use of `MemoryMode.OFF_HEAP` allocation will cause problems in out-of-the-box deployments using the default configurations. In Spark's current memory manager implementation, the total amount of Spark-managed off-heap memory is controlled by `spark.memory.offHeap.size`, whose default value is 0. In this PR, the comment on `spark.reducer.maxReqSizeShuffleToMem` says that it should be smaller than `spark.memory.offHeap.size`, and yet its default is 200 megabytes, so the default configuration is invalid. Because `preferDirectBufs()` is `true` by default, it looks like the code here will always try to reserve memory using `MemoryMode.OFF_HEAP`, and these reservations will always fail in the default configuration because the off-heap size will be zero. So I think the net effect of this patch will be to always spill to disk.

One way to address this problem is to configure the default value of `spark.memory.offHeap.size` to match the JVM's internal limit on the amount of direct buffers that it can allocate, minus some percentage or fixed overhead. Basically, the problem is that Spark's off-heap memory manager was originally designed to manage only the off-heap memory explicitly allocated by Spark itself when creating its own buffers/pages or caching blocks, not to account for off-heap memory used by lower-level code or third-party libraries. I'll see if I can think of a clean way to fix this, which I think will need to be done before the defaults used here can work as intended.
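The mismatch being described can be stated in a few lines. This is an illustrative sketch using the config names and defaults quoted above, not code from the patch:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Defaults quoted in the discussion above:
val offHeapSize = conf.getSizeAsBytes("spark.memory.offHeap.size", "0")
val maxReqToMem = conf.getSizeAsBytes("spark.reducer.maxReqSizeShuffleToMem", "200m")

// The comment on spark.reducer.maxReqSizeShuffleToMem says it should stay below
// the Spark-managed off-heap size; with the defaults (200 MB vs 0) this fails,
// so every MemoryMode.OFF_HEAP reservation fails and large fetches always spill.
require(maxReqToMem < offHeapSize,
  s"maxReqSizeShuffleToMem ($maxReqToMem) must be smaller than offHeap.size ($offHeapSize)")
```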
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 ping @MLnick Do you have more comments on this? Thanks.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18000 LGTM
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117168737 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -538,6 +538,21 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names") { --- End diff -- Looks much better now.
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117168546 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2} *data type is nullable. */ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext { --- End diff -- Sure, I just reverted it and made a simple test.
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117168094 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [flattened diff context: null bookkeeping moves from the base `gatherStats` into a shared `gatherNullStats()`; each `ColumnStats` subclass becomes `final` and implements `gatherStats` plus a typed `gatherValueStats`; `collectedStatistics` now returns `Array[Any]` instead of `GenericInternalRow`. The review comment itself is truncated in the archive.]
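For readability, here is a condensed Scala restatement of the refactoring pattern visible in the flattened diff above. It is a sketch of the shape of the change (one representative subclass, assuming it lives in the original `org.apache.spark.sql.execution.columnar` package so that `BOOLEAN` resolves), not the full patch:

```scala
import org.apache.spark.sql.catalyst.InternalRow

private[columnar] sealed trait ColumnStats extends Serializable {
  protected var count = 0
  protected var nullCount = 0
  protected var sizeInBytes = 0L

  // Now abstract: each subclass gathers its own stats instead of calling super.
  def gatherStats(row: InternalRow, ordinal: Int): Unit

  // Shared null bookkeeping, factored out of the old base implementation.
  def gatherNullStats(): Unit = {
    nullCount += 1
    sizeInBytes += 4 // 4 bytes for the null position
    count += 1
  }

  // Returned a GenericInternalRow before the patch.
  def collectedStatistics: Array[Any]
}

private[columnar] final class BooleanColumnStats extends ColumnStats {
  protected var upper = false
  protected var lower = true

  override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
    if (!row.isNullAt(ordinal)) gatherValueStats(row.getBoolean(ordinal))
    else gatherNullStats()
  }

  def gatherValueStats(value: Boolean): Unit = {
    if (value > upper) upper = value
    if (value < lower) lower = value
    sizeInBytes += BOOLEAN.defaultSize
    count += 1
  }

  override def collectedStatistics: Array[Any] =
    Array[Any](lower, upper, nullCount, count, sizeInBytes)
}
```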
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117168074 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entry above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117167259 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entries above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117167238 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -730,4 +726,49 @@ public void testToLong() throws IOException { assertFalse(negativeInput, UTF8String.fromString(negativeInput).toLong(wrapper)); } } + @Test + public void trimsChar() { --- End diff -- sure
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117167072 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entries above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117166546 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- please see ![scala](https://cloud.githubusercontent.com/assets/26266482/26188588/a9682798-3bd2-11e7-99b0-31587235f9a3.png)
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166463 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -510,6 +510,67 @@ public UTF8String trim() { } } + /** + * Removes all specified trim character string either from the beginning or the ending of a string + * @param trimString the trim character string + */ + public UTF8String trim(UTF8String trimString) { +// this method do the trimLeft first, then trimRight +int s = 0; // the searching byte position of the input string +int i = 0; // the first beginning byte position of a non-matching character +int e = 0; // the last byte position +int numChars = 0; // number of characters from the input string +int[] stringCharLen = new int[numBytes]; // array of character length for the input string +int[] stringCharPos = new int[numBytes]; // array of the first byte position for each character in the input string +int searchCharBytes; + +while (s < this.numBytes) { + UTF8String searchChar = copyUTF8String(s, s + numBytesForFirstByte(this.getByte(s)) - 1); + searchCharBytes = searchChar.numBytes; + // try to find the matching for the searchChar in the trimString set + if (trimString.find(searchChar, 0) >= 0) { --- End diff -- I described the behavior in the comments. Thanks.
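As a behavioral reference, this is what the new overload computes on the example strings used elsewhere in this PR (a sketch based on the diff above, which runs the trim-left pass first and then the trim-right pass):

```scala
import org.apache.spark.unsafe.types.UTF8String

// trim(trimString) strips, from both ends, every character that occurs in trimString.
val trimmed = UTF8String.fromString("SSparkSQLS").trim(UTF8String.fromString("SL"))
assert(trimmed == UTF8String.fromString("parkSQ"))
```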
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166353 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -1069,6 +1069,8 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging override def visitFunctionCall(ctx: FunctionCallContext): Expression = withOrigin(ctx) { // Create the function call. val name = ctx.qualifiedName.getText +val trimFuncName = Option(ctx.trimOperator).map { + o => visitTrimFuncName(ctx, o)} --- End diff -- changed
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166374 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -510,6 +510,67 @@ public UTF8String trim() { } } + /** + * Removes all specified trim character string either from the beginning or the ending of a string --- End diff -- changed
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166341 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -461,68 +462,246 @@ case class FindInSet(left: Expression, right: Expression) extends BinaryExpressi } /** - * A function that trim the spaces from both ends for the specified string. + * A function that trims leading or trailing characters (or both) from the specified string. --- End diff -- added
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166332 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -461,68 +462,246 @@ case class FindInSet(left: Expression, right: Expression) extends BinaryExpressi } /** - * A function that trim the spaces from both ends for the specified string. + * A function that trims leading or trailing characters (or both) from the specified string. */ @ExpressionDescription( - usage = "_FUNC_(str) - Removes the leading and trailing space characters from `str`.", + usage = """ +_FUNC_(str) - Removes the leading and trailing space characters from `str`. +_FUNC_(BOTH trimString FROM str) - Remove the leading and trailing trimString from `str` +_FUNC_(LEADING trimChar FROM str) - Remove the leading trimString from `str` +_FUNC_(TRAILING trimChar FROM str) - Remove the trailing trimString from `str` + """, extended = """ +Arguments: + str - a string expression + trimString - the trim string + BOTH, FROM - these are keyword to specify for trim string from both ends of the string + LEADING, FROM - these are keyword to specify for trim string from left end of the string + TRAILING, FROM - these are keyword to specify for trim string from right end of the string Examples: > SELECT _FUNC_('SparkSQL '); SparkSQL + > SELECT _FUNC_(BOTH 'SL' FROM 'SSparkSQLS'); + parkSQ + > SELECT _FUNC_(LEADING 'paS' FROM 'SSparkSQLS'); + rkSQLS + > SELECT _FUNC_(TRAILING 'SLQ' FROM 'SSparkSQLS'); + SSparkS """) -case class StringTrim(child: Expression) - extends UnaryExpression with String2StringExpression { +case class StringTrim(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes { + + require(children.size <= 2 && children.nonEmpty, +s"$prettyName requires at least one argument and no more than two.") + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType) - def convert(v: UTF8String): UTF8String = v.trim() + override def nullable: Boolean = children.exists(_.nullable) + override def foldable: Boolean = children.forall(_.foldable) override def prettyName: String = "trim" + override def eval(input: InternalRow): Any = { +val inputs = children.map(_.eval(input).asInstanceOf[UTF8String]) +if (inputs(0) != null) { --- End diff -- sure.
[GitHub] spark issue #17992: [SPARK-20759] SCALA_VERSION in _config.yml should be con...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/17992 @srowen The test hasn't finished. Do I need to do anything?
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163859 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- I can't remember what flags/options run the style check with mvn, but you can always run it directly with `dev/scalastyle`
[GitHub] spark pull request #18024: [SPARK-20792][SS] Support same timeout operations...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/18024

[SPARK-20792][SS] Support same timeout operations in mapGroupsWithState function in batch queries as in streaming queries

## What changes were proposed in this pull request?

Currently, in batch queries, timeout is disabled (i.e. GroupStateTimeout.NoTimeout), which means any GroupState.setTimeout*** operation would throw UnsupportedOperationException. This makes it weird when converting a streaming query into a batch query by changing the input DF from streaming to a batch DF: if the timeout was enabled and used, the batch query will start throwing UnsupportedOperationException. This PR creates the dummy state in batch queries with the provided timeoutConf so that it behaves in the same way.

## How was this patch tested?

Additional tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-20792

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18024.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18024

commit eef789fe1fd04a98b4d82da6864ca4f4b23c2bfb
Author: Tathagata Das
Date: 2017-05-18T05:31:44Z

Fixed bug
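A minimal sketch of the scenario the description refers to: the same `mapGroupsWithState` call applied to a batch Dataset. The `spark` session and sample data are placeholders; before this patch, the `setTimeoutDuration` call below threw `UnsupportedOperationException` in batch mode:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

val ds = Seq("a", "b", "a").toDS() // a batch Dataset, not a streaming one

val counts = ds
  .groupByKey(x => x)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (key: String, values: Iterator[String], state: GroupState[Int]) =>
      state.setTimeoutDuration("10 seconds") // no longer throws in batch queries
      (key, values.size)
  }
```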
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163563 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- How do I run the style checker? I can build the code successfully with Maven.
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163321 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- have you run the style checker? I think this may be in the wrong order
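For reference, Spark's Scala import style sorts the selectors inside braces alphabetically, so the fixed line would presumably be:

```scala
import scala.xml.{Node, NodeSeq}
```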
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117162950 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2} *data type is nullable. */ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext { --- End diff -- can we just have a simple end-to-end test? The fix is actually very simple and seems not worth such complex tests to verify it.
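One possible shape for such an end-to-end test, sketched against `ParquetFilterSuite`'s existing helpers (an illustration, not necessarily the test that was eventually committed):

```scala
test("SPARK-20364: filter pushdown on fields having dots in the names") {
  import testImplicits._
  withTempPath { path =>
    // The column name contains a dot; with pushdown enabled this used to drop rows.
    Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    val readBack = spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL")
    assert(readBack.count() == 1)
  }
}
```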
[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18014 I thought that idea was for Apache Arrow. We could use the binary type for `UnsafeArrayData`, but it involves some complexity to use [`ColumnVector.Array`](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L1015-L1017). Is it better to use the existing code?
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17995 ping @yanboliang
[GitHub] spark issue #17999: [SPARK-20751][SQL] Add built-in SQL Function - COT
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17999 **[Test build #77041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77041/testReport)** for PR 17999 at commit [`c80c184`](https://github.com/apache/spark/commit/c80c184d5a9f85e2bff740e8cf96bd9a97d0f8a7).
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117162403 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters { * Converts data sources filters to Parquet filter predicates. */ def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = { -val dataTypeOf = getFieldMap(schema) +val nameTypeMap = getFieldMap(schema) --- End diff -- nit: `nameToType`
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18011 LGTM
[GitHub] spark pull request #18011: [SPARK-19089][SQL] Add support for nested sequenc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18011#discussion_r117161759 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala --- @@ -258,6 +258,10 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext { ListClass(List(1)) -> Queue("test" -> SeqClass(Seq(2 } + test("nested sequences") { +checkDataset(Seq(Seq(Seq(1))).toDS(), Seq(Seq(1))) --- End diff -- let's also add a test for specific collection types, e.g. `List(Queue(1))`
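The suggested addition might look like this in `DatasetPrimitiveSuite` (a sketch; the suite already uses `Queue` from `scala.collection.immutable`):

```scala
test("nested sequences with specific collection types") {
  checkDataset(Seq(List(Queue(1))).toDS(), List(Queue(1)))
  checkDataset(Seq(Queue(List(1))).toDS(), Queue(List(1)))
}
```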
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18011 **[Test build #77040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77040/testReport)** for PR 18011 at commit [`dd3bf01`](https://github.com/apache/spark/commit/dd3bf0113cbf66ebf784f68d7f602c39f4a46b8b).
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18011 ok to test
[GitHub] spark pull request #16986: [SPARK-18891][SQL] Support for Map collection typ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16986#discussion_r117160501 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala --- @@ -329,35 +329,19 @@ object ScalaReflection extends ScalaReflection { } UnresolvedMapObjects(mapFunction, getPath, Some(cls)) - case t if t <:< localTypeOf[Map[_, _]] => + case t if t <:< localTypeOf[Map[_, _]] || t <:< localTypeOf[java.util.Map[_, _]] => --- End diff -- we should handle java map in `JavaTypeInference`, but I think it's better to do it in another PR and focus on scala map in this PR.
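At the API level, the Scala side of the change is about letting Datasets deserialize into `Map`-typed fields. A minimal sketch of what that enables, assuming a SparkSession with `import spark.implicits._` in scope:

```scala
case class MapHolder(m: Map[Int, String])

val ds = Seq(MapHolder(Map(1 -> "a", 2 -> "b"))).toDS()
// Before SPARK-18891, collecting back into a Map-typed field was unsupported.
assert(ds.collect().head.m == Map(1 -> "a", 2 -> "b"))
```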
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000 I would rather say it is a limitation of the Parquet API. It looks like there is no way to properly set column names that contain dots in Parquet filters. https://github.com/apache/spark/pull/17680 suggests a hacky workaround to set this.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18000 A high-level question: is this a Parquet bug, or is Spark not using the Parquet reader correctly?
[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18014 I may be missing something, but can we just treat the array type as binary and put it in `ColumnVector`?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117158817 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -603,7 +603,13 @@ object DateTimeUtils { */ private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int) = { // add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) -val daysNormalized = daysSince1970 + toYearZero +var daysSince1970Tmp = daysSince1970 +// In history,the period(5.10.1582 ~ 14.10.1582) is not exist --- End diff -- OK, I will do that, thanks @kiszk @cloud-fan
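The ten-day gap that the diff's comment refers to is the Julian-to-Gregorian calendar cutover, which `java.util.GregorianCalendar` models by default. A quick illustration:

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1582, Calendar.OCTOBER, 4) // the last Julian day
cal.add(Calendar.DAY_OF_MONTH, 1)
// October 5-14, 1582 never existed: the next day is October 15.
assert(cal.get(Calendar.DAY_OF_MONTH) == 15)
```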
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158766

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---

@@ -175,7 +178,7 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
     sql(s"INSERT INTO TABLE $textTable SELECT * FROM src")
     checkTableStats(
       textTable,
-      hasSizeInBytes = false,
+      hasSizeInBytes = true,

--- End diff --

Why is the behavior changed?
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158738

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ShowCreateTableSuite.scala ---

@@ -325,26 +325,24 @@ class ShowCreateTableSuite extends QueryTest with SQLTestUtils with TestHiveSing
       "last_modified_by",
       "last_modified_time",
       "Owner:",
-      "COLUMN_STATS_ACCURATE",
       // The following are hive specific schema parameters which we do not need to match exactly.
-      "numFiles",
-      "numRows",
-      "rawDataSize",
-      "totalSize",
       "totalNumberFiles",
       "maxFileSize",
-      "minFileSize",
-      // EXTERNAL is not non-deterministic, but it is filtered out for external tables.
-      "EXTERNAL"
+      "minFileSize"
     )

     table.copy(
       createTime = 0L,
       lastAccessTime = 0L,
-      properties = table.properties.filterKeys(!nondeterministicProps.contains(_))
+      properties = table.properties.filterKeys(!nondeterministicProps.contains(_)),
+      stats = None,
+      ignoredProperties = Map.empty
     )
   }

+    val e = normalize(actual)
+    val m = normalize(expected)

--- End diff --

remove this?
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158531

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---

@@ -414,6 +415,50 @@ private[hive] class HiveClientImpl(
       val properties = Option(h.getParameters).map(_.asScala.toMap).orNull

+      // Hive-generated Statistics are also recorded in ignoredProperties
+      val ignoredProperties = scala.collection.mutable.Map.empty[String, String]
+      for (key <- HiveStatisticsProperties; value <- properties.get(key)) {
+        ignoredProperties += key -> value
+      }
+
+      val excludedTableProperties = HiveStatisticsProperties ++ Set(
+        // The property value of "comment" is moved to the dedicated field "comment"
+        "comment",
+        // For EXTERNAL_TABLE, the table properties has a particular field "EXTERNAL". This is added
+        // in the function toHiveTable.
+        "EXTERNAL"
+      )
+
+      val filteredProperties = properties.filterNot {
+        case (key, _) => excludedTableProperties.contains(key)
+      }
+      val comment = properties.get("comment")
+
+      val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
+      val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
+      def rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {
+        case Some(c) if c >= 0 => Some(c)
+        case _ => None
+      }
+      // TODO: check if this estimate is valid for tables after partition pruning.
+      // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
+      // relatively cheap if parameters for the table are populated into the metastore.
+      // Currently, only totalSize, rawDataSize, and row_count are used to build the field `stats`
+      // TODO: stats should include all the other two fields (`numFiles` and `numPartitions`).
+      // (see StatsSetupConst in Hive)
+      val stats =
+        // When table is external, `totalSize` is always zero, which will influence join strategy
+        // so when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
+        // return None. Later, we will use the other ways to estimate the statistics.
+        if (totalSize.isDefined && totalSize.get > 0L) {

--- End diff --

the indentation is wrong
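For reference, the fallback logic in the quoted block boils down to something like the following condensed sketch (illustrative only, with sizes as `BigInt` as in the quote; `chooseSizeInBytes` is not the actual method name):

```scala
// Prefer totalSize; if it is absent or zero (typical for external tables),
// fall back to rawDataSize; if that is also unusable, report no stats.
def chooseSizeInBytes(
    totalSize: Option[BigInt],
    rawDataSize: Option[BigInt]): Option[BigInt] =
  totalSize.filter(_ > 0).orElse(rawDataSize.filter(_ > 0))

assert(chooseSizeInBytes(Some(BigInt(0)), Some(BigInt(42))).contains(BigInt(42)))
assert(chooseSizeInBytes(Some(BigInt(0)), Some(BigInt(0))).isEmpty)
```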
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117158477

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

OK, thanks @cloud-fan
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117158402

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

@@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }

+  test("SPARK-20364 Do not push down filters when column names have dots") {
+    implicit class StringToAttribute(str: String) {
+      // Implicits for attr, $ and symbol do not handle backticks.
+      def attribute: Attribute = UnresolvedAttribute.quotedString(str)

--- End diff --

Yea, actually my initial local version included the change for `symbol` and `$` to match them to `Column`. It also looks sensible per https://github.com/apache/spark/pull/7969. I believe this is an internal API - https://github.com/apache/spark/blob/e9c91badce64731ffd3e53cbcd9f044a7593e6b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L24 - so I guess it would be fine even if it introduces a behaviour change. Nevertheless, I believe some folks don't like this change much, and I wanted to avoid such changes here for now (it is the single place that needs it for now ... ).
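For context, a quick sketch of the difference between the plain and quoted constructors being discussed (assuming Spark's catalyst module on the classpath; the behavior shown is as commonly understood and worth double-checking):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The default factory splits on dots, reading "a.b" as field b of a:
println(UnresolvedAttribute("a.b").nameParts)                // Seq(a, b)
// quotedString honors backticks, keeping "a.b" as one flat column name:
println(UnresolvedAttribute.quotedString("`a.b`").nameParts) // Seq(a.b)
```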
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117157965

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

@@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }

+  test("SPARK-20364 Do not push down filters when column names have dots") {
+    implicit class StringToAttribute(str: String) {
+      // Implicits for attr, $ and symbol do not handle backticks.
+      def attribute: Attribute = UnresolvedAttribute.quotedString(str)

--- End diff --

Shall we make `$` use `UnresolvedAttribute.quotedString`?
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17995

Merged build finished. Test PASSed.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17995

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77038/
Test PASSed.
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117157765

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Let's follow MySQL.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17995

**[Test build #77038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77038/testReport)** for PR 17995 at commit [`bed4c41`](https://github.com/apache/spark/commit/bed4c4183fa94b20d978ac9e61d225ea989c8a73).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17994: [SPARK-20505][ML] Add docs and examples for ml.st...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17994
[GitHub] spark issue #17994: [SPARK-20505][ML] Add docs and examples for ml.stat.Corr...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/17994

Merged into master and branch-2.2. Thanks for reviewing.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16989

**[Test build #77039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77039/testReport)** for PR 16989 at commit [`4ece142`](https://github.com/apache/spark/commit/4ece142d2a3c4b46a712539e3aa7f7ee0d4e6b5b).
[GitHub] spark pull request #17996: [SPARK-20506][DOCS] 2.2 migration guide
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17996#discussion_r117155950

--- Diff: docs/ml-guide.md ---

@@ -72,35 +72,26 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.

-## From 2.0 to 2.1
+## From 2.1 to 2.2

 ### Breaking changes
-
-**Deprecated methods removed**
-* `setLabelCol` in `feature.ChiSqSelectorModel`
-* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
-* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
-* `model` in `regression.LinearRegressionSummary`
-* `validateParams` in `PipelineStage`
-* `validateParams` in `Evaluator`
+There are no breaking changes.

 ### Deprecations and changes of behavior

 **Deprecations**

-* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
-  Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+There are no deprecations.

 **Changes of behavior**

--- End diff --

Should we include #17233 in this section?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117155497

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

@cloud-fan Because, historically, the period 5.10.1582 ~ 14.10.1582 does not exist (those ten days were skipped when the Gregorian calendar was introduced).
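A quick way to observe the skipped range (a sketch assuming `java.util`'s default Julian-to-Gregorian cutover, which `SimpleDateFormat` inherits; with lenient parsing the nonexistent dates should roll forward by ten days):

```scala
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-MM-dd")
// 1582-10-05 through 1582-10-14 do not exist under the default cutover,
// so a lenient parse should normalize 1582-10-05 to 1582-10-15.
println(fmt.format(fmt.parse("1582-10-05"))) // expected: 1582-10-15
println(fmt.format(fmt.parse("1582-10-15"))) // 1582-10-15
```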
[GitHub] spark issue #18017: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18017

(#16654 was taken out as it was closed.)
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117155315

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Why is `278` better?
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16989

Checking the code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/ConfigProvider.scala#L59

`SparkConfigProvider` just checks whether the key is in the JMap and, if not, returns the default value. It doesn't check the alternatives. I think this is the reason `org.apache.spark.memory.TaskMemoryManagerSuite.offHeapConfigurationBackwardsCompatibility` fails.
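A self-contained sketch of the missing lookup being described; the names here are hypothetical, not the actual Spark internals. The point is just that a provider should consult deprecated alternative keys before falling back to the default:

```scala
// Try the primary key first, then each deprecated alternative key,
// and only then fall back to the default value.
def getWithAlternatives(
    conf: java.util.Map[String, String],
    key: String,
    alternatives: Seq[String],
    default: String): String =
  (key +: alternatives).iterator
    .map(k => conf.get(k))   // java.util.Map returns null for absent keys
    .find(_ != null)
    .getOrElse(default)

val m = new java.util.HashMap[String, String]()
m.put("spark.unsafe.offHeap", "true") // only the deprecated key is set
println(getWithAlternatives(m, "spark.memory.offHeap.enabled",
  Seq("spark.unsafe.offHeap"), "false")) // prints: true
```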
[GitHub] spark issue #17869: [SPARK-20609][CORE]Run the SortShuffleSuite unit tests h...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/17869

@srowen I have committed the modifications to the PR. Can you help me run the test build again? Thanks.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16989

That seems impossible; can you give an example? BTW, if this blocks you, just revert the off-heap config changes.
[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/18016

@hvanhovell @srowen I have modified it again, and `floor` has the same problem. Please review, thanks.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17995

**[Test build #77038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77038/testReport)** for PR 17995 at commit [`bed4c41`](https://github.com/apache/spark/commit/bed4c4183fa94b20d978ac9e61d225ea989c8a73).
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17995

Jenkins, retest this please
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153595

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

In MySQL, the result is:

mysql> select dayofyear("1982-10-04");
+-------------------------+
| dayofyear("1982-10-04") |
+-------------------------+
|                     277 |
+-------------------------+
1 row in set (0.00 sec)

mysql> select dayofyear("1982-10-015");
+--------------------------+
| dayofyear("1982-10-015") |
+--------------------------+
|                      288 |
+--------------------------+
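These numbers are easy to cross-check with `java.time`, which uses the proleptic Gregorian calendar:

```scala
import java.time.LocalDate

println(LocalDate.of(1982, 10, 4).getDayOfYear)  // 277
println(LocalDate.of(1982, 10, 15).getDayOfYear) // 288
```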
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153570

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala ---

@@ -53,219 +53,299 @@ private[columnar] sealed trait ColumnStats extends Serializable {
   /**
    * Gathers statistics information from `row(ordinal)`.
    */
-  def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    if (row.isNullAt(ordinal)) {
-      nullCount += 1
-      // 4 bytes for null position
-      sizeInBytes += 4
-    }
+  def gatherStats(row: InternalRow, ordinal: Int): Unit
+
+  /**
+   * Gathers statistics information on `null`.
+   */
+  def gatherNullStats(): Unit = {
+    nullCount += 1
+    // 4 bytes for null position
+    sizeInBytes += 4
     count += 1
   }

   /**
-   * Column statistics represented as a single row, currently including closed lower bound, closed
+   * Column statistics represented as an array, currently including closed lower bound, closed
    * upper bound and null count.
    */
-  def collectedStatistics: GenericInternalRow
+  def collectedStatistics: Array[Any]
 }

 /**
  * A no-op ColumnStats only used for testing purposes.
  */
-private[columnar] class NoopColumnStats extends ColumnStats {
-  override def gatherStats(row: InternalRow, ordinal: Int): Unit = super.gatherStats(row, ordinal)
+private[columnar] final class NoopColumnStats extends ColumnStats {
+  override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
+    if (!row.isNullAt(ordinal)) {
+      count += 1
+    } else {
+      gatherNullStats
+    }
+  }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](null, null, nullCount, count, 0L))
+  override def collectedStatistics: Array[Any] = Array[Any](null, null, nullCount, count, 0L)
 }

-private[columnar] class BooleanColumnStats extends ColumnStats {
+private[columnar] final class BooleanColumnStats extends ColumnStats {
   protected var upper = false
   protected var lower = true

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getBoolean(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += BOOLEAN.defaultSize
+      gatherValueStats(value)
+    } else {
+      gatherNullStats
     }
   }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](lower, upper, nullCount, count, sizeInBytes))
+  def gatherValueStats(value: Boolean): Unit = {
+    if (value > upper) upper = value
+    if (value < lower) lower = value
+    sizeInBytes += BOOLEAN.defaultSize
+    count += 1
+  }
+
+  override def collectedStatistics: Array[Any] =
+    Array[Any](lower, upper, nullCount, count, sizeInBytes)
 }

-private[columnar] class ByteColumnStats extends ColumnStats {
+private[columnar] final class ByteColumnStats extends ColumnStats {
   protected var upper = Byte.MinValue
   protected var lower = Byte.MaxValue

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getByte(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += BYTE.defaultSize
+      gatherValueStats(value)
+    } else {
+      gatherNullStats
    }
   }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](lower, upper, nullCount, count, sizeInBytes))
+  def gatherValueStats(value: Byte): Unit = {
+    if (value > upper) upper = value
+    if (value < lower) lower = value
+    sizeInBytes += BYTE.defaultSize
+    count += 1
+  }
+
+  override def collectedStatistics: Array[Any] =
+    Array[Any](lower, upper, nullCount, count, sizeInBytes)
 }

-private[columnar] class ShortColumnStats extends ColumnStats {
+private[columnar] final class ShortColumnStats extends ColumnStats {
   protected var upper = Short.MinValue
   protected var lower = Short.MaxValue

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getShort(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += SHORT.defaultSize
+      gatherValueStat
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153480

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- (quotes the same hunk as the comment above)
[GitHub] spark pull request #16654: [SPARK-19303][ML][WIP] Add evaluate method in clu...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/16654
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- (quotes the same hunk as the comment above)
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153106

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Can we check with other databases?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153080

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---

@@ -603,7 +603,13 @@ object DateTimeUtils {
    */
   private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int) = {
     // add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999)
-    val daysNormalized = daysSince1970 + toYearZero
+    var daysSince1970Tmp = daysSince1970
+    // In history,the period(5.10.1582 ~ 14.10.1582) is not exist

--- End diff --

It's only about the comment, and I think 1582-10-5 or Oct. 5, 1582 is more human-readable.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16989

It seems like `SparkConfigProvider` is not checking the alternatives in `SparkConf`. That's why `spark.memory.offHeap.enabled` is not set (it still has the default value), even though we've already set `spark.unsafe.offHeap`.
[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/16989#discussion_r117152091

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---

@@ -278,4 +278,39 @@ package object config {
       "spark.io.compression.codec.")
     .booleanConf
     .createWithDefault(false)
+
+  private[spark] val SHUFFLE_ACCURATE_BLOCK_THRESHOLD =
+    ConfigBuilder("spark.shuffle.accurateBlkThreshold")
+      .doc("When we compress the size of shuffle blocks in HighlyCompressedMapStatus, we will " +
+        "record the size accurately if it's above the threshold specified by this config. This " +
+        "helps to prevent OOM by avoiding underestimating shuffle block size when fetch shuffle " +
+        "blocks.")
+      .longConf
+      .createWithDefault(100 * 1024 * 1024)
+
+  private[spark] val MEMORY_OFF_HEAP_ENABLED =
+    ConfigBuilder("spark.memory.offHeap.enabled")
+      .doc("If true, Spark will attempt to use off-heap memory for certain operations(e.g. sort, " +
+        "aggregate, etc. However, the buffer used for fetching shuffle blocks is always " +
+        "off-heap). If off-heap memory use is enabled, then spark.memory.offHeap.size must be " +
+        "positive.")
+      .booleanConf
+      .createWithDefault(false)
+
+  private[spark] val MEMORY_OFF_HEAP_SIZE =
+    ConfigBuilder("spark.memory.offHeap.size")
+      .doc("The absolute amount of memory in bytes which can be used for off-heap allocation." +
+        " This setting has no impact on heap memory usage, so if your executors' total memory" +
+        " consumption must fit within some hard limit then be sure to shrink your JVM heap size" +
+        " accordingly. This must be set to a positive value when " +
+        "spark.memory.offHeap.enabled=true.")
+      .longConf

--- End diff --

Yes, I should refine it.
[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16989#discussion_r117151567

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- (quotes the same hunk as the comment above, ending at the `.longConf` line)

--- End diff --

we should use `.bytesConf(ByteUnit.BYTE)`, see `SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE` as an example
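If it helps, the suggested declaration would look roughly like this (a sketch mirroring the pattern the reviewer points to, not the final code; it assumes `ByteUnit` from `org.apache.spark.network.util` is imported in the `config` package object):

```scala
// Declaring the size as a bytes config lets users write values like "1g" or
// "512m", while the entry still resolves to a Long number of bytes.
private[spark] val MEMORY_OFF_HEAP_SIZE =
  ConfigBuilder("spark.memory.offHeap.size")
    .doc("The absolute amount of memory in bytes which can be used for off-heap " +
      "allocation. This must be set to a positive value when " +
      "spark.memory.offHeap.enabled=true.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(0)
```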
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/18015

@ajbozarth Thank you very much for the suggestions; I have made the modifications.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Merged build finished. Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77037/
Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14971

**[Test build #77037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77037/testReport)** for PR 14971 at commit [`cce31db`](https://github.com/apache/spark/commit/cce31db80cdc66516e3e537f33a3611b07186b6b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Merged build finished. Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77036/
Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14971

**[Test build #77036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77036/testReport)** for PR 14971 at commit [`22a2c00`](https://github.com/apache/spark/commit/22a2c00333ffc39458f45d629c1b3199f73f1f3e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17435

I think we need a test and @holdenk's review.
[GitHub] spark issue #18017: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18017

(Actually, let me take out #17435. It was recently updated and I believe it has a point there.)
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148652

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -33,24 +33,24 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
   override def render(request: HttpServletRequest): Seq[Node] = {
     val currentTime = System.currentTimeMillis()

-    val content = listener.synchronized {
+    var content : NodeSeq = listener.synchronized {

--- End diff --

I'd rather not switch to a `var` (it's very un-Scala); see below for an alternative suggestion.
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148750

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -61,6 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
           details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
         }}

+    content =
+      <div>
+        <ul>
+          {
+            if (listener.getRunningExecutions.nonEmpty) {
+              <li>
+                <strong>Running Queries:</strong>
+                {listener.getRunningExecutions.size}
+              </li>
+            }
+          }
+          {
+            if (listener.getCompletedExecutions.nonEmpty) {
+              <li>
+                <strong>Completed Queries:</strong>
+                {listener.getCompletedExecutions.size}
+              </li>
+            }
+          }
+          {
+            if (listener.getFailedExecutions.nonEmpty) {
+              <li>
+                <strong>Failed Queries:</strong>
+                {listener.getFailedExecutions.size}
+              </li>
+            }
+          }
+        </ul>
+      </div> ++ content
+
     UIUtils.headerSparkPage("SQL", content, parent, Some(5000))

--- End diff --

then you could replace `content` here with `summary ++ content`
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148693

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -61,6 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
           details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
         }}

+    content =

--- End diff --

perhaps leave this as `summary`, but without the `++ content` at the end
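Putting the three suggestions together, the resulting shape would be roughly the following (a self-contained sketch using scala-xml; the names mirror the quoted diff, and the data is made up):

```scala
import scala.xml.{Node, NodeSeq}

val running = Seq("query-1", "query-2") // stand-in for listener.getRunningExecutions
val content: NodeSeq = <div>existing execution tables</div> // stays a val

// Build the counts block as its own `summary` value instead of mutating a var.
val summary: NodeSeq =
  <div>
    <ul class="unstyled">
      {if (running.nonEmpty) <li><strong>Running Queries:</strong> {running.size}</li>
       else NodeSeq.Empty}
    </ul>
  </div>

// ...then concatenate at the call site, e.g. headerSparkPage("SQL", summary ++ content, ...).
val page: Seq[Node] = summary ++ content
```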
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18020

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77035/
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18020

Merged build finished. Test PASSed.
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18020

**[Test build #77035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77035/testReport)** for PR 18020 at commit [`aa16ab3`](https://github.com/apache/spark/commit/aa16ab38fc0e0c80b179a5860f477c3650f64609).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117148664

--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
@@ -730,4 +726,49 @@ public void testToLong() throws IOException {
       assertFalse(negativeInput, UTF8String.fromString(negativeInput).toLong(wrapper));
     }
   }
+
+  @Test
+  public void trimsChar() {
--- End diff --

Could you split this test case into three test cases, for trim, trimLeft, and trimRight?
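A hypothetical sketch of that split, written as a ScalaTest suite for consistency with the other snippets in this thread (the actual suite is Java/JUnit, and the trim/trimLeft/trimRight overloads taking a trim-character string are assumed from the PR under review):

```scala
import org.apache.spark.unsafe.types.UTF8String
import org.scalatest.funsuite.AnyFunSuite

// One focused test per operation instead of a single combined trimsChar case.
// Expected values assume trim characters are stripped from the respective end(s).
class UTF8StringTrimCharSketch extends AnyFunSuite {
  private def utf8(s: String): UTF8String = UTF8String.fromString(s)

  test("trim with trim characters") {
    assert(utf8("xxhelloxx").trim(utf8("x")) === utf8("hello"))
  }

  test("trimLeft with trim characters") {
    assert(utf8("xxhelloxx").trimLeft(utf8("x")) === utf8("helloxx"))
  }

  test("trimRight with trim characters") {
    assert(utf8("xxhelloxx").trimRight(utf8("x")) === utf8("xxhello"))
  }
}
```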
[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide jump link...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/18015

@ajbozarth Rebuilt and renamed the variable as suggested; I added two screenshots. Thanks.
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148012

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
@@ -61,7 +61,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
         details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
       }}

-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
--- End diff --

Rebuilt and renamed the variable as suggested; I added two screenshots.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000

Thank you @viirya.
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117145159

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
    * Converts data sources filters to Parquet filter predicates.
    */
  def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
-    val dataTypeOf = getFieldMap(schema)
+    val nameTypeMap = getFieldMap(schema)
+
+    // Parquet does not allow dots in the column name because dots are used as a column path
--- End diff --

Not just for speed, but also for the amount of code that needs to change. Still, this is fine with me.
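A minimal standalone sketch of the check being discussed (the helper name `canMakeFilterOn` and the stand-in `nameTypeMap` literal are illustrative; in the real code the map comes from `getFieldMap(schema)`):

```scala
// Stand-in for getFieldMap(schema): column name -> data type.
val nameTypeMap: Map[String, String] = Map("id" -> "int", "a.b" -> "int")

// Only consider a source filter for pushdown when the referenced column exists
// in the schema and its name contains no dots, since Parquet treats a dot as a
// column-path delimiter and would wrongly resolve "a.b" as a nested field.
def canMakeFilterOn(name: String): Boolean =
  nameTypeMap.contains(name) && !name.contains(".")

assert(canMakeFilterOn("id"))
assert(!canMakeFilterOn("a.b"))
```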
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18000

Sounds ok to me.
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77032/
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821

**[Test build #77032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77032/testReport)** for PR 15821 at commit [`b4eebc2`](https://github.com/apache/spark/commit/b4eebc27e261eddb4d8b0b829245fa3c187dade1).

* This patch **fails PySpark pip packaging tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821

Merged build finished. Test FAILed.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000

Just to make sure: I don't feel strongly about either comment, @viirya. I am willing to fix them if you do. Please let me know.