Re: Support SqlStreaming in spark
Hi all, I have rewritten the design doc based on previous discussing. https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0 Would be interested to hear what others think. Regards, Genmao Yu -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Support SqlStreaming in spark
Hi all, I have rewritten the design doc based on previous discussing. https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0 Would be interested to hear what others think. Regards, Genmao Yu -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[GitHub] spark pull request #22575: [SPARK-24630][SS] Support SQLStreaming in Spark
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/22575#discussion_r237372804 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StreamTableDDLCommandSuite.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.hive + +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class StreamTableDDLCommandSuite extends SQLTestUtils with TestHiveSingleton { + private val catalog = spark.sessionState.catalog + + test("CTAS: create data source stream table") { +withTempPath { dir => + withTable("t") { +sql( + s"""CREATE TABLE t USING PARQUET + |OPTIONS ( + |PATH = '${dir.toURI}', + |location = '${dir.toURI}', + |isStreaming = 'true') + |AS SELECT 1 AS a, 2 AS b, 3 AS c + """.stripMargin) --- End diff -- At https://github.com/apache/spark/pull/22575/files#diff-fa4547f0c6dd7810576cd4262a2dfb46R78 the `child` logicalPlan is not streaming logicalPlan? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18099: [SPARK-18406][CORE][Backport-2.1] Race between end-of-ta...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/18099 same issue in spark 2.2.1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] kafka pull request #3897: KAFKA-5929: Save pre-assignment to file to avoid t...
GitHub user uncleGen reopened a pull request: https://github.com/apache/kafka/pull/3897 KAFKA-5929: Save pre-assignment to file to avoid too long text to display when do topic partition reassign When do partition reassign - before pr Pre-assignment will be printed directly. It is not friendly when the text is too long. - after pr Pre-assignment will still be printed directly, but will be save to a file at the same time, naming with suffix ".rollback" of "reassignment-json-file". For example: ``` ./kafka-reassign-partitions.sh --reassignment-json-file test.json ... ``` then we may get a file **test.json.rollback** You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/kafka KAFKA-5929 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3897.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3897 commit b5806c9f09459e473d6d1eb66c46a546c526d9c7 Author: uncleGen Date: 2017-09-19T07:40:09Z Save pre-assignment to file to avoid too long text to display when do topic partition reassign commit 19d3f84a237c3a685d805f0974282734fcc8e655 Author: uncleGen Date: 2017-09-19T08:23:50Z findbugs fix ---
[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...
GitHub user uncleGen reopened a pull request: https://github.com/apache/kafka/pull/3894 KAFKA-5928: Avoid redundant requests to zookeeper when reassign topic partition We mistakenly request topic level information according to partitions config in the assignment json file. For example https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/ReassignPartitionsCommand.scala#L550: ``` val validPartitions = proposedPartitionAssignment.filter { case (p, _) => validatePartition(zkUtils, p.topic, p.partition) } ``` If reassign 1000 partitions (in 10 topics), we need to request zookeeper 1000 times here. But actually we only need to request just 10 (topics) times. We test a large-scale assignment, about 10K partitions. It takes tens of minutes. After optimization, it will reduce to less than 1minute. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/kafka KAFKA-5928 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3894.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3894 commit b8ebcbe00b4b10cdda023efceed2789e39f75781 Author: uncleGen Date: 2017-09-19T03:01:20Z Avoid redundant requests to zookeeper when reassign topic partition ---
[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...
Github user uncleGen closed the pull request at: https://github.com/apache/kafka/pull/3894 ---
[GitHub] kafka pull request #3897: KAFKA-5929: Save pre-assignment to file to avoid t...
Github user uncleGen closed the pull request at: https://github.com/apache/kafka/pull/3897 ---
[GitHub] kafka pull request #3897: KAFKA-5929: Save pre-assignment to file to avoid t...
GitHub user uncleGen opened a pull request: https://github.com/apache/kafka/pull/3897 KAFKA-5929: Save pre-assignment to file to avoid too long text to display when do topic partition reassign You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/kafka KAFKA-5929 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3897.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3897 commit 9a78be366c3ad99f4418c6dab1c9701739b26c07 Author: æ¨è® Date: 2017-09-19T07:40:09Z Save pre-assignment to file to avoid too long text to display when do topic partition reassign ---
[GitHub] spark pull request #16656: [SPARK-18116][DStream] Report stream input inform...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/16656 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...
GitHub user uncleGen opened a pull request: https://github.com/apache/kafka/pull/3894 KAFKA-5928: Avoid redundant requests to zookeeper when reassign topic partition We mistakenly request topic level information according to partitions config in the assignment json file. For example https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/ReassignPartitionsCommand.scala#L550: ``` val validPartitions = proposedPartitionAssignment.filter { case (p, _) => validatePartition(zkUtils, p.topic, p.partition) } ``` If reassign 1000 partitions (in 10 topics), we need to request zookeeper 1000 times here. But actually we only need to request just 10 (topics) times. We test a large-scale assignment, about 10K partitions. It takes tens of minutes. After optimization, it will reduce to less than 1minute. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/kafka KAFKA-5928 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3894.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3894 commit f6c30e81c7110f72e254bb9dfa81a25f951b70a1 Author: æ¨è® Date: 2017-09-19T03:01:20Z Avoid redundant requests to zookeeper when reassign topic partition ---
[GitHub] spark pull request #17395: [SPARK-20065][SS][WIP] Avoid to output empty parq...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/17395 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17052 @HyukjinKwon Sorry! Busy for this period of time. Let me resolve this conflict. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17913 @zsxwing Great! Close this pr then. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/17913 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17917: [SPARK-20600][SS] KafkaRelation should be pretty ...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17917#discussion_r115659920 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaRelation.scala --- @@ -143,4 +143,6 @@ private[kafka010] class KafkaRelation( validateTopicPartitions(partitions, partitionOffsets) } } + + override def toString: String = "kafka" } --- End diff -- How about giving some more information about the kafka source? like topic, partition? refers to https://github.com/jaceklaskowski/spark/blob/2ffe4476553cfe50eb6392d8e573545a92fef737/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L140 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17913#discussion_r115659132 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala --- @@ -48,7 +48,7 @@ case class StreamingRelation(dataSource: DataSource, sourceName: String, output: * Used to link a streaming [[Source]] of data into a * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]. */ -case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode { +case class StreamingSourceRelation(source: Source, output: Seq[Attribute]) extends LeafNode { override def isStreaming: Boolean = true --- End diff -- just one renaming --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17913 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17913 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17896#discussion_r115415803 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] { } /** + * Ignore event time watermark in batch query, which is only supported in Structured Streaming. + * TODO: add this rule into analyzer rule list. + */ +object CheckEventTimeWatermark extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case EventTimeWatermark(_, _, child) if !child.isStreaming => + logWarning("EventTime watermark is only supported in Structured Streaming but found " + --- End diff -- got --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17896#discussion_r115415668 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] { } /** + * Ignore event time watermark in batch query, which is only supported in Structured Streaming. + * TODO: add this rule into analyzer rule list. + */ +object CheckEventTimeWatermark extends Rule[LogicalPlan] { --- End diff -- @zsxwing This pr does some prepare work before we add `EliminateEventTimeWatermark ` into `Analyzer.batches`. Could you please take a review? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17913#discussion_r115415483 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala --- @@ -64,8 +64,20 @@ case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) ext } } -object StreamingExecutionRelation { - def apply(source: Source): StreamingExecutionRelation = { -StreamingExecutionRelation(source, source.schema.toAttributes) +case class StreamingRelationWrapper(child: LogicalPlan) extends UnaryNode { + override def isStreaming: Boolean = true + override def output: Seq[Attribute] = child.output +} + --- End diff -- Add a new `StreamingRelationWrapper` relation to wrap the internal relation in each trigger. It keeps the `isStreaming` property. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17896 Depends upon: [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17913 [SPARK-20672][SS] Keep the `isStreaming` property in triggerLogicalPlan in Structured Streaming ## What changes were proposed in this pull request? In Structured Streaming, the "isStreaming" property will be eliminated in each triggerLogicalPlan. Then, some rules will be applied to this triggerLogicalPlan mistakely. So, we should refactor existing code to better execute batch query and ss query. ## How was this patch tested? existing ut. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-20672 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17913.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17913 commit d1c4cbf0fa369db993855ef3f63b05561cf6662a Author: uncleGen Date: 2017-05-09T06:01:51Z Keep the `streaming` property in triggerLogicalPlan in Structured Streaming commit 20648d99b1b95ea074be56708f13901bba2ee10d Author: uncleGen Date: 2017-05-09T06:18:50Z update --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17896 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17896 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17395: [SPARK-20065][SS][WIP] Avoid to output empty parquet fil...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17395 @HyukjinKwon Sorry for the long absence. I will keep online for next period of time. Please give me some time. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17896 cc @zsxwing and @tdas --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17896 [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute ## What changes were proposed in this pull request? Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan. The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way. Changes: - In this PR, we add a new rule `CheckEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore watermark in any batch query. Followups: - Add `CheckEventTimeWatermark` into analyzer rule list. We can not add this rule into analyzer directly, because streaming query will be copied to a internal batch query in every trigger, and the rule will be applied to this internal batch query mistakenly. IIUC, we should refactor related codes to better define a query is batch or streaming. Right? Others: - A typo fix in example. ## How was this patch tested? add new unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-20373 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17896.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17896 commit 563721241851751c2bb1736161febe73b8abba3b Author: uncleGen Date: 2017-05-08T03:19:35Z Ignore event time watermark in batch query. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apac...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/17463 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17463 @srowen It's hard to say it's because shutting down SparkContext is the slow part, and we can improve this case by avoiding stooping SparkContext in a separate thread. cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apac...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17463 [SPARK-20131][DStream][Test] Flaky Test: org.apache.spark.streaming.StreamingContextSuite ## What changes were proposed in this pull request? do not stop the `SparkContext` in thread. ## How was this patch tested? Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-20131 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17463.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17463 commit 89d1d35562bdb47c54464f31adeddadbe3a3ec1b Author: uncleGen Date: 2017-03-29T04:43:49Z Flaky Test: org.apache.spark.streaming.StreamingContextSuite --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17395: [SPARK-20065][SS] Avoid to output empty parquet files
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17395 Let me change this pr into WIP based on the discussion with @HyukjinKwon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17395#discussion_r107821138 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -292,7 +292,10 @@ object FileFormatWriter extends Logging { override def execute(iter: Iterator[InternalRow]): Set[String] = { var fileCounter = 0 var recordsInFile: Long = 0L - newOutputWriter(fileCounter) + // Skip the empty partition to avoid creating a mass of 'empty' files. + if (iter.hasNext) { +newOutputWriter(fileCounter) --- End diff -- Thanks for your prompt. How about just left one empty file containing the metadata when df has empty partition? Furthmore, we may just left one metadata file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17395#discussion_r107650070 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -292,7 +292,10 @@ object FileFormatWriter extends Logging { override def execute(iter: Iterator[InternalRow]): Set[String] = { var fileCounter = 0 var recordsInFile: Long = 0L - newOutputWriter(fileCounter) + // Skip the empty partition to avoid creating a mass of 'empty' files. + if (iter.hasNext) { +newOutputWriter(fileCounter) --- End diff -- @HyukjinKwon IIUC, this case should fail as expected, as there is no output. Am i missing something? ``` spark.range(100).filter("id > 100").write.parquet("/tmp/abc") spark.read.parquet("/tmp/abc").show() ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17395#discussion_r107637615 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -292,7 +292,10 @@ object FileFormatWriter extends Logging { override def execute(iter: Iterator[InternalRow]): Set[String] = { var fileCounter = 0 var recordsInFile: Long = 0L - newOutputWriter(fileCounter) + // Skip the empty partition to avoid creating a mass of 'empty' files. + if (iter.hasNext) { +newOutputWriter(fileCounter) --- End diff -- Let me see how to cover this case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16972 close it before i have a better solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/16972 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17395 [SPARK-20065][SS] Avoid to output empty parquet files ## Problem Description Reported by Silvio Fiorito I've got a Kafka topic which I'm querying, running a windowed aggregation, with a 30 second watermark, 10 second trigger, writing out to Parquet with append output mode. Every 10 second trigger generates a file, regardless of whether there was any data for that trigger, or whether any records were actually finalized by the watermark. Is this expected behavior or should it not write out these empty files? ``` val df = spark.readStream.format("kafka") val query = df .withWatermark("timestamp", "30 seconds") .groupBy(window($"timestamp", "10 seconds")) .count() .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count") query .writeStream .format("parquet") .option("checkpointLocation", aggChk) .trigger(ProcessingTime("10 seconds")) .outputMode("append") .start(aggPath) ``` As the query executes, do a file listing on "aggPath" and you'll see 339 byte files at a minimum until we arrive at the first watermark and the initial batch is finalized. Even after that though, as there are empty batches it'll keep generating empty files every trigger. ## What changes were proposed in this pull request? Check the partition is empty or not, and skip empty partition to avoid output empty file. ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-20065 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17395.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17395 commit 86a7d2fa96e3134c1e64864eba81a3bebdedceea Author: uncleGen Date: 2017-03-23T08:10:31Z avoid to output empty parquet files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17371: [SPARK-19903][PYSPARK][SS] window operator miss the `wat...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17371 @viirya Great, you give a more clear explanation. > I am thinking, should we create new expression id for the watermarking column with withWatermark? So we must write the query like: It really can fix this problem, but not very user-friendly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17052 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17371#discussion_r107101883 --- Diff: python/pyspark/sql/functions.py --- @@ -1163,7 +1163,10 @@ def check_string_field(field, fieldName): raise TypeError("%s should be provided as a string" % fieldName) sc = SparkContext._active_spark_context -time_col = _to_java_column(timeColumn) +if isinstance(timeColumn, Column): --- End diff -- @viirya Sounds reasonable, I pushed an update, take a review please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17371#discussion_r107095629 --- Diff: python/pyspark/sql/functions.py --- @@ -1163,7 +1163,10 @@ def check_string_field(field, fieldName): raise TypeError("%s should be provided as a string" % fieldName) sc = SparkContext._active_spark_context -time_col = _to_java_column(timeColumn) +if isinstance(timeColumn, Column): --- End diff -- IIUC, it is OK for current codebase. Am I missing something? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17371 [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column ## What changes were proposed in this pull request? reproduce code: ``` import sys from pyspark.sql import SparkSession from pyspark.sql.functions import explode, split, window bootstrapServers = sys.argv[1] subscribeType = sys.argv[2] topics = sys.argv[3] spark = SparkSession\ .builder\ .appName("StructuredKafkaWordCount")\ .getOrCreate() lines = spark\ .readStream\ .format("kafka")\ .option("kafka.bootstrap.servers", bootstrapServers)\ .option(subscribeType, topics)\ .load()\ .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") words = lines.select(explode(split(lines.value, ' ')).alias('word'),lines.timestamp) windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( window(words.timestamp, "30 seconds", "30 seconds"), words.word ).count() query = windowedCounts\ .writeStream\ .outputMode('append')\ .format('console')\ .option("truncate", "false")\ .start() query.awaitTermination() ``` An exception was thrown: ``` pyspark.sql.utils.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;; Aggregate [window#32, word#21], [window#32 AS window#26, word#21, count(1) AS count#31L] +- Filter ((timestamp#16 >= window#32.start) && (timestamp#16 < window#32.end)) +- Expand [ArrayBuffer(named_struct(start, ...] +- EventTimeWatermark timestamp#16: timestamp, interval 10 seconds +- Project [word#21, timestamp#16] +- Generate explode(split(value#15, )), true, false, [word#21] +- Project [cast(value#1 as string) AS value#15, cast(timestamp#5 as timestamp) AS timestamp#16] +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession ...] ``` IIUC, the root cause is: `words.withWatermark("timestamp", "30 seconds")` add the watermark metadata into time column, but in `groupBy( window(words.timestamp, "30 seconds", "30 seconds"), words.word )`, the `words.timestamp` miss the metadata. At last, it failed to pass the check: ``` if (watermarkAttributes.isEmpty) { throwError( s"$outputMode output mode not supported when there are streaming aggregations on " + s"streaming DataFrames/DataSets without watermark")(plan) } ``` after pr, run successfully. ## How was this patch tested? Jenkins Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark python-window Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17371.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17371 commit 654c5121fd26a85036787882d3d2c3b56360b686 Author: uncleGen Date: 2017-03-21T07:49:11Z bug fix: window operator miss the `watermark` metadata of time column --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17052 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17352: [SPARK-20021][PySpark] Miss backslash in python code
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17352 @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17352: Miss backslash in python code
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17352 Miss backslash in python code ## What changes were proposed in this pull request? Add backslash for line continuation in python code. ## How was this patch tested? Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark python-example-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17352.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17352 commit d0a1c9f15288a4af4b3a3e12a89aff94d7104f7a Author: uncleGen Date: 2017-03-13T02:58:23Z fix python example in doc commit 965dce3d8707cadadf59594dc88310e2224ffeef Author: uncleGen Date: 2017-03-20T02:00:06Z Merge branch 'master' into python-example-doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17216 Does this PR mix in some test file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 @srowen Could you please take a view and help to merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 ping @viirya --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17267: [SPARK-19926][PYSPARK] Make pyspark exception mor...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17267#discussion_r105827541 --- Diff: python/pyspark/sql/utils.py --- @@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace): self.stackTrace = stackTrace def __str__(self): -return repr(self.desc) +return str(self.desc) --- End diff -- based on latest commit: ``` >>> df.select("ì") Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select jdf = self._jdf.select(self._jcols(*cols)) File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File ".../spark/python/pyspark/sql/utils.py", line 75, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException : cannot resolve '`ì`' given input columns: [age, name];; 'Project ['ì] +- Relation[age#0L,name#1] json --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 Thanks @HyukjinKwonï¼you give a good catchï¼I lost that case. Thanks @viirya for your suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 IMHO, yes. And @viirya is the original author. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 @viirya Thanks for you review. cc @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17209#discussion_r105575677 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -606,6 +607,24 @@ class KafkaSourceSuite extends KafkaSourceTest { assert(query.exception.isEmpty) } + test("test to get offsets from case insensitive parameters") { --- End diff -- Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17267 Maybe @viirya can give some suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17267: [SPARK-19926][PYSPARK] Make pyspark exception mor...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17267 [SPARK-19926][PYSPARK] Make pyspark exception more readable ## What changes were proposed in this pull request? Exception in pyspark is a little difficult to read. before pr, like: ``` Traceback (most recent call last): File "", line 5, in File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start return self._sq(self._jwrite.start()) File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))\n +- Expand [ArrayBuffer(named_struct(start, CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 3000)), word#5, t#6-T3ms)], [window#17, word#5, t#6-T3ms]\n +- EventTimeWatermark t#6: timestamp, interval 30 seconds\n +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]\n+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n' ``` after pr: ``` Traceback (most recent call last): File "", line 5, in File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start return self._sq(self._jwrite.start()) File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;; Aggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L] +- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end)) +- Expand [ArrayBuffer(named_struct(start, CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 3000)), word#5, t#6-T3ms)], [window#17, word#5, t#6-T3ms] +- EventTimeWatermark t#6: timestamp, interval 30 seconds +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6] +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@5265083b,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1] ``` ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-19926 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17267.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17267 commit 273c1bc8d71915
[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17209#discussion_r105553128 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala --- @@ -128,18 +123,18 @@ private[kafka010] class KafkaSourceProvider extends DataSourceRegister .map { k => k.drop(6).toString -> parameters(k) } .toMap -val startingRelationOffsets = - caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { -case Some("earliest") => EarliestOffsetRangeLimit -case Some(json) => SpecificOffsetRangeLimit(JsonUtils.partitionOffsets(json)) -case None => EarliestOffsetRangeLimit +val startingRelationOffsets = KafkaSourceProvider.getKafkaOffsetRangeLimit( + caseInsensitiveParams, STARTING_OFFSETS_OPTION_KEY, EarliestOffsetRangeLimit) match { +case earliest @ EarliestOffsetRangeLimit => earliest --- End diff -- ð much more simple --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17257: [DOCS][SS] fix structured streaming python exampl...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17257 [DOCS][SS] fix structured streaming python example ## What changes were proposed in this pull request? - SS python example: `TypeError: 'xxx' object is not callable` - some other doc issue. ## How was this patch tested? Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark docs-ss-python Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17257 commit cd8269022e3066d07c6fc480305ccef89efd0993 Author: uncleGen Date: 2017-03-11T09:27:40Z fix structured streaming python example code --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17209#discussion_r105528025 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala --- @@ -450,10 +445,22 @@ private[kafka010] class KafkaSourceProvider extends DataSourceRegister private[kafka010] object KafkaSourceProvider { private val STRATEGY_OPTION_KEYS = Set("subscribe", "subscribepattern", "assign") - private val STARTING_OFFSETS_OPTION_KEY = "startingoffsets" - private val ENDING_OFFSETS_OPTION_KEY = "endingoffsets" + private[kafka010] val STARTING_OFFSETS_OPTION_KEY = "startingoffsets" + private[kafka010] val ENDING_OFFSETS_OPTION_KEY = "endingoffsets" private val FAIL_ON_DATA_LOSS_OPTION_KEY = "failondataloss" --- End diff -- change for unit test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17209: [SPARK-19853][SS] uppercase kafka topics fail when start...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17209 @zsxwing doneï¼but forgot to pushï¼I will update it as soon as possible when I connect to internet. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17202: [SPARK-19861][SS] watermark should not be a negative tim...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17202 cc @srowen and @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17221: [SPARK-19859][SS][Follow-up] The new watermark should ov...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17221 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17209: [SPARK-19853][SS] uppercase kafka topics fail when start...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17209 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r105075392 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,8 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +require(parsedDelay.milliseconds >= 0 && parsedDelay.months >= 0, + s"delay threshold ($delayThreshold) should not be negative.") --- End diff -- use `require(parsedDelay.milliseconds >= 0 && parsedDelay.months >= 0)` to make `delayThreshold` more reasonable and significative. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17221: [SPARK-19859][SS][Follow-up] The new watermark sh...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17221 [SPARK-19859][SS][Follow-up] The new watermark should override the old one. ## What changes were proposed in this pull request? A follow up to SPARK-19859: - extract the calculation of `delayMs` and reuse it. - update EventTimeWatermarkExec - use the correct `delayMs` in EventTimeWatermark ## How was this patch tested? Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-19859 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17221.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17221 commit 2c2c8062921e83b4c8ae19afa2a0e199cd7a4814 Author: uncleGen Date: 2017-03-09T02:26:35Z follow-up to SPARK-19859 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r105069930 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,11 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +val delayMs = { --- End diff -- sure --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17216#discussion_r105069281 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -380,7 +382,20 @@ class StreamExecution( logInfo(s"Resuming streaming query, starting with batch $batchId") currentBatchId = batchId availableOffsets = nextOffsets.toStreamProgress(sources) -offsetSeqMetadata = nextOffsets.metadata.getOrElse(OffsetSeqMetadata()) +val numShufflePartitionsFromConf = sparkSession.conf.get(SQLConf.SHUFFLE_PARTITIONS) +offsetSeqMetadata = nextOffsets + .metadata + .getOrElse(OffsetSeqMetadata(0, 0, numShufflePartitionsFromConf)) + +/* + * For backwards compatibility, if # partitions was not recorded in the offset log, then + * ensure it is non-zero. The new value is picked up from the conf. + */ --- End diff -- for inline comment with the code, use // and not /* .. */. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r105067664 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,11 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +val delayMs = { + val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31 + parsedDelay.milliseconds + parsedDelay.months * millisPerMonth +} +assert(delayMs >= 0, s"delay threshold should not be a negative time: $delayThreshold") --- End diff -- @srowen +1 to your comments. We should make it significative but not just valid. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r104930545 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,11 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +val delayMs = { + val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31 + parsedDelay.milliseconds + parsedDelay.months * millisPerMonth +} +assert(delayMs >= 0, s"delay threshold should not be a negative time: $delayThreshold") --- End diff -- Maybe you misunderstand my cases above. Those cases are invalid, i.e. the `parsedDelay` are negative. |cases|validity| ||--| |inputData.withWatermark("value", "1 month -40 days")|invalid| |inputData.withWatermark("value", "-10 seconds")|invalid| |inputData.withWatermark("value", "10 seconds")|valid| |inputData.withWatermark("value", "1 day -10 seconds")|valid| --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r104904202 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,11 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +val delayMs = { + val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31 + parsedDelay.milliseconds + parsedDelay.months * millisPerMonth +} +assert(delayMs >= 0, s"delay threshold should not be a negative time: $delayThreshold") --- End diff -- `delayThreshold: String` can not be used to assert directly, like: ``` inputData.withWatermark("value", "1 month -40 days") inputData.withWatermark("value", "-10 seconds") ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17209 [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are SpecificOffsets ## What changes were proposed in this pull request? When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. KafkaSourceProvider.scala: ``` val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { case Some("latest") => LatestOffsets case Some("earliest") => EarliestOffsets case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) case None => LatestOffsets } ``` ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-19853 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17209.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17209 commit e2a26bf8fb8554fb030e7f5bd2197befb9ed55e2 Author: uncleGen Date: 2017-03-08T11:59:17Z Uppercase Kafka topics fail when startingOffsets are SpecificOffsets --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r104899893 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,11 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +val delayMs = { + val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31 + parsedDelay.milliseconds + parsedDelay.months * millisPerMonth +} +assert(delayMs >= 0, s"delay threshold should not be a negative time: $delayThreshold") --- End diff -- @srowen Thanks for you review! > Why compute all this -- don't you just mean to assert about delayThreshold? I do mean to check the `delayThreshold`. `delayThreshold` is converted from `String` to `CalendarInterval`. `CalendarInterval` divides the `delayThreshold` into two parts, i.e. month (contain year and month) and microseconds of rest. (https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java#L86) (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/EventTimeWatermarkExec.scala#L87) > this derived value can only be negative if the input is right? Sorry, I dont get what you mean. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17202#discussion_r104888633 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -576,6 +576,8 @@ class Dataset[T] private[sql]( val parsedDelay = Option(CalendarInterval.fromString("interval " + delayThreshold)) .getOrElse(throw new AnalysisException(s"Unable to parse time delay '$delayThreshold'")) +assert(parsedDelay.microseconds >= 0, --- End diff -- set 0 means event time should be not less than max event time in last batch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17202: [SPARK-19861][SS] watermark should not be a negative tim...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17202 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17202 [SPARK-19861][SS] watermark should not be a negative time. ## What changes were proposed in this pull request? watermark should not be a negative time. ## How was this patch tested? add new unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-19861 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17202 commit dcc77eda3d88f8cd5c66b60730c0ada5ae717cc3 Author: uncleGen Date: 2017-03-08T03:28:29Z watermark should not be a negative time. commit 00bbadc063e515653af86ab85fc95833bccf727a Author: uncleGen Date: 2017-03-08T03:30:13Z update --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17052: [SPARK-19690][SS] Join a streaming DataFrame with...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17052#discussion_r104651490 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala --- @@ -57,10 +60,31 @@ trait Source { def getBatch(start: Option[Offset], end: Offset): DataFrame /** + * In a streaming query, stream relation will be cut into a series of batch relations. + * We need to mark the batch relation as streaming, i.e. data coming from a stream source, + * so we can apply those streaming strategies to it. + */ + def markAsStreaming(df: DataFrame): DataFrame = { +val markAsStreaming = df.logicalPlan transform { + case logicalRDD @ LogicalRDD(_, _, _, _, false) => +logicalRDD.dataFromStreaming = true +logicalRDD + case logicalRelation @ LogicalRelation(_, _, _, false) => +logicalRelation.dataFromStreaming = true +logicalRelation + case localRelation @ LocalRelation(_, _, false) => +localRelation.dataFromStreaming = true +localRelation +} + --- End diff -- add a new parameter `dataFromStreaming ` to the constructor of LogicalRelation, LogicalRDD and LocalRelation. `dataFromStreaming ` indicate if this relation comes from a streaming source. In a streaming query, stream relation will be cut into a series of batch relations. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104635754 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- Ahhh, got it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17141: [SPARK-19800][SS][WIP] Implement one kind of streaming s...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17141 ping @tdas and @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104630604 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- @srowen Please view the discussion here. Maybe we should keep the first sleep. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104619252 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- IMHO, we can not remove the first sleep. For example there are three blockmanager A, B, C. When we stats to remove BM-A, all blocks in BM-A will be replicated to BM-B and BM-C. We can not remove BM-B immediately or too fast, as there may be no enough time to do replication and new block info may can not be registered to master properly. So, we should instead give a little more time to sleep just like my fist fix. But it is OK to remove the second sleep. @kayousterhout Tell me if i was missing something. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 cc @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104570057 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) +} // there should only be one common block manager between initial and new locations --- End diff -- continually check a condition and then timeout after 5 seconds --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpoin...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17167#discussion_r104310809 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala --- @@ -152,11 +152,9 @@ trait DStreamCheckpointTester { self: SparkFunSuite => stopSparkContext: Boolean ): Seq[Seq[V]] = { try { - val batchDuration = ssc.graph.batchDuration --- End diff -- @srowen Yes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 cc @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17167 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17167 cc @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16656: [SPARK-18116][DStream] Report stream input information a...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16656 ping @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpoin...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17167 [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string. ## What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/ ``` org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds. Last failure message: 8 did not equal 2. ``` the check condition is: ``` val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter { _.toString.contains(clock.getTimeMillis.toString) } // Checkpoint files are written twice for every batch interval. So assert that both // are written to make sure that both of them have been written. assert(checkpointFilesOfLatestTime.size === 2) ``` the path string may contain the `clock.getTimeMillis.toString`, like: ``` file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500 ------ ``` so we should only check the filename, but not the while path. ## How was this patch tested? Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark flaky-CheckpointSuite Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17167.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17167 commit 72f1963a36f9f1abfe8ca10d30b01f52c2281d82 Author: uncleGen Date: 2017-03-03T10:11:52Z flaky CheckpointSuite test failure --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 @kayousterhout sure, I was being doing that flaky test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query r...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17145#discussion_r104304108 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala --- @@ -312,13 +312,23 @@ object QueryTest { sparkAnswer: Seq[Row], isSorted: Boolean = false): Option[String] = { if (prepareAnswer(expectedAnswer, isSorted) != prepareAnswer(sparkAnswer, isSorted)) { + val getRowType: Option[Row] => String = row => +"RowType" + row.map(row => --- End diff -- @hvanhovell After use `schema.catalogString` ``` !== Correct Answer - 1 == == Spark Answer - 1 == !struct<_1:string,_2:string> struct<_1:int,_2:string> ![1,a] [1,a] ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query r...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17145#discussion_r104157326 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala --- @@ -312,13 +312,23 @@ object QueryTest { sparkAnswer: Seq[Row], isSorted: Boolean = false): Option[String] = { if (prepareAnswer(expectedAnswer, isSorted) != prepareAnswer(sparkAnswer, isSorted)) { + val getRowType: Option[Row] => String = row => +"RowType" + row.map(row => --- End diff -- OK, Iet me have a test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 cc @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 cc @kayousterhout --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 test crash. retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 unrelated failure: ` org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false`. retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/14731 @srowen Waiting for your final OK --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 one more flaky test? `org.apache.spark.streaming.CheckpointSuite.recovery with map and reduceByKey operations` I will check it later. retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17080: [SPARK-19739][CORE] propagate S3 session token to cluser
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17080 cc @vanzin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org