[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21931 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22187 Thanks! merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21931 **[Test build #95190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22207: [SPARK-25214][SS]Fix the issue that Kafka v2 sour...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22207#discussion_r212526373 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala --- @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import scala.collection.mutable +import scala.util.Random + +import org.scalatest.time.SpanSugar._ + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter} +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +/** + * This is a basic test trait which will set up a Kafka cluster that keeps only several records in + * a topic and ages out records very quickly. This is a helper trait to test + * "failonDataLoss=false" case with missing offsets. + * + * Note: there is a hard-code 30 seconds delay (kafka.log.LogManager.InitialTaskDelayMs) to clean up + * records. Hence each class extending this trait needs to wait at least 30 seconds (or even longer + * when running on a slow Jenkins machine) before records start to be removed. To make sure a test + * does see missing offsets, you can check the earliest offset in `eventually` and make sure it's + * not 0 rather than sleeping a hard-code duration. + */ +trait KafkaMissingOffsetsTest extends SharedSQLContext { + + protected var testUtils: KafkaTestUtils = _ + + override def createSparkSession(): TestSparkSession = { +// Set maxRetries to 3 to handle NPE from `poll` when deleting a topic +new TestSparkSession(new SparkContext("local[2,3]", "test-sql-context", sparkConf)) + } + + override def beforeAll(): Unit = { +super.beforeAll() +testUtils = new KafkaTestUtils { + override def brokerConfiguration: Properties = { +val props = super.brokerConfiguration +// Try to make Kafka clean up messages as fast as possible. However, there is a hard-code +// 30 seconds delay (kafka.log.LogManager.InitialTaskDelayMs) so this test should run at +// least 30 seconds. +props.put("log.cleaner.backoff.ms", "100") +// The size of RecordBatch V2 increases to support transactional write. +props.put("log.segment.bytes", "70") +props.put("log.retention.bytes", "40") +props.put("log.retention.check.interval.ms", "100") +props.put("delete.retention.ms", "10") +props.put("log.flush.scheduler.interval.ms", "10") +props + } +} +testUtils.setup() + } + + override def afterAll(): Unit = { +if (testUtils != null) { + testUtils.teardown() + testUtils = null +} +super.afterAll() + } +} + +class KafkaDontFailOnDataLossSuite extends KafkaMissingOffsetsTest { + + import testImplicits._ + + private val topicId = new AtomicInteger(0) + + private def newTopic(): String = s"failOnDataLoss-${topicId.getAndIncrement()}" + + /** + * @param testStreamingQuery whether to test a streaming query or a batch query. + * @param writeToTable the function to write the specified [[DataFrame]] to the given table. + */ + private def verifyMissingOffsetsDontCauseDuplicatedRecords( + testStreamingQuery: Boolean)(writeToTable: (DataFrame, String) => Unit): Unit = { +val topic = newTopic() +testUtils.createTopic(topic, partitions = 1) +testUtils.sendMessages(topic, (0 until 50).map(_.toString).toArray) + +eventually(timeout(60.seconds)) { + assert( +testUtils.getEarliestOffsets(Set(topic)).head._2 > 0, +"Kafka didn't delete records after 1 minute") +} + +
[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-251...
Github user bersprockets closed the pull request at: https://github.com/apache/spark/pull/22079 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212507190 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala --- @@ -160,6 +160,23 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext { s"AddKafkaData(topics = $topics, data = $data, message = $message)" } + object WithKafkaProducer { +def apply( +topic: String, +producer: KafkaProducer[String, String])( --- End diff -- Ping on this comment. Maybe you missed this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212522664 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala --- @@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf s"AddKafkaData(topics = $topics, data = $data, message = $message)" } + object WithKafkaProducer { --- End diff -- nit: This is not creating a KafkaProducer .. as most `With***` methods. The point of this is to force synchronization of the consumer. So maybe rename it to `WithOffsetSync { ... }`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212521083 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala --- @@ -239,56 +335,74 @@ private[kafka010] case class InternalKafkaConsumer( } /** - * Get the record for the given offset if available. Otherwise it will either throw error - * (if failOnDataLoss = true), or return the next available offset within [offset, untilOffset), - * or null. + * Get the fetched record for the given offset if available. + * + * If the record is invisible (either a transaction message, or an aborted message when the + * consumer's `isolation.level` is `read_committed`), it will return a `FetchedRecord` with the + * next offset to fetch. + * + * This method also will try the best to detect data loss. If `failOnDataLoss` is true`, it will + * throw an exception when we detect an unavailable offset. If `failOnDataLoss` is `false`, this + * method will return `null` if the next available record is within [offset, untilOffset). * * @throws OffsetOutOfRangeException if `offset` is out of range * @throws TimeoutException if cannot fetch the record in `pollTimeoutMs` milliseconds. */ - private def fetchData( + private def fetchRecord( offset: Long, untilOffset: Long, pollTimeoutMs: Long, - failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = { -if (offset != nextOffsetInFetchedData || !fetchedData.hasNext()) { - // This is the first fetch, or the last pre-fetched data has been drained. - // Seek to the offset because we may call seekToBeginning or seekToEnd before this. - seek(offset) - poll(pollTimeoutMs) -} - -if (!fetchedData.hasNext()) { - // We cannot fetch anything after `poll`. Two possible cases: - // - `offset` is out of range so that Kafka returns nothing. Just throw - // `OffsetOutOfRangeException` to let the caller handle it. - // - Cannot fetch any data before timeout. TimeoutException will be thrown. - val range = getAvailableOffsetRange() - if (offset < range.earliest || offset >= range.latest) { -throw new OffsetOutOfRangeException( - Map(topicPartition -> java.lang.Long.valueOf(offset)).asJava) + failOnDataLoss: Boolean): FetchedRecord = { +if (offset != fetchedData.nextOffsetInFetchedData) { + // This is the first fetch, or the fetched data has been reset. + // Fetch records from Kafka and update `fetchedData`. + fetchData(offset, pollTimeoutMs) +} else if (!fetchedData.hasNext) { // The last pre-fetched data has been drained. + if (offset < fetchedData.offsetAfterPoll) { +// Offsets in [offset, fetchedData.offsetAfterPoll) are invisible. Return a record to ask +// the next call to start from `fetchedData.offsetAfterPoll`. +fetchedData.reset() +return fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll) } else { -throw new TimeoutException( - s"Cannot fetch record for offset $offset in $pollTimeoutMs milliseconds") +// Fetch records from Kafka and update `fetchedData`. +fetchData(offset, pollTimeoutMs) } +} + +if (!fetchedData.hasNext) { + // When we reach here, we have already tried to poll from Kafka. As `fetchedData` is still + // empty, all messages in [offset, fetchedData.offsetAfterPoll) are invisible. Return a + // record to ask the next call to start from `fetchedData.offsetAfterPoll`. + assert(offset <= fetchedData.offsetAfterPoll, +s"seek to $offset and poll but the offset was reset to ${fetchedData.offsetAfterPoll}") + fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll) } else { val record = fetchedData.next() - nextOffsetInFetchedData = record.offset + 1 // In general, Kafka uses the specified offset as the start point, and tries to fetch the next // available offset. Hence we need to handle offset mismatch. if (record.offset > offset) { +val range = getAvailableOffsetRange() +if (range.earliest <= offset) { + // `offset` is still valid but the corresponding message is invisible. We should skip it + // and jump to `record.offset`. Here we move `fetchedData` back so that the next call of + // `fetchRecord` can just return `record` directly. + fetchedData.previous() + return fetchedRecord.withRecord(null, record.offset) +} // This
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212522432 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala --- @@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf s"AddKafkaData(topics = $topics, data = $data, message = $message)" } + object WithKafkaProducer { +def apply( +topic: String, +producer: KafkaProducer[String, String])( +func: KafkaProducer[String, String] => Unit): AssertOnQuery = { --- End diff -- nit: AssertOnQuery -> StreamAction --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212504622 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala --- @@ -331,6 +331,7 @@ private[kafka010] case class KafkaMicroBatchPartitionReader( offsetRange.topicPartition, executorKafkaParams, reuseKafkaConsumer) private val rangeToRead = resolveRange(offsetRange) + --- End diff -- unnecessary --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21087 **[Test build #95194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95194/testReport)** for PR 21087 at commit [`ebd9265`](https://github.com/apache/spark/commit/ebd926530c1d8b2f515a4a233f5963eafc17e460). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r212521169 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -164,9 +165,12 @@ case class BucketSpec( numBuckets: Int, bucketColumnNames: Seq[String], sortColumnNames: Seq[String]) { - if (numBuckets <= 0 || numBuckets >= 10) { + def conf: SQLConf = SQLConf.get + + if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) { --- End diff -- We can merge this PR first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21087 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r212521145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -164,9 +165,12 @@ case class BucketSpec( numBuckets: Int, bucketColumnNames: Seq[String], sortColumnNames: Seq[String]) { - if (numBuckets <= 0 || numBuckets >= 10) { + def conf: SQLConf = SQLConf.get + + if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) { --- End diff -- Could you submit a followup PR to address this message issue? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95193/testReport)** for PR 21732 at commit [`5f95bd0`](https://github.com/apache/spark/commit/5f95bd0cf1bd308c7df55c41caef7a9f19368f5d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2518/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r212520685 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -19,25 +19,85 @@ package org.apache.spark.sql.execution.aggregate import scala.language.existentials -import org.apache.spark.sql.Encoder +import org.apache.spark.sql.{AnalysisException, Encoder} import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer -import org.apache.spark.sql.catalyst.encoders.encoderFor +import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedDeserializer} +import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateFunction, DeclarativeAggregate, TypedImperativeAggregate} import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection -import org.apache.spark.sql.catalyst.expressions.objects.Invoke +import org.apache.spark.sql.catalyst.expressions.objects.{AssertNotNull, Invoke, NewInstance, WrapOption} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.types._ import org.apache.spark.util.Utils object TypedAggregateExpression { + + // Checks if given encoder is for `Option[Product]`. + def isOptProductEncoder(encoder: ExpressionEncoder[_]): Boolean = { +// Only Option[Product] is non-flat. +encoder.clsTag.runtimeClass == classOf[Option[_]] && !encoder.flat + } + + /** + * Flattens serializers and deserializer of given encoder. We only flatten encoder + * of `Option[Product]` class. + */ + def flattenOptProductEncoder(encoder: ExpressionEncoder[_]): ExpressionEncoder[_] = { --- End diff -- I will go to add some tests against this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2517/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95192/testReport)** for PR 21732 at commit [`9fc3f61`](https://github.com/apache/spark/commit/9fc3f6165156051142a8366a32726badaaa16bb7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstra...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22203 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22079 @bersprockets Could you please close this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22203 Thanks! Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21320 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2516/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21732 @cloud-fan I made attempt to remove `topLevel` parameter. The approach is to flatten serializers and deserialzer at `TypedAggregateExpression`. So users are not aware of difference when using expression encoder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 Thanks! Merged to master. BTW, we can keep thinking whether there are other better solutions for nested column pruning. Also cc @dongjoon-hyun If you are interested in the work for supporting ORC nested column pruning. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95191/testReport)** for PR 21732 at commit [`ed3d5cb`](https://github.com/apache/spark/commit/ed3d5cb697b10af2e2cf4c78ab521d4d0b2f3c9b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22211 cc @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22206 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95187/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22206 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95188/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22206 **[Test build #95187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95187/testReport)** for PR 22206 at commit [`550368e`](https://github.com/apache/spark/commit/550368eaeebdc87f2c89bad7214f2624784eeb04). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95188/testReport)** for PR 22190 at commit [`847300f`](https://github.com/apache/spark/commit/847300f76391c5e171e3f54b21bf6f2efc177f0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21273 @koertkuipers you wanna make a PR to make it configuration? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user gerashegalov commented on the issue: https://github.com/apache/spark/pull/22213 @witgo please take a look since you worked on #2379 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...
GitHub user gerashegalov opened a pull request: https://github.com/apache/spark/pull/22213 [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values ## What changes were proposed in this pull request? Stop trimming values of properties loaded from a file ## How was this patch tested? Added unit test demonstrating the issue hit in production. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gerashegalov/spark gera/SPARK-25221 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22213 commit 874b2c82a83d94c338d8b1ef9c3b37074cc2e4cc Author: Gera Shegalov Date: 2018-08-23T19:09:05Z [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95189/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22211 **[Test build #95189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95189/testReport)** for PR 22211 at commit [`d269015`](https://github.com/apache/spark/commit/d26901516f414d16f8eb38183053d22dc3b056e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95185/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11105 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2515/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11105 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 i would suggest at least that when the quote character is changed that the empty value should change accordingly. an empty value of ```""``` makes no sense if the quote character is not ```"```. also if we could agree on a quote character that means no quotes at all then i would suggest to change empty value back to null if that particular quote character is set. because no quoted empty string makes sense if the user is trying to write out unquoted values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #95185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95185/testReport)** for PR 21320 at commit [`e6baf68`](https://github.com/apache/spark/commit/e6baf681e06e229d740af120491d1bf0f426af99). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22212: [SPARK-25220] Seperate kubernetes node selector config b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22212 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22212: [SPARK-25220] Seperate kubernetes node selector config b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22212 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22212: [SPARK-25220] Seperate kubernetes node selector config b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22212 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22201: [SPARK-25209][SQL] Avoid deserializer check in Da...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22201 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22212: [SPARK-25220] Seperate kubernetes node selector c...
GitHub user jweaver-personal opened a pull request: https://github.com/apache/spark/pull/22212 [SPARK-25220] Seperate kubernetes node selector config between driver and executors. Seperated node selector config option between executors and driver. This removes the spark.kubernetes.node.selector config option and seperates it to spark.kubernetes.driver.selector and spark.kubernetes.executor.selector to allow seperate node selectors on drivers and executors. My personal use case for this change is that on AWS we have cheap spotinstances that can terminate at any moment which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jweaver-personal/spark jweaver-dev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22212.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22212 commit a6c6371f8d0ba5e7ff55e83f7c8c955bf24096aa Author: ? Date: 2018-08-24T01:55:06Z KUBERNETES K8S Seperated node selector config option between executors and driver. This removes the spark.kubernetes.node.selector config option and seperates it to spark.kubernetes.driver.selector and spark.kubernetes.executor.selector to allow seperate node selectors on drivers and executors. My personal use case for this change is that on AWS we have cheap spotinstances that can terminate at any moment which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21931 **[Test build #95190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212504250 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType case class UserDefinedFunction protected[sql] ( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[DataType]]) { +inputTypes: Option[Seq[DataType]], +nullableTypes: Option[Seq[Boolean]] = None) { --- End diff -- +1 to all your comments. I'm overhauling this whole PR and will force push with a rebase once it seems to basically work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20611 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20611 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22195 cc @jerryshao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22195: [SPARK-25205][CORE] Fix typo in spark.network.cry...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22195 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22195 Merged to master and branch-2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22208: [SPARK-25216][SQL] Improve error message when a c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22208#discussion_r212500744 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -216,8 +216,16 @@ class Dataset[T] private[sql]( private[sql] def resolve(colName: String): NamedExpression = { queryExecution.analyzed.resolveQuoted(colName, sparkSession.sessionState.analyzer.resolver) .getOrElse { -throw new AnalysisException( - s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""") +if (schema.fieldNames.contains(colName)) { + throw new AnalysisException( +s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")}). + | Try adding backticks to the column name, i.e., `$colName`""" --- End diff -- I would explain, for instance, if the name parts in the column should be kept as the part of its column name, try to quote them by backticks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22171 cc @cloud-fan and @gatorsmile, WDYT? actually I happened to meet this and thought it's better be fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21977 Seems okay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212499322 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType case class UserDefinedFunction protected[sql] ( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[DataType]]) { +inputTypes: Option[Seq[DataType]], +nullableTypes: Option[Seq[Boolean]] = None) { --- End diff -- or `Option[Seq[ScalaReflection.Schema]]` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212499164 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala --- @@ -57,8 +59,21 @@ case class ScalaUDF( children: Seq[Expression], inputTypes: Seq[DataType], udfName: Option[String]) = { -this( - function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true) +this(function, dataType, children, inputTypes, udfName, + nullable = true, udfDeterministic = true, nullableTypes = Nil) + } + + // Constructor from Spark 2.3 --- End diff -- By convention, everything under catalyst package is private, so compatibility is not a concern here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r212499194 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -60,14 +61,20 @@ private[spark] object PythonEvalType { */ private[spark] abstract class BasePythonRunner[IN, OUT]( funcs: Seq[ChainedPythonFunctions], -bufferSize: Int, -reuseWorker: Boolean, evalType: Int, -argOffsets: Array[Array[Int]]) +argOffsets: Array[Array[Int]], +conf: SparkConf) extends Logging { require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs") + private val bufferSize = conf.getInt("spark.buffer.size", 65536) + private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true) + // each python worker gets an equal part of the allocation. the worker pool will grow to the + // number of concurrent tasks, which is determined by the number of cores in this executor. + private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY) + .map(_ / conf.getInt("spark.executor.cores", 1)) --- End diff -- > I don't think that the DataBricks style guide applies to Apache projects. I sent an email to dev mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Porting-or-explicitly-linking-project-style-in-Apache-Spark-based-on-https-github-com-databricks-scae-td24790.html I was thinking 2 indents for continuation lines are more common in the codebase and thought better follow this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...
Github user ankuriitg commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212498716 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener( val e = it.next() if (job.stageIds.contains(e.getKey()._1)) { val stage = e.getValue() - stage.status = v1.StageStatus.SKIPPED - job.skippedStages += stage.info.stageId - job.skippedTasks += stage.info.numTasks - it.remove() - update(stage, now) + // Only update the stage if it has not finished already + if (v1.StageStatus.ACTIVE.equals(stage.status) || + v1.StageStatus.PENDING.equals(stage.status)) { +stage.status = v1.StageStatus.SKIPPED +job.skippedStages += stage.info.stageId +job.skippedTasks += stage.info.numTasks +job.activeStages -= 1 +it.remove() --- End diff -- I think the assumption here is that we will always receive onStageCompleted event before onJobEvent. If that does not occur for some reason, then any active stages are marked as skipped. I don't know the scenario when onStageCompleted event is not received before onJobEnd event (or received at all). Let me look further into it. Additionally, I will also fix the bug for updating pool. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21320 Seems fine to me too for the similar reasons. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212497182 --- Diff: python/pyspark/sql/tests.py --- @@ -6394,6 +6394,17 @@ def test_invalid_args(self): df.withColumn('mean_v', mean_udf(df['v']).over(ow)) +class DataSourceV2Tests(ReusedSQLTestCase): +def test_pyspark_udf_SPARK_25213(self): +from pyspark.sql.functions import udf + +df = self.spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load() +result = df.withColumn('x', udf(lambda x: x, 'int')(df['i'])) --- End diff -- This only tests Project with Scalar PythonUDF? Might be better to also test Filter case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212496291 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy { config) val filterCondition = postScanFilters.reduceLeftOption(And) - val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + + val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) { +// add a projection before FilterExec to ensure that the rows are converted to unsafe +val filterExpr = filterCondition.get +FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan)) + } else { +filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + } // always add the projection, which will produce unsafe rows required by some operators - ProjectExec(project, withFilter) :: Nil + if (project.exists(hasScalarPythonUDF)) { +val references = project.map(_.references).reduce(_ ++ _).toSeq +ProjectExec(project, ProjectExec(references, withFilter)) :: Nil --- End diff -- Ok. Let's leave as it is now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22210 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95184/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22210 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22210 **[Test build #95184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95184/testReport)** for PR 22210 at commit [`6f2248d`](https://github.com/apache/spark/commit/6f2248d45716332ac78e44b1314011806f59deb8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95182/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212491921 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener( val e = it.next() if (job.stageIds.contains(e.getKey()._1)) { val stage = e.getValue() - stage.status = v1.StageStatus.SKIPPED - job.skippedStages += stage.info.stageId - job.skippedTasks += stage.info.numTasks - it.remove() - update(stage, now) + // Only update the stage if it has not finished already + if (v1.StageStatus.ACTIVE.equals(stage.status) || + v1.StageStatus.PENDING.equals(stage.status)) { +stage.status = v1.StageStatus.SKIPPED +job.skippedStages += stage.info.stageId +job.skippedTasks += stage.info.numTasks +job.activeStages -= 1 +it.remove() --- End diff -- Btw, there is an existing bug that we are not updating pool, etc which we do in onStageCompleted ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #95182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95182/testReport)** for PR 22192 at commit [`7c86fc5`](https://github.com/apache/spark/commit/7c86fc54c36954f1345eccc066873f7f90832657). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22203 Done, I added one example test failure to the description here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22211 **[Test build #95189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95189/testReport)** for PR 22211 at commit [`d269015`](https://github.com/apache/spark/commit/d26901516f414d16f8eb38183053d22dc3b056e3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2514/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212490964 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener( val e = it.next() if (job.stageIds.contains(e.getKey()._1)) { val stage = e.getValue() - stage.status = v1.StageStatus.SKIPPED - job.skippedStages += stage.info.stageId - job.skippedTasks += stage.info.numTasks - it.remove() - update(stage, now) + // Only update the stage if it has not finished already + if (v1.StageStatus.ACTIVE.equals(stage.status) || + v1.StageStatus.PENDING.equals(stage.status)) { +stage.status = v1.StageStatus.SKIPPED +job.skippedStages += stage.info.stageId +job.skippedTasks += stage.info.numTasks +job.activeStages -= 1 +it.remove() --- End diff -- I am not sure I follow - if that is the case, why are we doing this for active stages here ? onStageCompleted/onTaskEnd would be fired for active stages as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22209 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22209 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95183/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22209 **[Test build #95183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95183/testReport)** for PR 22209 at commit [`cd1e856`](https://github.com/apache/spark/commit/cd1e8564cd04ae6d94857608b87ace3fef975136). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95188/testReport)** for PR 22190 at commit [`847300f`](https://github.com/apache/spark/commit/847300f76391c5e171e3f54b21bf6f2efc177f0e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2513/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212489210 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy { config) val filterCondition = postScanFilters.reduceLeftOption(And) - val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + + val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) { +// add a projection before FilterExec to ensure that the rows are converted to unsafe +val filterExpr = filterCondition.get +FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan)) + } else { +filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + } // always add the projection, which will produce unsafe rows required by some operators - ProjectExec(project, withFilter) :: Nil + if (project.exists(hasScalarPythonUDF)) { +val references = project.map(_.references).reduce(_ ++ _).toSeq +ProjectExec(project, ProjectExec(references, withFilter)) :: Nil --- End diff -- That one was only added if there was a filter and if that filter ran a UDF. This will add an unnecessary project if both the filter and the project have python UDFs, but I thought that was probably okay. I can add a boolean to signal if the filter caused one to be added already if you think it's worth it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22190 Retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212488758 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy { config) val filterCondition = postScanFilters.reduceLeftOption(And) - val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + + val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) { +// add a projection before FilterExec to ensure that the rows are converted to unsafe +val filterExpr = filterCondition.get +FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan)) + } else { +filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + } // always add the projection, which will produce unsafe rows required by some operators - ProjectExec(project, withFilter) :: Nil + if (project.exists(hasScalarPythonUDF)) { +val references = project.map(_.references).reduce(_ ++ _).toSeq +ProjectExec(project, ProjectExec(references, withFilter)) :: Nil --- End diff -- nit: If we already add Project on top of Filter, we don't need to add another Project here, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212488439 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy { config) val filterCondition = postScanFilters.reduceLeftOption(And) - val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + + val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) { +// add a projection before FilterExec to ensure that the rows are converted to unsafe +val filterExpr = filterCondition.get +FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan)) + } else { +filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + } // always add the projection, which will produce unsafe rows required by some operators - ProjectExec(project, withFilter) :: Nil + if (project.exists(hasScalarPythonUDF)) { +val references = project.map(_.references).reduce(_ ++ _).toSeq +ProjectExec(project, ProjectExec(references, withFilter)) :: Nil --- End diff -- oh, I see. It is also used to make sure PythonUDF in top Project takes unsafe row input. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212488387 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur var outputData = dataset var numColsOutput = 0 if (getRawPredictionCol != "") { - val predictRawUDF = udf { (features: Any) => --- End diff -- @skonto @lrytz this might be of interest. Don't think it's a Scala issue per se but just checking if that behavior change makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22206 **[Test build #95187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95187/testReport)** for PR 22206 at commit [`550368e`](https://github.com/apache/spark/commit/550368eaeebdc87f2c89bad7214f2624784eeb04). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212488266 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy { config) val filterCondition = postScanFilters.reduceLeftOption(And) - val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + + val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) { +// add a projection before FilterExec to ensure that the rows are converted to unsafe +val filterExpr = filterCondition.get +FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan)) + } else { +filterCondition.map(FilterExec(_, scan)).getOrElse(scan) + } // always add the projection, which will produce unsafe rows required by some operators - ProjectExec(project, withFilter) :: Nil + if (project.exists(hasScalarPythonUDF)) { +val references = project.map(_.references).reduce(_ ++ _).toSeq +ProjectExec(project, ProjectExec(references, withFilter)) :: Nil --- End diff -- The v2 data sources return `InternalRow`, not `UnsafeRow`. Python UDFs can't handle `InternalRow`, so this is intended to add a projection to convert to unsafe before the projection that contains a python UDF. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22206 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2512/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org