[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21931
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...

2018-08-23 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22187
  
Thanks! merging to master.


---




[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21931
  
**[Test build #95190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22207: [SPARK-25214][SS]Fix the issue that Kafka v2 sour...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22207#discussion_r212526373
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala ---
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.util.Properties
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.mutable
+import scala.util.Random
+
+import org.scalatest.time.SpanSugar._
+
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter}
+import org.apache.spark.sql.streaming.{StreamTest, Trigger}
+import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession}
+
+/**
+ * This is a basic test trait which will set up a Kafka cluster that keeps only several records
+ * in a topic and ages out records very quickly. This is a helper trait to test the
+ * "failOnDataLoss=false" case with missing offsets.
+ *
+ * Note: there is a hard-coded 30-second delay (kafka.log.LogManager.InitialTaskDelayMs) before
+ * records are cleaned up. Hence each class extending this trait needs to wait at least
+ * 30 seconds (or even longer when running on a slow Jenkins machine) before records start to be
+ * removed. To make sure a test does see missing offsets, check the earliest offset in
+ * `eventually` and make sure it's not 0, rather than sleeping for a hard-coded duration.
+ */
+trait KafkaMissingOffsetsTest extends SharedSQLContext {
+
+  protected var testUtils: KafkaTestUtils = _
+
+  override def createSparkSession(): TestSparkSession = {
+    // Set maxRetries to 3 to handle NPE from `poll` when deleting a topic
+    new TestSparkSession(new SparkContext("local[2,3]", "test-sql-context", sparkConf))
+  }
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    testUtils = new KafkaTestUtils {
+      override def brokerConfiguration: Properties = {
+        val props = super.brokerConfiguration
+        // Try to make Kafka clean up messages as fast as possible. However, there is a
+        // hard-coded 30-second delay (kafka.log.LogManager.InitialTaskDelayMs), so this test
+        // should run for at least 30 seconds.
+        props.put("log.cleaner.backoff.ms", "100")
+        // The size of RecordBatch V2 increases to support transactional write.
+        props.put("log.segment.bytes", "70")
+        props.put("log.retention.bytes", "40")
+        props.put("log.retention.check.interval.ms", "100")
+        props.put("delete.retention.ms", "10")
+        props.put("log.flush.scheduler.interval.ms", "10")
+        props
+      }
+    }
+    testUtils.setup()
+  }
+
+  override def afterAll(): Unit = {
+    if (testUtils != null) {
+      testUtils.teardown()
+      testUtils = null
+    }
+    super.afterAll()
+  }
+}
+
+class KafkaDontFailOnDataLossSuite extends KafkaMissingOffsetsTest {
+
+  import testImplicits._
+
+  private val topicId = new AtomicInteger(0)
+
+  private def newTopic(): String = s"failOnDataLoss-${topicId.getAndIncrement()}"
+
+  /**
+   * @param testStreamingQuery whether to test a streaming query or a batch query.
+   * @param writeToTable the function to write the specified [[DataFrame]] to the given table.
+   */
+  private def verifyMissingOffsetsDontCauseDuplicatedRecords(
+      testStreamingQuery: Boolean)(writeToTable: (DataFrame, String) => Unit): Unit = {
+    val topic = newTopic()
+    testUtils.createTopic(topic, partitions = 1)
+    testUtils.sendMessages(topic, (0 until 50).map(_.toString).toArray)
+
+    eventually(timeout(60.seconds)) {
+      assert(
+        testUtils.getEarliestOffsets(Set(topic)).head._2 > 0,
+        "Kafka didn't delete records after 1 minute")
+    }
+

[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-251...

2018-08-23 Thread bersprockets
Github user bersprockets closed the pull request at:

https://github.com/apache/spark/pull/22079


---




[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22042#discussion_r212507190
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -160,6 +160,23 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext {
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
 
+  object WithKafkaProducer {
+    def apply(
+        topic: String,
+        producer: KafkaProducer[String, String])(
--- End diff --

Ping on this comment. Maybe you missed this?


---




[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22042#discussion_r212522664
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
 
+  object WithKafkaProducer {
--- End diff --

nit: This is not creating a KafkaProducer, unlike most `With***` methods. The point of this is to force synchronization of the consumer. So maybe rename it to `WithOffsetSync { ... }`?
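
For illustration, a minimal sketch of what a `WithOffsetSync`-style helper could look like inside `KafkaSourceTest` (the use of StreamTest's `Execute` action and this exact signature are assumptions, not the final API):

```scala
import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical sketch: run `func` against the producer, then flush so every
// buffered send is acknowledged before the next stream action reads the topic.
// The helper's job is offset synchronization, not producer creation. `topic`
// is kept only to mirror the signature under review.
object WithOffsetSync {
  def apply(
      topic: String,
      producer: KafkaProducer[String, String])(
      func: KafkaProducer[String, String] => Unit): StreamAction = {
    Execute { _ =>
      func(producer)
      producer.flush() // blocks until all in-flight records are committed
    }
  }
}
```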


---




[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22042#discussion_r212521083
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ---
@@ -239,56 +335,74 @@ private[kafka010] case class InternalKafkaConsumer(
   }
 
   /**
-   * Get the record for the given offset if available. Otherwise it will either throw error
-   * (if failOnDataLoss = true), or return the next available offset within [offset, untilOffset),
-   * or null.
+   * Get the fetched record for the given offset if available.
+   *
+   * If the record is invisible (either a transaction message, or an aborted message when the
+   * consumer's `isolation.level` is `read_committed`), it will return a `FetchedRecord` with the
+   * next offset to fetch.
+   *
+   * This method will also try its best to detect data loss. If `failOnDataLoss` is `true`, it
+   * will throw an exception when we detect an unavailable offset. If `failOnDataLoss` is `false`,
+   * this method will return `null` if the next available record is within [offset, untilOffset).
    *
    * @throws OffsetOutOfRangeException if `offset` is out of range
    * @throws TimeoutException if cannot fetch the record in `pollTimeoutMs` milliseconds.
    */
-  private def fetchData(
+  private def fetchRecord(
       offset: Long,
       untilOffset: Long,
       pollTimeoutMs: Long,
-      failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
-    if (offset != nextOffsetInFetchedData || !fetchedData.hasNext()) {
-      // This is the first fetch, or the last pre-fetched data has been drained.
-      // Seek to the offset because we may call seekToBeginning or seekToEnd before this.
-      seek(offset)
-      poll(pollTimeoutMs)
-    }
-
-    if (!fetchedData.hasNext()) {
-      // We cannot fetch anything after `poll`. Two possible cases:
-      // - `offset` is out of range so that Kafka returns nothing. Just throw
-      // `OffsetOutOfRangeException` to let the caller handle it.
-      // - Cannot fetch any data before timeout. TimeoutException will be thrown.
-      val range = getAvailableOffsetRange()
-      if (offset < range.earliest || offset >= range.latest) {
-        throw new OffsetOutOfRangeException(
-          Map(topicPartition -> java.lang.Long.valueOf(offset)).asJava)
+      failOnDataLoss: Boolean): FetchedRecord = {
+    if (offset != fetchedData.nextOffsetInFetchedData) {
+      // This is the first fetch, or the fetched data has been reset.
+      // Fetch records from Kafka and update `fetchedData`.
+      fetchData(offset, pollTimeoutMs)
+    } else if (!fetchedData.hasNext) { // The last pre-fetched data has been drained.
+      if (offset < fetchedData.offsetAfterPoll) {
+        // Offsets in [offset, fetchedData.offsetAfterPoll) are invisible. Return a record to ask
+        // the next call to start from `fetchedData.offsetAfterPoll`.
+        fetchedData.reset()
+        return fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll)
       } else {
-        throw new TimeoutException(
-          s"Cannot fetch record for offset $offset in $pollTimeoutMs milliseconds")
+        // Fetch records from Kafka and update `fetchedData`.
+        fetchData(offset, pollTimeoutMs)
       }
+    }
+
+    if (!fetchedData.hasNext) {
+      // When we reach here, we have already tried to poll from Kafka. As `fetchedData` is still
+      // empty, all messages in [offset, fetchedData.offsetAfterPoll) are invisible. Return a
+      // record to ask the next call to start from `fetchedData.offsetAfterPoll`.
+      assert(offset <= fetchedData.offsetAfterPoll,
+        s"seek to $offset and poll but the offset was reset to ${fetchedData.offsetAfterPoll}")
+      fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll)
     } else {
       val record = fetchedData.next()
-      nextOffsetInFetchedData = record.offset + 1
       // In general, Kafka uses the specified offset as the start point, and tries to fetch the next
       // available offset. Hence we need to handle offset mismatch.
       if (record.offset > offset) {
+        val range = getAvailableOffsetRange()
+        if (range.earliest <= offset) {
+          // `offset` is still valid but the corresponding message is invisible. We should skip it
+          // and jump to `record.offset`. Here we move `fetchedData` back so that the next call of
+          // `fetchRecord` can just return `record` directly.
+          fetchedData.previous()
+          return fetchedRecord.withRecord(null, record.offset)
+        }
+        // This

[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22042#discussion_r212522432
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
 
+  object WithKafkaProducer {
+    def apply(
+        topic: String,
+        producer: KafkaProducer[String, String])(
+        func: KafkaProducer[String, String] => Unit): AssertOnQuery = {
--- End diff --

nit: AssertOnQuery -> StreamAction


---




[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...

2018-08-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22042#discussion_r212504622
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala ---
@@ -331,6 +331,7 @@ private[kafka010] case class KafkaMicroBatchPartitionReader(
     offsetRange.topicPartition, executorKafkaParams, reuseKafkaConsumer)
 
   private val rangeToRead = resolveRange(offsetRange)
+
--- End diff --

unnecessary 


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21087
  
**[Test build #95194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95194/testReport)** for PR 21087 at commit [`ebd9265`](https://github.com/apache/spark/commit/ebd926530c1d8b2f515a4a233f5963eafc17e460).


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r212521169
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -164,9 +165,12 @@ case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
     sortColumnNames: Seq[String]) {
-  if (numBuckets <= 0 || numBuckets >= 100000) {
+  def conf: SQLConf = SQLConf.get
+
+  if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
--- End diff --

We can merge this PR first. 


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21087
  
retest this please


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r212521145
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -164,9 +165,12 @@ case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
     sortColumnNames: Seq[String]) {
-  if (numBuckets <= 0 || numBuckets >= 100000) {
+  def conf: SQLConf = SQLConf.get
+
+  if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
--- End diff --

Could you submit a followup PR to address this message issue?
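
For reference, a hypothetical sketch of what the follow-up message fix might look like (the wording and the config key `spark.sql.sources.bucketing.maxBuckets` are assumptions for illustration):

```scala
// Sketch: mention the configurable limit instead of a hard-coded bound.
if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
  throw new AnalysisException(
    s"Number of buckets should be greater than 0 but less than or equal to " +
      s"${conf.bucketingMaxBuckets} (configurable via " +
      s"'spark.sql.sources.bucketing.maxBuckets'). Got `$numBuckets`")
}
```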


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21732
  
**[Test build #95193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95193/testReport)** for PR 21732 at commit [`5f95bd0`](https://github.com/apache/spark/commit/5f95bd0cf1bd308c7df55c41caef7a9f19368f5d).


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2518/
Test PASSed.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-08-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r212520685
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala ---
@@ -19,25 +19,85 @@ package org.apache.spark.sql.execution.aggregate
 
 import scala.language.existentials
 
-import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.{AnalysisException, Encoder}
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer
-import org.apache.spark.sql.catalyst.encoders.encoderFor
+import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedDeserializer}
+import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateFunction, DeclarativeAggregate, TypedImperativeAggregate}
 import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection
-import org.apache.spark.sql.catalyst.expressions.objects.Invoke
+import org.apache.spark.sql.catalyst.expressions.objects.{AssertNotNull, Invoke, NewInstance, WrapOption}
 import org.apache.spark.sql.expressions.Aggregator
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
 object TypedAggregateExpression {
+
+  // Checks if given encoder is for `Option[Product]`.
+  def isOptProductEncoder(encoder: ExpressionEncoder[_]): Boolean = {
+    // Only Option[Product] is non-flat.
+    encoder.clsTag.runtimeClass == classOf[Option[_]] && !encoder.flat
+  }
+
+  /**
+   * Flattens serializers and deserializer of given encoder. We only flatten encoder
+   * of `Option[Product]` class.
+   */
+  def flattenOptProductEncoder(encoder: ExpressionEncoder[_]): ExpressionEncoder[_] = {
--- End diff --

I will add some tests for this.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2517/
Test PASSed.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21732
  
**[Test build #95192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95192/testReport)** for PR 21732 at commit [`9fc3f61`](https://github.com/apache/spark/commit/9fc3f6165156051142a8366a32726badaaa16bb7).


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstra...

2018-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22203


---




[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22079
  
@bersprockets Could you please close this PR?


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22203
  
Thanks! Merged to master


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21320


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2516/
Test PASSed.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Build finished. Test PASSed.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21732
  
@cloud-fan I made an attempt to remove the `topLevel` parameter. The approach is to flatten serializers and the deserializer at `TypedAggregateExpression`, so users are not aware of any difference when using an expression encoder.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
Thanks! Merged to master. 

BTW, we can keep thinking about whether there are better solutions for nested column pruning.

Also cc @dongjoon-hyun, if you are interested in the work of supporting ORC nested column pruning.


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21732
  
**[Test build #95191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95191/testReport)** for PR 21732 at commit [`ed3d5cb`](https://github.com/apache/spark/commit/ed3d5cb697b10af2e2cf4c78ab521d4d0b2f3c9b).


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22211
  
cc @jiangxb1987 


---




[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22206
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95187/
Test PASSed.


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22190
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22206
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22190
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95188/
Test PASSed.


---




[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22206
  
**[Test build #95187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95187/testReport)** for PR 22206 at commit [`550368e`](https://github.com/apache/spark/commit/550368eaeebdc87f2c89bad7214f2624784eeb04).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22190
  
**[Test build #95188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95188/testReport)** for PR 22190 at commit [`847300f`](https://github.com/apache/spark/commit/847300f76391c5e171e3f54b21bf6f2efc177f0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21273
  
@koertkuipers you wanna make a PR to make it configurable?


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-08-23 Thread gerashegalov
Github user gerashegalov commented on the issue:

https://github.com/apache/spark/pull/22213
  
@witgo please take a look since you worked on #2379 


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22213
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22213
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22213
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...

2018-08-23 Thread gerashegalov
GitHub user gerashegalov opened a pull request:

https://github.com/apache/spark/pull/22213

[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values

## What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file

## How was this patch tested?

Added unit test demonstrating the issue hit in production.
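
For context, a minimal sketch of the behavior being fixed (the property name is illustrative): `java.util.Properties` itself preserves trailing whitespace in values, so trimming on load silently changes configs whose values end in significant whitespace.

```scala
import java.io.StringReader
import java.util.Properties

// A config value whose trailing space is significant, e.g. a separator string.
val props = new Properties()
props.load(new StringReader("spark.myapp.separator=, \n"))

val raw = props.getProperty("spark.myapp.separator") // ", " -- space preserved
val trimmed = raw.trim                               // ","  -- what trimming on load kept
assert(raw != trimmed) // trimming loses the trailing space
```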

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gerashegalov/spark gera/SPARK-25221

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22213


commit 874b2c82a83d94c338d8b1ef9c3b37074cc2e4cc
Author: Gera Shegalov 
Date:   2018-08-23T19:09:05Z

[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values




---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22211
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95189/
Test PASSed.


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22211
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22211
  
**[Test build #95189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95189/testReport)** for PR 22211 at commit [`d269015`](https://github.com/apache/spark/commit/d26901516f414d16f8eb38183053d22dc3b056e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95185/
Test PASSed.


---




[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11105
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2515/
Test PASSed.


---




[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11105
  
Build finished. Test PASSed.


---




[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-08-23 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/21273
  
i would suggest at least that when the quote character is changed, the empty value should change accordingly. an empty value of `""` makes no sense if the quote character is not `"`.

also, if we could agree on a quote character that means no quotes at all, then i would suggest changing the empty value back to null when that particular quote character is set, because a quoted empty string makes no sense if the user is trying to write out unquoted values.
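
For concreteness, a sketch assuming the `emptyValue` write option discussed in this PR (option values here are illustrative):

```scala
// With the quote character changed to ', writing empty strings as the default
// "" is inconsistent; the empty value should follow the configured quote char.
val df = spark.range(1).selectExpr("'' AS s") // assumes an active SparkSession
df.write
  .option("quote", "'")       // custom quote character
  .option("emptyValue", "''") // match the quote character, not the default ""
  .csv("/tmp/csv-out")
```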


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21320
  
**[Test build #95185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95185/testReport)** for PR 21320 at commit [`e6baf68`](https://github.com/apache/spark/commit/e6baf681e06e229d740af120491d1bf0f426af99).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22212: [SPARK-25220] Separate kubernetes node selector config b...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22212
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22212: [SPARK-25220] Separate kubernetes node selector config b...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22212
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22212: [SPARK-25220] Separate kubernetes node selector config b...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22212
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22201: [SPARK-25209][SQL] Avoid deserializer check in Da...

2018-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22201


---




[GitHub] spark pull request #22212: [SPARK-25220] Separate kubernetes node selector c...

2018-08-23 Thread jweaver-personal
GitHub user jweaver-personal opened a pull request:

https://github.com/apache/spark/pull/22212

[SPARK-25220] Separate kubernetes node selector config between driver and executors.

Separated the node selector config option between executors and driver.

This removes the spark.kubernetes.node.selector config option and separates it into
spark.kubernetes.driver.selector and spark.kubernetes.executor.selector

to allow separate node selectors on drivers and executors.

My personal use case for this change is that on AWS we have cheap spot instances that can terminate at any moment, which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes.
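
For example, assuming the new prefixes accept a node label key the way `spark.kubernetes.node.selector.[labelKey]` does today (the `lifecycle` label below is illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: pin the driver to on-demand nodes while executors may use spot nodes.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.selector.lifecycle", "on-demand")
  .set("spark.kubernetes.executor.selector.lifecycle", "spot")
```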


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jweaver-personal/spark jweaver-dev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22212.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22212


commit a6c6371f8d0ba5e7ff55e83f7c8c955bf24096aa
Author: ? 
Date:   2018-08-24T01:55:06Z

KUBERNETES K8S
Separated the node selector config option between executors and driver.

This removes the spark.kubernetes.node.selector config option and separates it into
spark.kubernetes.driver.selector and spark.kubernetes.executor.selector

to allow separate node selectors on drivers and executors.

My personal use case for this change is that on AWS we have cheap spot instances that can terminate at any moment, which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes.




---




[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21931
  
**[Test build #95190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c).


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212504250
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

+1 to all your comments. I'm overhauling this whole PR and will force push 
with a rebase once it seems to basically work.


---




[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...

2018-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20611


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20611
  
Merged to master.


---




[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22195
  
cc @jerryshao 


---




[GitHub] spark pull request #22195: [SPARK-25205][CORE] Fix typo in spark.network.cry...

2018-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22195


---




[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22195
  
Merged to master and branch-2.3.


---




[GitHub] spark pull request #22208: [SPARK-25216][SQL] Improve error message when a c...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22208#discussion_r212500744
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -216,8 +216,16 @@ class Dataset[T] private[sql](
   private[sql] def resolve(colName: String): NamedExpression = {
     queryExecution.analyzed.resolveQuoted(colName, sparkSession.sessionState.analyzer.resolver)
       .getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""")
+        if (schema.fieldNames.contains(colName)) {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")}).
+               | Try adding backticks to the column name, i.e., `$colName`"""
--- End diff --

I would explain it, for instance: if the name parts in the column should be kept as part of the column name, try quoting them with backticks.
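
For instance, a short sketch of the failure mode this message targets:

```scala
import spark.implicits._ // assumes an active SparkSession

val df = Seq((1, 2)).toDF("a.b", "c")
df.select("a.b")   // fails: "a.b" is resolved as field `b` of a column named `a`
df.select("`a.b`") // works: backticks keep the dot as part of the column name
```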


---




[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22171
  
cc @cloud-fan and @gatorsmile, WDYT? Actually I happened to run into this and thought it'd better be fixed.


---




[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21977
  
Seems okay.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212499322
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

or `Option[Seq[ScalaReflection.Schema]]`
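
i.e., carrying nullability alongside each type in a single field; a sketch (the class and field names are illustrative, not the final signature):

```scala
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.DataType

// ScalaReflection.Schema bundles a DataType with its nullability, so a single
// field can replace the parallel inputTypes/nullableTypes options.
case class UserDefinedFunctionSketch(
    f: AnyRef,
    dataType: DataType,
    inputSchemas: Option[Seq[ScalaReflection.Schema]])
```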


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212499164
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
       children: Seq[Expression],
       inputTypes: Seq[DataType],
       udfName: Option[String]) = {
-    this(
-      function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true)
+    this(function, dataType, children, inputTypes, udfName,
+      nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

By convention, everything under catalyst package is private, so 
compatibility is not a concern here.


---




[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r212499194
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
  */
 private[spark] abstract class BasePythonRunner[IN, OUT](
     funcs: Seq[ChainedPythonFunctions],
-    bufferSize: Int,
-    reuseWorker: Boolean,
     evalType: Int,
-    argOffsets: Array[Array[Int]])
+    argOffsets: Array[Array[Int]],
+    conf: SparkConf)
   extends Logging {
 
   require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
 
+  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
+  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
+  // each python worker gets an equal part of the allocation. the worker pool will grow to the
+  // number of concurrent tasks, which is determined by the number of cores in this executor.
+  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+    .map(_ / conf.getInt("spark.executor.cores", 1))
--- End diff --

> I don't think that the DataBricks style guide applies to Apache projects.

I sent an email to the dev mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Porting-or-explicitly-linking-project-style-in-Apache-Spark-based-on-https-github-com-databricks-scae-td24790.html

I was thinking 2-space indents for continuation lines are more common in the codebase and thought it better to follow that.


---




[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread ankuriitg
Github user ankuriitg commented on a diff in the pull request:

https://github.com/apache/spark/pull/22209#discussion_r212498716
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener(
         val e = it.next()
         if (job.stageIds.contains(e.getKey()._1)) {
           val stage = e.getValue()
-          stage.status = v1.StageStatus.SKIPPED
-          job.skippedStages += stage.info.stageId
-          job.skippedTasks += stage.info.numTasks
-          it.remove()
-          update(stage, now)
+          // Only update the stage if it has not finished already
+          if (v1.StageStatus.ACTIVE.equals(stage.status) ||
+              v1.StageStatus.PENDING.equals(stage.status)) {
+            stage.status = v1.StageStatus.SKIPPED
+            job.skippedStages += stage.info.stageId
+            job.skippedTasks += stage.info.numTasks
+            job.activeStages -= 1
+            it.remove()
--- End diff --

I think the assumption here is that we will always receive the onStageCompleted event before the onJobEnd event. If that does not occur for some reason, then any active stages are marked as skipped.
I don't know of a scenario where the onStageCompleted event is not received before the onJobEnd event (or not received at all). Let me look further into it. Additionally, I will also fix the bug for updating the pool.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21320
  
Seems fine to me too for the similar reasons.


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212497182
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -6394,6 +6394,17 @@ def test_invalid_args(self):
         df.withColumn('mean_v', mean_udf(df['v']).over(ow))
 
 
+class DataSourceV2Tests(ReusedSQLTestCase):
+    def test_pyspark_udf_SPARK_25213(self):
+        from pyspark.sql.functions import udf
+
+        df = self.spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
+        result = df.withColumn('x', udf(lambda x: x, 'int')(df['i']))
--- End diff --

This only tests Project with a scalar PythonUDF? It might be better to also test the Filter case.


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212496291
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
         config)
 
       val filterCondition = postScanFilters.reduceLeftOption(And)
-      val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+
+      val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+        // add a projection before FilterExec to ensure that the rows are converted to unsafe
+        val filterExpr = filterCondition.get
+        FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
+      } else {
+        filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+      }
 
       // always add the projection, which will produce unsafe rows required by some operators
-      ProjectExec(project, withFilter) :: Nil
+      if (project.exists(hasScalarPythonUDF)) {
+        val references = project.map(_.references).reduce(_ ++ _).toSeq
+        ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

Ok. Let's leave it as it is for now.


---




[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22210
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95184/
Test PASSed.


---




[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22210
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22210
  
**[Test build #95184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95184/testReport)** for PR 22210 at commit [`6f2248d`](https://github.com/apache/spark/commit/6f2248d45716332ac78e44b1314011806f59deb8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95182/
Test PASSed.


---




[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/22209#discussion_r212491921
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener(
         val e = it.next()
         if (job.stageIds.contains(e.getKey()._1)) {
           val stage = e.getValue()
-          stage.status = v1.StageStatus.SKIPPED
-          job.skippedStages += stage.info.stageId
-          job.skippedTasks += stage.info.numTasks
-          it.remove()
-          update(stage, now)
+          // Only update the stage if it has not finished already
+          if (v1.StageStatus.ACTIVE.equals(stage.status) ||
+              v1.StageStatus.PENDING.equals(stage.status)) {
+            stage.status = v1.StageStatus.SKIPPED
+            job.skippedStages += stage.info.stageId
+            job.skippedTasks += stage.info.numTasks
+            job.activeStages -= 1
+            it.remove()
--- End diff --

Btw, there is an existing bug here: we are not updating the pool, etc., the way we do in onStageCompleted ...
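
(For context, a rough sketch of the pool cleanup being referred to; the field and method names below are illustrative from memory, not verified against AppStatusListener:)

```scala
// Illustrative only: when a stage is marked SKIPPED, mirror the
// scheduler-pool bookkeeping that onStageCompleted performs.
pools.get(stage.schedulingPool).foreach { pool =>
  pool.stageIds = pool.stageIds - stage.info.stageId  // hypothetical fields
  update(pool, now)
}
```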


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95182/testReport)** for PR 22192 at commit [`7c86fc5`](https://github.com/apache/spark/commit/7c86fc54c36954f1345eccc066873f7f90832657).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22203
  
Done, I added one example test failure to the description here.


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22211
  
**[Test build #95189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95189/testReport)** for PR 22211 at commit [`d269015`](https://github.com/apache/spark/commit/d26901516f414d16f8eb38183053d22dc3b056e3).


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22211
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2514/
Test PASSed.


---




[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22211
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/22209#discussion_r212490964
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener(
         val e = it.next()
         if (job.stageIds.contains(e.getKey()._1)) {
           val stage = e.getValue()
-          stage.status = v1.StageStatus.SKIPPED
-          job.skippedStages += stage.info.stageId
-          job.skippedTasks += stage.info.numTasks
-          it.remove()
-          update(stage, now)
+          // Only update the stage if it has not finished already
+          if (v1.StageStatus.ACTIVE.equals(stage.status) ||
+              v1.StageStatus.PENDING.equals(stage.status)) {
+            stage.status = v1.StageStatus.SKIPPED
+            job.skippedStages += stage.info.stageId
+            job.skippedTasks += stage.info.numTasks
+            job.activeStages -= 1
+            it.remove()
--- End diff --

I am not sure I follow - if that is the case, why are we doing this for active stages here? onStageCompleted/onTaskEnd would be fired for active stages as well.



---




[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22209
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22209
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95183/
Test FAILed.


---




[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the issue where stage page agg...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22209
  
**[Test build #95183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95183/testReport)** for PR 22209 at commit [`cd1e856`](https://github.com/apache/spark/commit/cd1e8564cd04ae6d94857608b87ace3fef975136).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22190
  
**[Test build #95188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95188/testReport)** for PR 22190 at commit [`847300f`](https://github.com/apache/spark/commit/847300f76391c5e171e3f54b21bf6f2efc177f0e).


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22190
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22190
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2513/
Test PASSed.


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212489210
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
         config)
 
       val filterCondition = postScanFilters.reduceLeftOption(And)
-      val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+
+      val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+        // add a projection before FilterExec to ensure that the rows are converted to unsafe
+        val filterExpr = filterCondition.get
+        FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
+      } else {
+        filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+      }
 
       // always add the projection, which will produce unsafe rows required by some operators
-      ProjectExec(project, withFilter) :: Nil
+      if (project.exists(hasScalarPythonUDF)) {
+        val references = project.map(_.references).reduce(_ ++ _).toSeq
+        ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

That one was only added if there was a filter and that filter ran a UDF. This will add an unnecessary project if both the filter and the project have Python UDFs, but I thought that was probably okay. I can add a boolean to signal whether the filter already caused one to be added, if you think it's worth it.
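
A minimal sketch of that boolean variant, reusing the locals from the diff above (`filterCondition`, `hasScalarPythonUDF`, `project`, `withFilter`); this is hypothetical, not what the PR currently does:

```scala
// Track whether the filter branch already inserted an unsafe-producing
// ProjectExec, and only add the inner projection here when it did not.
// Assumes the filter's projection kept every attribute the top project needs.
val filterAddedUnsafeProject = filterCondition.exists(hasScalarPythonUDF)
if (project.exists(hasScalarPythonUDF) && !filterAddedUnsafeProject) {
  val references = project.map(_.references).reduce(_ ++ _).toSeq
  ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
} else {
  ProjectExec(project, withFilter) :: Nil
}
```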


---




[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.

2018-08-23 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/22190
  
Retest this please


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212488758
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
         config)
 
       val filterCondition = postScanFilters.reduceLeftOption(And)
-      val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+
+      val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+        // add a projection before FilterExec to ensure that the rows are converted to unsafe
+        val filterExpr = filterCondition.get
+        FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
+      } else {
+        filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+      }
 
       // always add the projection, which will produce unsafe rows required by some operators
-      ProjectExec(project, withFilter) :: Nil
+      if (project.exists(hasScalarPythonUDF)) {
+        val references = project.map(_.references).reduce(_ ++ _).toSeq
+        ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

nit: If we already add Project on top of Filter, we don't need to add 
another Project here, right?


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212488439
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
         config)
 
       val filterCondition = postScanFilters.reduceLeftOption(And)
-      val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+
+      val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+        // add a projection before FilterExec to ensure that the rows are converted to unsafe
+        val filterExpr = filterCondition.get
+        FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
+      } else {
+        filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+      }
 
       // always add the projection, which will produce unsafe rows required by some operators
-      ProjectExec(project, withFilter) :: Nil
+      if (project.exists(hasScalarPythonUDF)) {
+        val references = project.map(_.references).reduce(_ ++ _).toSeq
+        ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

oh, I see. It is also used to make sure a PythonUDF in the top Project takes unsafe row input.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212488387
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

@skonto @lrytz this might be of interest. Don't think it's a Scala issue 
per se but just checking if that behavior change makes sense.


---




[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...

2018-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22206
  
**[Test build #95187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95187/testReport)** for PR 22206 at commit [`550368e`](https://github.com/apache/spark/commit/550368eaeebdc87f2c89bad7214f2624784eeb04).


---




[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...

2018-08-23 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/22206#discussion_r212488266
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
 ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
 config)
 
   val filterCondition = postScanFilters.reduceLeftOption(And)
-  val withFilter = filterCondition.map(FilterExec(_, 
scan)).getOrElse(scan)
+
+  val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+// add a projection before FilterExec to ensure that the rows are 
converted to unsafe
+val filterExpr = filterCondition.get
+FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, 
scan))
+  } else {
+filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+  }
 
   // always add the projection, which will produce unsafe rows 
required by some operators
-  ProjectExec(project, withFilter) :: Nil
+  if (project.exists(hasScalarPythonUDF)) {
+val references = project.map(_.references).reduce(_ ++ _).toSeq
+ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

The v2 data sources return `InternalRow`, not `UnsafeRow`. Python UDFs 
can't handle `InternalRow`, so this is intended to add a projection to convert 
to unsafe before the projection that contains a python UDF.
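
A minimal sketch of the idea, assuming the planner-local names from this diff (`filterExpr`, `scan`):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}

// ProjectExec runs an UnsafeProjection over its input, so placing it between
// the v2 scan and the filter means the Python UDF evaluates over UnsafeRows.
def filterWithUnsafeInput(filterExpr: Expression, scan: SparkPlan): SparkPlan = {
  // Project the attributes the filter references; evaluating this projection
  // is what converts each InternalRow into an UnsafeRow.
  FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
}
```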


---




[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...

2018-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22206
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2512/
Test PASSed.


---



