[GitHub] [spark] SparkQA commented on pull request #29844: [SPARK-27872][K8s] Fix executor service account inconsistency for branch-2.4

2020-09-25 Thread GitBox


SparkQA commented on pull request #29844:
URL: https://github.com/apache/spark/pull/29844#issuecomment-698424261







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29863: [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type

2020-09-25 Thread GitBox


SparkQA commented on pull request #29863:
URL: https://github.com/apache/spark/pull/29863#issuecomment-698261937










[GitHub] [spark] sarutak commented on pull request #29677: [SPARK-32820][SQL] Remove redundant shuffle exchanges inserted by EnsureRequirements

2020-09-25 Thread GitBox


sarutak commented on pull request #29677:
URL: https://github.com/apache/spark/pull/29677#issuecomment-698678194


   @c21 @imback82 @maropu @HyukjinKwon 
   Any other feedback for this change?






[GitHub] [spark] AmplabJenkins commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-698110509










[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

2020-09-25 Thread GitBox


holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494434701



##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
   if (executor == null) {
 exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
-if (decommissioned) {
-  val msg = "Asked to launch a task while decommissioned."
-  logError(msg)
-  driver match {
-case Some(endpoint) =>
-  logInfo("Sending DecommissionExecutor to driver.")
-  endpoint.send(DecommissionExecutor(executorId, 
ExecutorDecommissionInfo(msg)))
-case _ =>
-  logError("No registered driver to send Decommission to.")
-  }
-}

Review comment:
   Right, so we should resend the notice then, right?

##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
   logInfo(s"Received tokens of ${tokenBytes.length} bytes")
   SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-case DecommissionSelf =>
-  logInfo("Received decommission self")
+case DecommissionExecutor =>
   decommissionSelf()
+
+case ExecutorSigPWRReceived =>
+  decommissionSelf()
+  if (driver.nonEmpty) {

Review comment:
   We don’t ask the driver to stop scheduling jobs on us first, so the driver 
could ask us to run a job while we are partway through decommissioning. This 
won’t result in a failure because we’ll accept the job, but it will slow down 
the decommissioning. So swap the order of these two.
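
The ordering concern in this comment can be illustrated with a minimal Python sketch. Everything here (`ToyDriver`, `ToyExecutor`, `on_sigpwr`) is a hypothetical stand-in, not Spark's actual classes, and it assumes the notification takes effect immediately, which real asynchronous RPC does not guarantee. It only shows why notifying the driver before starting local decommissioning narrows the window in which a new task can arrive.

```python
class ToyDriver:
    """Hypothetical driver: schedules tasks only on executors it believes are active."""

    def __init__(self):
        self.decommissioned = set()

    def mark_decommissioned(self, executor_id):
        self.decommissioned.add(executor_id)

    def try_schedule(self, executor_id):
        # True means the driver would still send a task to this executor.
        return executor_id not in self.decommissioned


class ToyExecutor:
    """Hypothetical executor reacting to a SIGPWR-style signal."""

    def __init__(self, executor_id, driver):
        self.executor_id = executor_id
        self.driver = driver
        self.tasks_accepted_while_decommissioning = 0

    def decommission_self(self):
        # Simulate the driver trying to schedule a task while the
        # (slow) local decommissioning is still in progress.
        if self.driver.try_schedule(self.executor_id):
            self.tasks_accepted_while_decommissioning += 1

    def on_sigpwr(self, notify_driver_first):
        if notify_driver_first:
            self.driver.mark_decommissioned(self.executor_id)
            self.decommission_self()
        else:
            self.decommission_self()
            self.driver.mark_decommissioned(self.executor_id)
        return self.tasks_accepted_while_decommissioning
```

In this toy model, `on_sigpwr(notify_driver_first=True)` accepts no late task, while the reversed order accepts one.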

##
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
 blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = 
storageEndpoint.ask(DecommissionBlockManager)

Review comment:
   Why did you make this change?

##
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
 logInfo("Registering SIGPWR handler to trigger decommissioning.")
 SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-  "disabling worker decommission feature.")(decommissionSelf)
+  "disabling worker decommission feature.") {
+   self.send(WorkerSigPWRReceived)

Review comment:
   Can you look into what this change in behavior might cause at the system 
level and then tell me if that’s a desired change? I’m OK with us making 
changes here; I just want us to be intentional and know whether we need to 
test the change, and it seems like this change was incidental.








[GitHub] [spark] HeartSaVioR commented on a change in pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-25 Thread GitBox


HeartSaVioR commented on a change in pull request #28841:
URL: https://github.com/apache/spark/pull/28841#discussion_r494268346



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -467,6 +467,12 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `pathGlobFilter`: an optional glob pattern to only include files with 
paths matching
* the pattern. The syntax follows 
org.apache.hadoop.fs.GlobFilter.
* It does not change the behavior of partition discovery.
+   * `modifiedBefore`: an optional timestamp to only include files with
+   * modification times  occurring before the specified Time. The provided 
timestamp
+   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+   * `modifiedAfter`: an optional timestamp to only include files with

Review comment:
   ditto

##
File path: python/pyspark/sql/readwriter.py
##
@@ -184,7 +196,8 @@ def json(self, path, schema=None, primitivesAsString=None, 
prefersDecimal=None,
  mode=None, columnNameOfCorruptRecord=None, dateFormat=None, 
timestampFormat=None,
  multiLine=None, allowUnquotedControlChars=None, lineSep=None, 
samplingRatio=None,
  dropFieldIfAllNull=None, encoding=None, locale=None, 
pathGlobFilter=None,
- recursiveFileLookup=None, allowNonNumericNumbers=None):
+ recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None,

Review comment:
   Probably better not to change the order. With such a huge number of 
parameters, I think end users will use named parameters almost every time, but 
just to be sure.
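
A minimal Python sketch of why the parameter order matters for positional callers. The three functions below are hypothetical, heavily trimmed stand-ins for the `json` reader signature (most parameters are omitted); they are not PySpark code.

```python
def json_old(path, schema=None, recursiveFileLookup=None, allowNonNumericNumbers=None):
    """Trimmed stand-in for the signature before the change."""
    return {"recursiveFileLookup": recursiveFileLookup,
            "allowNonNumericNumbers": allowNonNumericNumbers}


def json_inserted(path, schema=None, recursiveFileLookup=None,
                  modifiedBefore=None, modifiedAfter=None,
                  allowNonNumericNumbers=None):
    """New parameters inserted in the middle: positional callers shift."""
    return {"recursiveFileLookup": recursiveFileLookup,
            "modifiedBefore": modifiedBefore,
            "allowNonNumericNumbers": allowNonNumericNumbers}


def json_appended(path, schema=None, recursiveFileLookup=None,
                  allowNonNumericNumbers=None,
                  modifiedBefore=None, modifiedAfter=None):
    """New parameters appended at the end: positional callers are unaffected."""
    return {"recursiveFileLookup": recursiveFileLookup,
            "modifiedBefore": modifiedBefore,
            "allowNonNumericNumbers": allowNonNumericNumbers}


# The same positional call against each variant.
args = ("data.json", None, True, False)
old = json_old(*args)
inserted = json_inserted(*args)
appended = json_appended(*args)
```

With the inserted variant, the fourth positional argument silently lands in `modifiedBefore` instead of `allowNonNumericNumbers`; the appended variant keeps the old meaning.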

##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -752,6 +764,12 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `pathGlobFilter`: an optional glob pattern to only include files with 
paths matching
* the pattern. The syntax follows 
org.apache.hadoop.fs.GlobFilter.
* It does not change the behavior of partition discovery.
+   * `modifiedBefore`: an optional timestamp to only include files with
+   * modification times  occurring before the specified Time. The provided 
timestamp
+   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+   * `modifiedAfter`: an optional timestamp to only include files with

Review comment:
   ditto

##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -785,6 +803,12 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `pathGlobFilter`: an optional glob pattern to only include files with 
paths matching
* the pattern. The syntax follows 
org.apache.hadoop.fs.GlobFilter.
* It does not change the behavior of partition discovery.
+   * `modifiedBefore`: an optional timestamp to only include files with

Review comment:
   ditto

##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -785,6 +803,12 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `pathGlobFilter`: an optional glob pattern to only include files with 
paths matching
* the pattern. The syntax follows 
org.apache.hadoop.fs.GlobFilter.
* It does not change the behavior of partition discovery.
+   * `modifiedBefore`: an optional timestamp to only include files with
+   * modification times  occurring before the specified Time. The provided 
timestamp
+   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+   * `modifiedAfter`: an optional timestamp to only include files with

Review comment:
   ditto
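
For readers following the `modifiedBefore` / `modifiedAfter` documentation in the hunks above, here is a rough Python sketch of the filtering idea, using the example value `2020-06-01T13:00:00` from the doc comment. The helper names, the strict comparisons, and the naive (timezone-free) timestamps are all assumptions for illustration; this is not how Spark implements the options.

```python
from datetime import datetime

# Format matching the documented example 2020-06-01T13:00:00.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_threshold(value):
    """Parse a threshold string; raises ValueError on a malformed value."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def filter_by_mtime(files, modified_before=None, modified_after=None):
    """Keep names of (name, mtime) pairs inside the requested window.

    Assumption: both bounds are exclusive.
    """
    before = parse_threshold(modified_before) if modified_before else None
    after = parse_threshold(modified_after) if modified_after else None
    kept = []
    for name, mtime in files:
        if before is not None and not mtime < before:
            continue
        if after is not None and not mtime > after:
            continue
        kept.append(name)
    return kept


files = [
    ("a", datetime(2020, 5, 31, 12, 0, 0)),
    ("b", datetime(2020, 6, 1, 13, 0, 0)),
    ("c", datetime(2020, 6, 2, 0, 0, 0)),
]
```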

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/pathFilters.scala
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.util.{Locale, TimeZone}
+
+import org.apache.hadoop.fs.{FileStatus, GlobFilter}
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, 

[GitHub] [spark] SparkQA removed a comment on pull request #29533: [SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29533:
URL: https://github.com/apache/spark/pull/29533#issuecomment-698523837


   **[Test build #129084 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129084/testReport)**
 for PR 29533 at commit 
[`6449efa`](https://github.com/apache/spark/commit/6449efa72b2f7ff2aea53139520a04ef37b72f18).






[GitHub] [spark] AmplabJenkins commented on pull request #29756: [SPARK-32885][SS] Add DataStreamReader.table API

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29756:
URL: https://github.com/apache/spark/pull/29756#issuecomment-698101187










[GitHub] [spark] SparkQA removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments num

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-698077548


   **[Test build #129057 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129057/testReport)**
 for PR 29054 at commit 
[`918aea4`](https://github.com/apache/spark/commit/918aea452c8e9c7d98574726e8e6ddde8c05624c).






[GitHub] [spark] github-actions[bot] closed pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-09-25 Thread GitBox


github-actions[bot] closed pull request #27604:
URL: https://github.com/apache/spark/pull/27604


   






[GitHub] [spark] SparkQA removed a comment on pull request #29863: [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29863:
URL: https://github.com/apache/spark/pull/29863#issuecomment-698261937










[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-09-25 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r494487660



##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();
+  return 4 + b.serializedSizeInBytes();
+}
+
+public static void encode(ByteBuf buf, RoaringBitmap b) {
+  ByteBuffer outBuffer = ByteBuffer.allocate(b.serializedSizeInBytes());
+  try {
+b.serialize(new DataOutputStream(new OutputStream() {
+  ByteBuffer buffer;
+
+  OutputStream init(ByteBuffer buffer) {
+this.buffer = buffer;
+return this;
+  }
+
+  @Override
+  public void close() {
+  }
+
+  @Override
+  public void flush() {
+  }
+
+  @Override
+  public void write(int b) {
+buffer.put((byte) b);
+  }
+
+  @Override
+  public void write(byte[] b) {
+buffer.put(b);
+  }
+
+  @Override
+  public void write(byte[] b, int off, int l) {
+buffer.put(b, off, l);
+  }
+}.init(outBuffer)));
+  } catch (IOException e) {
+throw new RuntimeException("Exception while encoding bitmap", e);
+  }
+  byte[] bytes = outBuffer.array();
+  buf.writeInt(bytes.length);
+  buf.writeBytes(bytes);
+}
+
+public static RoaringBitmap decode(ByteBuf buf) {
+  int length = buf.readInt();
+  byte[] bytes = new byte[length];
+  buf.readBytes(bytes);

Review comment:
   This would require using ByteArrays.encode to encode the original byte 
arrays. I think @Ngone51 's recommendation earlier makes sense, that we should 
use roaringbitmap#serialize(ByteBuffer) to avoid the one additional memory copy 
during encoding. By doing that, we would directly serialize into the ByteBuf, 
and it won't be possible to use ByteArrays.encode to encode the corresponding 
byte arrays.

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();
+  return 4 + b.serializedSizeInBytes();
+}
+
+public static void encode(ByteBuf buf, RoaringBitmap b) {
+  ByteBuffer outBuffer = ByteBuffer.allocate(b.serializedSizeInBytes());
+  try {
+b.serialize(new DataOutputStream(new OutputStream() {

Review comment:
   Good point, I think this also avoids one more memory copy.

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();
+  return 4 + b.serializedSizeInBytes();
+}
+
+public static void encode(ByteBuf buf, RoaringBitmap b) {
+  ByteBuffer outBuffer = ByteBuffer.allocate(b.serializedSizeInBytes());

Review comment:
   Yes, BlockTransferMessage.toByteBuffer ensures that.
   Need to know the encodedLength in order to create the encoding ByteBuf in 
the first place.
   Will add a comment to clarify this.
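
The length-prefixed layout discussed here (a 4-byte length followed by the serialized bytes, with `encodedLength` known before the buffer is allocated) can be sketched in Python as follows. This is an illustration only: it works on plain byte strings instead of a `RoaringBitmap`, the function names are hypothetical, and the big-endian prefix is chosen to mirror the default byte order of Netty's `ByteBuf.writeInt`.

```python
import struct


def encoded_length(payload):
    """Size needed up front: 4-byte length prefix plus the payload itself."""
    return 4 + len(payload)


def encode(buf, payload):
    """Append the big-endian length prefix and then the payload to buf."""
    buf.extend(struct.pack(">i", len(payload)))
    buf.extend(payload)


def decode(buf, offset=0):
    """Read one length-prefixed payload; return (payload, next_offset)."""
    (length,) = struct.unpack_from(">i", buf, offset)
    start = offset + 4
    return bytes(buf[start:start + length]), start + length


payload = b"\x01\x02\x03"
buf = bytearray()
encode(buf, payload)
decoded, next_offset = decode(buf)
```

Because the total size is computable before encoding, the caller can allocate the destination buffer once and write into it directly.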

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();

Review comment:
   It should be invoked only once.
   BlockTransferMessage.toByteBuffer is where the initial call to encodedLength 
happens.
   It's only called once for each RoaringBitmap in the bitmap array.

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java
##
@@ -209,12 +225,17 @@ public void onData(String streamId, ByteBuffer buf) 
throws IOException {

[GitHub] [spark] SparkQA removed a comment on pull request #29859: [SPARK-32971][K8S][FOLLOWUP] Add `.toSeq` for Scala 2.13 compilation

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29859:
URL: https://github.com/apache/spark/pull/29859#issuecomment-698075792


   **[Test build #129056 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129056/testReport)**
 for PR 29859 at commit 
[`19d9a2f`](https://github.com/apache/spark/commit/19d9a2f302baf0cf9c9382f28622b83355103d7e).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29868: [SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in inputCols

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29868:
URL: https://github.com/apache/spark/pull/29868#issuecomment-698707878










[GitHub] [spark] holdenk commented on pull request #29471: [SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core

2020-09-25 Thread GitBox


holdenk commented on pull request #29471:
URL: https://github.com/apache/spark/pull/29471#issuecomment-698493851










[GitHub] [spark] sunchao commented on a change in pull request #29843: [WIP][SPARK-29250] Upgrade to Hadoop 3.2.1 and move to shaded client

2020-09-25 Thread GitBox


sunchao commented on a change in pull request #29843:
URL: https://github.com/apache/spark/pull/29843#discussion_r494467897



##
File path: external/kafka-0-10-sql/pom.xml
##
@@ -79,6 +79,10 @@
   kafka-clients
   ${kafka.version}
 
+
+  com.google.code.findbugs

Review comment:
   Thanks. Yes will do after making all tests pass.

##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
##
@@ -118,11 +118,15 @@ private[hive] object IsolatedClientLoader extends Logging 
{
   hadoopVersion: String,
   ivyPath: Option[String],
   remoteRepos: String): Seq[URL] = {
+val hadoopJarName = if (hadoopVersion.startsWith("3")) {

Review comment:
   Yes, I think so. These modules should be available in any production 
Hadoop 3.x release. See 
https://issues.apache.org/jira/browse/HADOOP-11804, which was fixed in 
3.0.0-alpha2.








[GitHub] [spark] AmplabJenkins commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251280










[GitHub] [spark] dongjoon-hyun commented on pull request #29533: [SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s

2020-09-25 Thread GitBox


dongjoon-hyun commented on pull request #29533:
URL: https://github.com/apache/spark/pull/29533#issuecomment-698512589


   It seems that the `SparkR` test fails.
   ```
   KubernetesSuite:
   - Run SparkPi with no resources
   - Run SparkPi with a very long application name.
   - Use SparkLauncher.NO_RESOURCE
   - Run SparkPi with a master URL without a scheme.
   - Run SparkPi with an argument.
   - Run SparkPi with custom labels, annotations, and environment variables.
   - All pods have the same service account by default
   - Run extraJVMOptions check on driver
   - Run SparkRemoteFileTest using a remote data file
   - Run SparkPi with env and mount secrets.
   - Run PySpark on simple pi.py example
   - Run PySpark with Python2 to test a pyfiles example
   - Run PySpark with Python3 to test a pyfiles example
   - Run PySpark with memory customization
   - Run in client mode.
   - Start pod creation from template
   - PVs with local storage
   - Launcher client dependencies
   - Run SparkR on simple dataframe.R example *** FAILED ***
   ```






[GitHub] [spark] cloud-fan commented on pull request #29798: [SPARK-32931][SQL] Unevaluable Expressions are not Foldable

2020-09-25 Thread GitBox


cloud-fan commented on pull request #29798:
URL: https://github.com/apache/spark/pull/29798#issuecomment-698770919


   thanks, merging to master!






[GitHub] [spark] AmplabJenkins commented on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29797:
URL: https://github.com/apache/spark/pull/29797#issuecomment-698907651










[GitHub] [spark] srowen closed pull request #29833: [SPARK-32886][SPARK-31882][WEBUI][2.4] fix 'undefined' link in event timeline view

2020-09-25 Thread GitBox


srowen closed pull request #29833:
URL: https://github.com/apache/spark/pull/29833


   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29024:
URL: https://github.com/apache/spark/pull/29024#issuecomment-698852755










[GitHub] [spark] srowen commented on a change in pull request #29868: [SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in inputCols

2020-09-25 Thread GitBox


srowen commented on a change in pull request #29868:
URL: https://github.com/apache/spark/pull/29868#discussion_r494947145



##
File path: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala
##
@@ -91,8 +91,7 @@ class FeatureHasher(@Since("2.3.0") override val uid: String) 
extends Transforme
   /**
* Numeric columns to treat as categorical features. By default only string 
and boolean
* columns are treated as categorical, so this param can be used to 
explicitly specify the
-   * numerical columns to treat as categorical. Note, the relevant columns 
must also be set in
-   * `inputCols`.
+   * numerical columns to treat as categorical.

Review comment:
   This is still 'required', right? We're not making it an error, but it 
won't have any effect if not in `inputCols`.








[GitHub] [spark] SparkQA commented on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader at final stage on DataWritingCommand

2020-09-25 Thread GitBox


SparkQA commented on pull request #29797:
URL: https://github.com/apache/spark/pull/29797#issuecomment-698914035


   **[Test build #129111 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129111/testReport)**
 for PR 29797 at commit 
[`84134b0`](https://github.com/apache/spark/commit/84134b09ef5295818a32d9dc4612141fe93fa05c).






[GitHub] [spark] c21 commented on a change in pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-25 Thread GitBox


c21 commented on a change in pull request #29804:
URL: https://github.com/apache/spark/pull/29804#discussion_r494083795



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.bucketing
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Partial, PartialMerge}
+import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashClusteredDistribution}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SortExec, SparkPlan}
+import org.apache.spark.sql.execution.aggregate.BaseAggregateExec
+import org.apache.spark.sql.execution.exchange.Exchange
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Disable unnecessary bucketed table scan based on actual physical query plan.
+ * NOTE: this rule is designed to be applied right after [[EnsureRequirements]],
+ * where all [[ShuffleExchangeExec]] and [[SortExec]] have been added to plan properly.
+ *
+ * When BUCKETING_ENABLED and AUTO_BUCKETED_SCAN_ENABLED are set to true, go through
+ * query plan to check where bucketed table scan is unnecessary, and disable bucketed table
+ * scan if needed.
+ *
+ * For all operators which [[hasInterestingPartition]] (i.e., require [[ClusteredDistribution]]
+ * or [[HashClusteredDistribution]]), check if the sub-plan for operator has [[Exchange]] and
+ * bucketed table scan. If yes, disable the bucketed table scan in the sub-plan.
+ * Only allow certain operators in sub-plan, which guarantees each sub-plan is single lineage
+ * (i.e., each operator has only one child). See details in
+ * [[disableBucketWithInterestingPartition]]).
+ *
+ * Examples:
+ * (1).join:
+ *         SortMergeJoin(t1.i = t2.j)
+ *            /            \
+ *        Sort(i)        Sort(j)
+ *          /               \
+ *      Shuffle(i)       Scan(t2: i, j)
+ *        /         (bucketed on column j, enable bucketed scan)
+ *   Scan(t1: i, j)
+ * (bucketed on column j, DISABLE bucketed scan)
+ *
+ * (2).aggregate:
+ * HashAggregate(i, ..., Final)
+ *  |
+ *  Shuffle(i)
+ *  |
+ * HashAggregate(i, ..., Partial)
+ *  |
+ *Filter
+ *  |
+ *  Scan(t1: i, j)
+ *  (bucketed on column j, DISABLE bucketed scan)
+ *
+ * The idea of [[hasInterestingPartition]] is inspired from "interesting order" in
+ * the paper "Access Path Selection in a Relational Database Management System"
+ * (http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf).
+ */
+case class DisableUnnecessaryBucketedScan(conf: SQLConf) extends Rule[SparkPlan] {
+
+  /**
+   * Disable bucketed table scan with pre-order traversal of plan.
+   *
+   * @param withInterestingPartition The traversed plan has operator with interesting partition.
+   * @param withExchange The traversed plan has [[Exchange]] operator.
+   */
+  private def disableBucketWithInterestingPartition(
+  plan: SparkPlan,
+  withInterestingPartition: Boolean,
+  withExchange: Boolean): SparkPlan = {
+plan match {
+  case p if hasInterestingPartition(p) =>
+// Operators with interesting partition, propagates `withInterestingPartition` as true
+// to its children.
+p.mapChildren(disableBucketWithInterestingPartition(_, true, false))
+  case exchange: Exchange if withInterestingPartition =>
+// Exchange operator propagates `withExchange` as true to its child
+// if the plan has interesting partition.
+exchange.mapChildren(disableBucketWithInterestingPartition(
+  _, withInterestingPartition, true))
+  case scan: FileSourceScanExec
+  if withInterestingPartition && withExchange && isBucketedScanWithoutFilter(scan) =>
+// Disable bucketed table scan if the plan has interesting partition,
+// and [[Exchange]] in the plan.
+scan.copy(disableBucketedScan = true)
+  case o =>
+if 
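
The flag-propagating traversal described in the Scaladoc above can be sketched on a toy plan tree. This is a minimal Java sketch, not the PR's code: `BucketedScanSketch`, `Node`, and `Kind` are hypothetical names, the sketch mutates nodes in place instead of copying them, and the handling of the final fallback case (the quoted diff is cut off there) is an assumption based only on the Scaladoc sentence about allowing certain single-child operators.

```java
import java.util.Arrays;
import java.util.List;

final class BucketedScanSketch {
    /** Toy operator categories; the real rule inspects SparkPlan nodes instead. */
    enum Kind { INTERESTING, EXCHANGE, ALLOWED_UNARY, BUCKETED_SCAN, OTHER }

    static final class Node {
        final String name;
        final Kind kind;
        final List<Node> children;
        boolean bucketedScanDisabled = false;

        Node(String name, Kind kind, Node... children) {
            this.name = name;
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    /** Pre-order traversal that carries the two flags down to the scans. */
    static void disable(Node plan, boolean withInterestingPartition, boolean withExchange) {
        if (plan.kind == Kind.INTERESTING) {
            // Operator with an interesting partition: start a fresh sub-plan.
            for (Node c : plan.children) disable(c, true, false);
        } else if (plan.kind == Kind.EXCHANGE && withInterestingPartition) {
            // An exchange below an interesting operator sets withExchange.
            for (Node c : plan.children) disable(c, true, true);
        } else if (plan.kind == Kind.BUCKETED_SCAN
                && withInterestingPartition && withExchange) {
            // Bucketing cannot help here, so turn the bucketed scan off.
            plan.bucketedScanDisabled = true;
        } else {
            // ASSUMPTION: allowed single-child operators keep the flags,
            // anything else resets them for its children.
            boolean keep = plan.kind == Kind.ALLOWED_UNARY;
            for (Node c : plan.children) {
                disable(c, keep && withInterestingPartition, keep && withExchange);
            }
        }
    }

    public static void main(String[] args) {
        // Example (1) from the Scaladoc: only t1 sits below a shuffle.
        Node t1 = new Node("Scan(t1)", Kind.BUCKETED_SCAN);
        Node t2 = new Node("Scan(t2)", Kind.BUCKETED_SCAN);
        Node join = new Node("SortMergeJoin", Kind.INTERESTING,
            new Node("Sort(i)", Kind.ALLOWED_UNARY,
                new Node("Shuffle(i)", Kind.EXCHANGE, t1)),
            new Node("Sort(j)", Kind.ALLOWED_UNARY, t2));
        disable(join, false, false);
        System.out.println(t1.name + " disabled: " + t1.bucketedScanDisabled);
        System.out.println(t2.name + " disabled: " + t2.bucketedScanDisabled);
    }
}
```

In this sketch the scan of t1 is disabled because it is reached with both flags set, while the scan of t2 keeps its bucketed scan since no exchange lies between it and the join.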

[GitHub] [spark] dongjoon-hyun closed pull request #29853: [SPARK-32977][SQL][DOCS] Fix JavaDoc on Default Save Mode

2020-09-25 Thread GitBox


dongjoon-hyun closed pull request #29853:
URL: https://github.com/apache/spark/pull/29853


   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

2020-09-25 Thread GitBox


HyukjinKwon commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r49425



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##
@@ -93,6 +93,12 @@ object CSVUtils {
   value
 }
   }
+  if (header.sameElements(row)) {
+header
+  } else {
+// Ensure that the newly generated and existing headers are not duplicated.
+makeSafeHeader(header, caseSensitive, options)
+  }

Review comment:
   Can you check how R's `read_csv` works in this case? That patch was inspired by R's one.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##
@@ -93,6 +93,12 @@ object CSVUtils {
   value
 }
   }
+  if (header.sameElements(row)) {
+header
+  } else {
+// Ensure that the newly generated and existing headers are not duplicated.
+makeSafeHeader(header, caseSensitive, options)
+  }

Review comment:
   Can you check how R's `read_csv` works in this case? This behaviour was inspired by R's one.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##
@@ -93,6 +93,12 @@ object CSVUtils {
   value
 }
   }
+  if (header.sameElements(row)) {
+header
+  } else {
+// Ensure that the newly generated and existing headers are not duplicated.
+makeSafeHeader(header, caseSensitive, options)
+  }

Review comment:
   Can we follow this behaviour?

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##
@@ -93,6 +93,12 @@ object CSVUtils {
   value
 }
   }
+  if (header.sameElements(row)) {
+header
+  } else {
+// Ensure that the newly generated and existing headers are not duplicated.
+makeSafeHeader(header, caseSensitive, options)
+  }

Review comment:
   I mean the numbering. Can we create a name like `a1 a3 a4 a2` for `a, a, a, a, a.2`?








[GitHub] [spark] dongjoon-hyun commented on pull request #29861: [SPARK-32971][K8S][FOLLOWUP] Fix k8s-core module compilation in Scala 2.13

2020-09-25 Thread GitBox


dongjoon-hyun commented on pull request #29861:
URL: https://github.com/apache/spark/pull/29861#issuecomment-698111394










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29843: [WIP][SPARK-29250] Upgrade to Hadoop 3.2.1 and move to shaded client

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29843:
URL: https://github.com/apache/spark/pull/29843#issuecomment-698120109










[GitHub] [spark] cloud-fan closed pull request #29756: [SPARK-32885][SS] Add DataStreamReader.table API

2020-09-25 Thread GitBox


cloud-fan closed pull request #29756:
URL: https://github.com/apache/spark/pull/29756


   






[GitHub] [spark] AmplabJenkins commented on pull request #29867: [SPARK-32889][SQL][TESTS][FOLLOWUP][test-hadoop2.7][test-hive1.2] Skip special column names test in Hive 1.2

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29867:
URL: https://github.com/apache/spark/pull/29867#issuecomment-698623561










[GitHub] [spark] dongjoon-hyun commented on pull request #29857: [SPARK-32972][ML] Fix UTs of `mllib` module in Scala 2.13 except RandomForestRegressorSuite

2020-09-25 Thread GitBox


dongjoon-hyun commented on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698112690










[GitHub] [spark] maropu commented on a change in pull request #29828: [SPARK-32948][SQL] Optimize to_json and from_json expression chain

2020-09-25 Thread GitBox


maropu commented on a change in pull request #29828:
URL: https://github.com/apache/spark/pull/29828#discussion_r494010395



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JsonSuite.scala
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types._
+
+class JsonSuite extends PlanTest with ExpressionEvalHelper {
+
+  object Optimizer extends RuleExecutor[LogicalPlan] {
+val batches = Batch("Json optimization", FixedPoint(10), OptimizeJsonExprs) :: Nil
+  }
+
+  val schema = StructType.fromDDL("a int, b int")
+
+  private val structAtt = 'struct.struct(schema).notNull
+
+  private val testRelation = LocalRelation(structAtt)
+
+  test("SPARK-32948: optimize from_json + to_json") {
+val options = Map.empty[String, String]
+
+val query1 = testRelation
+  .select(JsonToStructs(schema, options, StructsToJson(options, 'struct)).as("struct"))
+val optimized1 = Optimizer.execute(query1.analyze)
+
+val expected = testRelation.select('struct.as("struct")).analyze
+comparePlans(optimized1, expected)
+
+val query2 = testRelation
+  .select(
+JsonToStructs(schema, options,
+  StructsToJson(options,
+JsonToStructs(schema, options,
+  StructsToJson(options, 'struct)))).as("struct"))
+val optimized2 = Optimizer.execute(query2.analyze)
+
+comparePlans(optimized2, expected)
+  }
+
+  test("SPARK-32948: not optimize from_json + to_json if schema is different") {
+val options = Map.empty[String, String]
+val schema = StructType.fromDDL("a int")
+
+val query = testRelation
+  .select(JsonToStructs(schema, options, StructsToJson(options, 'struct)).as("struct"))
+val optimized = Optimizer.execute(query.analyze)
+
+val expected = testRelation.select(
+  JsonToStructs(schema, options, StructsToJson(options, 'struct)).as("struct")).analyze
+comparePlans(optimized, expected)
+  }
+
+  test("SPARK-32948: not optimize from_json + to_json if option is not empty") {

Review comment:
   Could you add tests with different timezone cases, too?








[GitHub] [spark] LuciferYang removed a comment on pull request #29864: [SPARK-32987][MESOS] Pass all `mllib` module UTs in Scala 2.13

2020-09-25 Thread GitBox


LuciferYang removed a comment on pull request #29864:
URL: https://github.com/apache/spark/pull/29864#issuecomment-698276483


   cc @srowen @dongjoon-hyun to review this patch ~ thx 






[GitHub] [spark] zhengruifeng commented on pull request #29868: [SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in inputCols

2020-09-25 Thread GitBox


zhengruifeng commented on pull request #29868:
URL: https://github.com/apache/spark/pull/29868#issuecomment-698693191










[GitHub] [spark] juliuszsompolski commented on a change in pull request #29834: [SPARK-32963][SQL] empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-25 Thread GitBox


juliuszsompolski commented on a change in pull request #29834:
URL: https://github.com/apache/spark/pull/29834#discussion_r494156528



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetSchemasOperation.scala
##
@@ -77,7 +77,8 @@ private[hive] class SparkGetSchemasOperation(
 
   val globalTempViewDb = 
sqlContext.sessionState.catalog.globalTempViewManager.database
   val databasePattern = 
Pattern.compile(CLIServiceUtils.patternToRegex(schemaName))
-  if (databasePattern.matcher(globalTempViewDb).matches()) {
+  if (schemaName == null || schemaName.isEmpty ||

Review comment:
   
https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getSchemas(java.lang.String,%20java.lang.String)
   `schemaPattern - a schema name; must match the schema name as it is stored in the database; null means schema name should not be used to narrow down the search.`
   This doc doesn't mention empty string, but if it's treated as a pattern, it should default to empty string not matching anything.
   schemaName == null is already handled to match everything in patternToRegex:
   ```
 public static String patternToRegex(String pattern) {
   if (pattern == null) {
 return ".*";
   } else {
   ```
   
   So the current behaviour seems to be consistent with JDBC documentation?
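
To make the null versus empty-string distinction above concrete, here is a minimal Java sketch. It assumes a simplified pattern-to-regex helper (`SchemaPatternSketch.toRegex` is a hypothetical stand-in, not Hive's `CLIServiceUtils.patternToRegex`, and it ignores escape characters): null becomes `.*`, while an empty pattern becomes an empty regex that only matches an empty name.

```java
import java.util.regex.Pattern;

final class SchemaPatternSketch {
    /** Simplified stand-in for a JDBC search-pattern-to-regex helper. */
    static String toRegex(String pattern) {
        if (pattern == null) {
            return ".*"; // null: do not narrow the search at all
        }
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");   // JDBC '%' matches any substring
            } else if (c == '_') {
                regex.append('.');    // JDBC '_' matches a single character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return regex.toString();
    }

    /** True when the whole schema name matches the search pattern. */
    static boolean matches(String schemaPattern, String schemaName) {
        return Pattern.compile(toRegex(schemaPattern)).matcher(schemaName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(null, "global_temp"));      // null pattern
        System.out.println(matches("", "global_temp"));        // empty pattern
        System.out.println(matches("global%", "global_temp")); // wildcard pattern
    }
}
```

Under these assumptions a null pattern matches `global_temp`, an empty pattern does not, and `global%` does, which is the behaviour the comment above argues is already consistent with the JDBC wording.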

##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetSchemasOperation.scala
##
@@ -77,7 +77,8 @@ private[hive] class SparkGetSchemasOperation(
 
   val globalTempViewDb = 
sqlContext.sessionState.catalog.globalTempViewManager.database
   val databasePattern = 
Pattern.compile(CLIServiceUtils.patternToRegex(schemaName))
-  if (databasePattern.matcher(globalTempViewDb).matches()) {
+  if (schemaName == null || schemaName.isEmpty ||

Review comment:
   
https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getTables(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String[])
   `schemaPattern - a schema name pattern; must match the schema name as it is stored in the database; "" retrieves those without a schema; null means that the schema name should not be used to narrow the search`
   The behaviour for getTables treats "" as no schema (e.g. local temp views), not all schemas, so it seems consistent that getSchemas wouldn't treat "" as "all schemas".








[GitHub] [spark] SparkQA commented on pull request #29795: [SPARK-32511][SQL] Add dropFields method to Column class

2020-09-25 Thread GitBox


SparkQA commented on pull request #29795:
URL: https://github.com/apache/spark/pull/29795#issuecomment-698674623










[GitHub] [spark] AmplabJenkins commented on pull request #29866: [SPARK-32990][SQL] Migrate REFRESH TABLE to use UnresolvedTableOrView to resolve the identifier

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29866:
URL: https://github.com/apache/spark/pull/29866#issuecomment-698617283










[GitHub] [spark] HeartSaVioR commented on pull request #29425: [SPARK-32350][FOLLOW-UP] Fix count update issue and partition the value list to a set of small batches for LevelDB writeAll

2020-09-25 Thread GitBox


HeartSaVioR commented on pull request #29425:
URL: https://github.com/apache/spark/pull/29425#issuecomment-698303116


   Sorry, I still have several things on my plate and have been struggling with them. You'd better ping @mridulm, as he'd understand the patch well.
   
   @mridulm I'd appreciate it if you have time to look into this. Thanks.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29844: [SPARK-27872][K8s] Fix executor service account inconsistency for branch-2.4

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29844:
URL: https://github.com/apache/spark/pull/29844#issuecomment-697043404










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29756: [SPARK-32885][SS] Add DataStreamReader.table API

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29756:
URL: https://github.com/apache/spark/pull/29756#issuecomment-698101187










[GitHub] [spark] SparkQA removed a comment on pull request #29828: [SPARK-32948][SQL] Optimize to_json and from_json expression chain

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29828:
URL: https://github.com/apache/spark/pull/29828#issuecomment-698088612










[GitHub] [spark] MLnick commented on pull request #29850: [SPARK-32974][ML] FeatureHasher transform optimization

2020-09-25 Thread GitBox


MLnick commented on pull request #29850:
URL: https://github.com/apache/spark/pull/29850#issuecomment-698112434










[GitHub] [spark] dongjoon-hyun commented on pull request #29844: [SPARK-27872][K8s] Fix executor service account inconsistency for branch-2.4

2020-09-25 Thread GitBox


dongjoon-hyun commented on pull request #29844:
URL: https://github.com/apache/spark/pull/29844#issuecomment-698422594


   ok to test






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29867: [SPARK-32889][SQL][TESTS][FOLLOWUP][test-hadoop2.7][test-hive1.2] Skip special column names test in Hive 1.2

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29867:
URL: https://github.com/apache/spark/pull/29867#issuecomment-698623561










[GitHub] [spark] cloud-fan commented on a change in pull request #29795: [SPARK-32511][SQL] Add dropFields method to Column class

2020-09-25 Thread GitBox


cloud-fan commented on a change in pull request #29795:
URL: https://github.com/apache/spark/pull/29795#discussion_r494745836



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Column.scala
##
@@ -901,39 +901,125 @@ class Column(val expr: Expression) extends Logging {
*   // result: org.apache.spark.sql.AnalysisException: Ambiguous reference 
to fields
* }}}
*
+   * This method supports adding/replacing nested fields directly e.g.
+   *
+   * {{{
+   *   val df = sql("SELECT named_struct('a', named_struct('a', 1, 'b', 2)) struct_col")
+   *   df.select($"struct_col".withField("a.c", lit(3)).withField("a.d", lit(4)))
+   *   // result: {"a":{"a":1,"b":2,"c":3,"d":4}}
+   * }}}
+   *
+   * However, if you are going to add/replace multiple nested fields, it is more optimal to extract
+   * out the nested struct before adding/replacing multiple fields e.g.
+   *
+   * {{{
+   *   val df = sql("SELECT named_struct('a', named_struct('a', 1, 'b', 2)) struct_col")
+   *   df.select($"struct_col".withField("a", $"struct_col.a".withField("c", lit(3)).withField("d", lit(4))))
+   *   // result: {"a":{"a":1,"b":2,"c":3,"d":4}}
+   * }}}
+   *
* @group expr_ops
* @since 3.1.0
*/
   // scalastyle:on line.size.limit
   def withField(fieldName: String, col: Column): Column = withExpr {
 require(fieldName != null, "fieldName cannot be null")
 require(col != null, "col cannot be null")
+updateFieldsHelper(expr, nameParts(fieldName), name => WithField(name, 
col.expr))
+  }
 
-val nameParts = if (fieldName.isEmpty) {
+  // scalastyle:off line.size.limit
+  /**
+   * An expression that drops fields in `StructType` by name.

Review comment:
   It's semantically a no-op. We can optimize away the struct reconstruction later.

##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/UpdateFieldsBenchmark.scala
##
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
+
+/**
+ * Benchmark to measure Spark's performance analyzing and optimizing long 
UpdateFields chains.
+ *
+ * {{{
+ *   To run this benchmark:
+ *   1. without sbt:
+ *  bin/spark-submit --class  
+ *   2. with sbt:
+ *  build/sbt "sql/test:runMain "
+ *   3. generate result:
+ *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *   Results will be written to "benchmarks/UpdateFieldsBenchmark-results.txt".
+ * }}}
+ */
+object UpdateFieldsBenchmark extends SqlBasedBenchmark {
+
+  private def nestedColName(d: Int, colNum: Int): String = 
s"nested${d}Col$colNum"
+
+  private def nestedStructType(
+  colNums: Seq[Int],
+  nullable: Boolean,
+  maxDepth: Int,
+  currDepth: Int = 1): StructType = {
+
+if (currDepth == maxDepth) {
+  val fields = colNums.map { colNum =>
+val name = nestedColName(currDepth, colNum)
+StructField(name, IntegerType, nullable = false)
+  }
+  StructType(fields)
+} else {
+  val fields = colNums.foldLeft(Seq.empty[StructField]) {
+case (structFields, colNum) if colNum == 0 =>
+  val nested = nestedStructType(colNums, nullable, maxDepth, currDepth + 1)
+  structFields :+ StructField(nestedColName(currDepth, colNum), nested, nullable)
+case (structFields, colNum) =>
+  val name = nestedColName(currDepth, colNum)
+  structFields :+ StructField(name, IntegerType, nullable = false)
+  }
+  StructType(fields)
+}
+  }
+
+  private def nestedRow(colNums: Seq[Int], maxDepth: Int, currDepth: Int = 1): 
Row = {
+if (currDepth == maxDepth) {
+  Row.fromSeq(colNums)
+} else {
+  val values = colNums.foldLeft(Seq.empty[Any]) {
+case (values, colNum) if colNum == 0 =>
+  values :+ nestedRow(colNums, maxDepth, currDepth + 1)
+case (values, colNum) =>
+  values :+ colNum
+  }
+  

[GitHub] [spark] srowen commented on pull request #29857: [SPARK-32972][ML] Fix UTs of `mllib` module in Scala 2.13 except RandomForestRegressorSuite

2020-09-25 Thread GitBox


srowen commented on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698479253










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29833: [SPARK-32886][SPARK-31882][WEBUI][2.4] fix 'undefined' link in event timeline view

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29833:
URL: https://github.com/apache/spark/pull/29833#issuecomment-698402321










[GitHub] [spark] srowen commented on pull request #29833: [SPARK-32886][SPARK-31882][WEBUI][2.4] fix 'undefined' link in event timeline view

2020-09-25 Thread GitBox


srowen commented on pull request #29833:
URL: https://github.com/apache/spark/pull/29833#issuecomment-698397395


   Jenkins retest this please






[GitHub] [spark] mridulm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-09-25 Thread GitBox


mridulm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r494003389



##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();
+  return 4 + b.serializedSizeInBytes();
+}
+
+public static void encode(ByteBuf buf, RoaringBitmap b) {
+  ByteBuffer outBuffer = ByteBuffer.allocate(b.serializedSizeInBytes());
+  try {
+b.serialize(new DataOutputStream(new OutputStream() {
+  ByteBuffer buffer;
+
+  OutputStream init(ByteBuffer buffer) {
+this.buffer = buffer;
+return this;
+  }
+
+  @Override
+  public void close() {
+  }
+
+  @Override
+  public void flush() {
+  }
+
+  @Override
+  public void write(int b) {
+buffer.put((byte) b);
+  }
+
+  @Override
+  public void write(byte[] b) {
+buffer.put(b);
+  }
+
+  @Override
+  public void write(byte[] b, int off, int l) {
+buffer.put(b, off, l);
+  }
+}.init(outBuffer)));
+  } catch (IOException e) {
+throw new RuntimeException("Exception while encoding bitmap", e);
+  }

Review comment:
   Replace this with something more concise - for example see 
`UnsafeShuffleWriter.MyByteArrayOutputStream`.
   To illustrate, something like:
   ```
   MyBaos out = new MyBaos(b.serializedSizeInBytes());
   b.serialize(new DataOutputStream(out));
   int size = out.size();
   buf.writeInt(size);
   buf.writeBytes(out.getBuf(), 0, size);
   ```
   
   The last part could also be moved as `ByteArrays.encode(byte[] arr, int 
offset, int len)`
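
A minimal, dependency-free sketch of the length-prefixed encoding suggested above. The names `LengthPrefixedEncodingSketch` and `ExposedBaos` are hypothetical, a plain `int[]` stands in for the `RoaringBitmap` payload, and `java.nio.ByteBuffer` stands in for Netty's `ByteBuf`; the point is only that a `ByteArrayOutputStream` subclass exposing its backing array removes the need for the anonymous `OutputStream`.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for the Bitmaps encoder under review.
final class LengthPrefixedEncodingSketch {

  /** ByteArrayOutputStream that exposes its backing array to avoid a copy. */
  static final class ExposedBaos extends ByteArrayOutputStream {
    ExposedBaos(int size) { super(size); }
    byte[] getBuf() { return buf; }
  }

  /** Writes a 4-byte length followed by the serialized payload into target. */
  static void encode(ByteBuffer target, int[] values) {
    ExposedBaos out = new ExposedBaos(4 * values.length);
    try (DataOutputStream data = new DataOutputStream(out)) {
      for (int v : values) {
        data.writeInt(v); // big-endian, matching ByteBuffer's default order
      }
    } catch (IOException e) {
      throw new UncheckedIOException("Exception while encoding payload", e);
    }
    int size = out.size();
    target.putInt(size);
    // Only the first `size` bytes of the backing array are valid.
    target.put(out.getBuf(), 0, size);
  }

  /** Reads the length prefix, then that many payload bytes. */
  static int[] decode(ByteBuffer source) {
    int size = source.getInt();
    int[] values = new int[size / 4];
    for (int i = 0; i < values.length; i++) {
      values[i] = source.getInt();
    }
    return values;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(64);
    encode(buf, new int[] {1, 2, 3});
    buf.flip();
    System.out.println(Arrays.toString(decode(buf)));
  }
}
```

The final two writes in `encode` are the part that could move into a shared helper taking a byte array, offset and length.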

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/protocol/Encoders.java
##
@@ -44,6 +51,71 @@ public static String decode(ByteBuf buf) {
 }
   }
 
+  /** Bitmaps are encoded with their serialization length followed by the 
serialization bytes. */
+  public static class Bitmaps {
+public static int encodedLength(RoaringBitmap b) {
+  // Compress the bitmap before serializing it
+  b.trim();
+  b.runOptimize();

Review comment:
   `BitmapArrays` results in calling `trim` and `runOptimize` twice - 
refactor so that it is only done once for this codepath ?

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java
##
@@ -209,12 +225,17 @@ public void onData(String streamId, ByteBuffer buf) 
throws IOException {
 public void onComplete(String streamId) throws IOException {
try {
  streamHandler.onComplete(streamId);
- callback.onSuccess(ByteBuffer.allocate(0));
+ callback.onSuccess(meta.duplicate());

Review comment:
   Can you add a comment on why we are making this change, from sending an 
empty buffer to sending meta?

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java
##
@@ -181,6 +182,17 @@ public void onFailure(Throwable e) {
   private void processStreamUpload(final UploadStream req) {
 assert (req.body() == null);
 try {
+  // Retain the original metadata buffer, since it will be used during the 
invocation of
+  // this method. Will be released later.
+  req.meta.retain();
+  // Make a copy of the original metadata buffer. In benchmark, we noticed 
that
+  // we cannot respond the original metadata buffer back to the client, 
otherwise
+  // in cases where multiple concurrent shuffles are present, a wrong 
metadata might
+  // be sent back to client. This is related to the eager release of the 
metadata buffer,
+  // i.e., we always release the original buffer by the time the 
invocation of this
+  // method ends, instead of by the time we respond it to the client. This 
is necessary,
+  // otherwise we start seeing memory issues very quickly in benchmarks.
+  ByteBuffer meta = cloneBuffer(req.meta.nioByteBuffer());

Review comment:
   Since we are always making a copy of meta here, can we remove the 
`retain` + `release` below and instead always release it here and only rely on 
the cloned buffer within this method?
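
To illustrate why a deep copy makes the extra reference count unnecessary: the quoted hunk does not show the body of `cloneBuffer`, so the version below is an assumption about its behaviour, written against plain `java.nio.ByteBuffer`. Once the copy exists, nothing in the method needs the original storage, so the original could be released immediately.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; the real helper in the PR may differ.
final class CloneBufferSketch {

  /** Deep-copies the remaining bytes of src without moving src's position. */
  static ByteBuffer cloneBuffer(ByteBuffer src) {
    ByteBuffer copy = ByteBuffer.allocate(src.remaining());
    // duplicate() shares content but has its own position and limit,
    // so reading through it leaves src untouched.
    copy.put(src.duplicate());
    copy.flip();
    return copy;
  }

  public static void main(String[] args) {
    ByteBuffer original = ByteBuffer.wrap(new byte[] {1, 2, 3});
    ByteBuffer copy = cloneBuffer(original);
    original.put(0, (byte) 9); // simulates the original being reused after release
    System.out.println(copy.get(0));
  }
}
```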

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockPusher.java
##
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file 

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29828: [SPARK-32948][SQL] Optimize to_json and from_json expression chain

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29828:
URL: https://github.com/apache/spark/pull/29828#issuecomment-698089064










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29869: [WIP][SPARK-32994][CORE] Update external accumulators before they entering into Spark listener event loop

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29869:
URL: https://github.com/apache/spark/pull/29869#issuecomment-698726138










[GitHub] [spark] HyukjinKwon commented on pull request #29591: [SPARK-32714][PYTHON] Initial pyspark-stubs port.

2020-09-25 Thread GitBox


HyukjinKwon commented on pull request #29591:
URL: https://github.com/apache/spark/pull/29591#issuecomment-698117104










[GitHub] [spark] zhengruifeng commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

2020-09-25 Thread GitBox


zhengruifeng commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-698162567


   ping @huaxingao 






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29857: [SPARK-32972][ML] Fix UTs of `mllib` module in Scala 2.13 except RandomForestRegressorSuite

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698087732










[GitHub] [spark] nssalian commented on pull request #29844: [SPARK-27872][K8s][2.4] Fix executor service account inconsistency

2020-09-25 Thread GitBox


nssalian commented on pull request #29844:
URL: https://github.com/apache/spark/pull/29844#issuecomment-698610691










[GitHub] [spark] srowen commented on a change in pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

2020-09-25 Thread GitBox


srowen commented on a change in pull request #29852:
URL: https://github.com/apache/spark/pull/29852#discussion_r494390254



##
File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
##
@@ -91,20 +90,13 @@ class HashingTF @Since("3.0.0") private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val localNumFeatures = $(numFeatures)
-val localBinary = $(binary)
+val n = $(numFeatures)
+val updateFunc = if ($(binary)) (v: Double) => 1.0 else (v: Double) => v + 
1.0
 
 val hashUDF = udf { terms: Seq[_] =>
-  val termFrequencies = mutable.HashMap.empty[Int, 
Double].withDefaultValue(0.0)
-  terms.foreach { term =>
-val i = indexOf(term)
-if (localBinary) {
-  termFrequencies(i) = 1.0
-} else {
-  termFrequencies(i) += 1.0
-}
-  }
-  Vectors.sparse(localNumFeatures, termFrequencies.toSeq)
+  val map = new OpenHashMap[Int, Double]()

Review comment:
   This seems fine, but is it faster than Scala's Map? The comment refers to 
the Java HashMap.
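
For reference, the counting logic in the hunk is independent of which map backs it. A rough Java sketch with `java.util.HashMap` follows; `TermFrequencySketch` is a hypothetical name, and `Math.floorMod(term.hashCode(), numFeatures)` is only a stand-in for `indexOf`, not the hash `HashingTF` actually uses. Any speed comparison would need a benchmark of this loop with each map implementation swapped in.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the term-frequency update, not Spark's implementation.
final class TermFrequencySketch {

  static Map<Integer, Double> termFrequencies(
      List<String> terms, int numFeatures, boolean binary) {
    Map<Integer, Double> counts = new HashMap<>();
    for (String term : terms) {
      // Stand-in for indexOf(term): map the term to a bucket in [0, numFeatures).
      int index = Math.floorMod(term.hashCode(), numFeatures);
      if (binary) {
        counts.put(index, 1.0);                 // presence only
      } else {
        counts.merge(index, 1.0, Double::sum);  // running count
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(termFrequencies(List.of("a", "b", "a"), 1000, false));
  }
}
```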








[GitHub] [spark] AmplabJenkins commented on pull request #29860: [SPARK-32984][TESTS][SQL] Improve showing the differences between approved and actual plans of PlanStabilitySuite

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29860:
URL: https://github.com/apache/spark/pull/29860#issuecomment-698090732










[GitHub] [spark] cloud-fan commented on pull request #29756: [SPARK-32885][SS] Add DataStreamReader.table API

2020-09-25 Thread GitBox


cloud-fan commented on pull request #29756:
URL: https://github.com/apache/spark/pull/29756#issuecomment-698755963


   thanks, merging to master!






[GitHub] [spark] HyukjinKwon commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-25 Thread GitBox


HyukjinKwon commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-698110090










[GitHub] [spark] fqaiser94 commented on a change in pull request #29795: [SPARK-32511][SQL] Add dropFields method to Column class

2020-09-25 Thread GitBox


fqaiser94 commented on a change in pull request #29795:
URL: https://github.com/apache/spark/pull/29795#discussion_r494698625



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
##
@@ -541,57 +541,105 @@ case class StringToMap(text: Expression, pairDelim: 
Expression, keyValueDelim: E
 }
 
 /**
- * Adds/replaces field in struct by name.
+ * Represents an operation to be applied to the fields of a struct.
  */
-case class WithFields(
-structExpr: Expression,
-names: Seq[String],
-valExprs: Seq[Expression]) extends Unevaluable {
+trait StructFieldsOperation {
 
-  assert(names.length == valExprs.length)
+  val resolver: Resolver = SQLConf.get.resolver
+
+  /**
+   * Returns an updated list of StructFields and Expressions that will 
ultimately be used
+   * as the fields argument for [[StructType]] and as the children argument for
+   * [[CreateNamedStruct]] respectively inside of [[UpdateFields]].
+   */
+  def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, 
Expression)]
+}
+
+/**
+ * Add or replace a field by name.
+ *
+ * We extend [[Unevaluable]] here to ensure that [[UpdateFields]] can include 
it as part of its
+ * children, and thereby enable the analyzer to resolve and transform valExpr 
as necessary.
+ */
+case class WithField(name: String, valExpr: Expression)
+  extends Unevaluable with StructFieldsOperation {
+
+  override def apply(values: Seq[(StructField, Expression)]): 
Seq[(StructField, Expression)] = {
+val newFieldExpr = (StructField(name, valExpr.dataType, valExpr.nullable), 
valExpr)
+if (values.exists { case (field, _) => resolver(field.name, name) }) {
+  values.map {
+case (field, _) if resolver(field.name, name) => newFieldExpr
+case x => x
+  }
+} else {
+  values :+ newFieldExpr
+}
+  }
+
+  override def children: Seq[Expression] = valExpr :: Nil
+
+  override def dataType: DataType = throw new UnresolvedException(this, 
"dataType")

Review comment:
   done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
##
@@ -541,57 +541,105 @@ case class StringToMap(text: Expression, pairDelim: 
Expression, keyValueDelim: E
 }
 
 /**
- * Adds/replaces field in struct by name.
+ * Represents an operation to be applied to the fields of a struct.
  */
-case class WithFields(
-structExpr: Expression,
-names: Seq[String],
-valExprs: Seq[Expression]) extends Unevaluable {
+trait StructFieldsOperation {
 
-  assert(names.length == valExprs.length)
+  val resolver: Resolver = SQLConf.get.resolver
+
+  /**
+   * Returns an updated list of StructFields and Expressions that will 
ultimately be used
+   * as the fields argument for [[StructType]] and as the children argument for
+   * [[CreateNamedStruct]] respectively inside of [[UpdateFields]].
+   */
+  def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, 
Expression)]
+}
+
+/**
+ * Add or replace a field by name.
+ *
+ * We extend [[Unevaluable]] here to ensure that [[UpdateFields]] can include 
it as part of its
+ * children, and thereby enable the analyzer to resolve and transform valExpr 
as necessary.
+ */
+case class WithField(name: String, valExpr: Expression)
+  extends Unevaluable with StructFieldsOperation {
+
+  override def apply(values: Seq[(StructField, Expression)]): 
Seq[(StructField, Expression)] = {
+val newFieldExpr = (StructField(name, valExpr.dataType, valExpr.nullable), 
valExpr)
+if (values.exists { case (field, _) => resolver(field.name, name) }) {

Review comment:
   thanks for sharing the code, done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
##
@@ -541,57 +541,105 @@ case class StringToMap(text: Expression, pairDelim: 
Expression, keyValueDelim: E
 }
 
 /**
- * Adds/replaces field in struct by name.
+ * Represents an operation to be applied to the fields of a struct.
  */
-case class WithFields(
-structExpr: Expression,
-names: Seq[String],
-valExprs: Seq[Expression]) extends Unevaluable {
+trait StructFieldsOperation {
 
-  assert(names.length == valExprs.length)
+  val resolver: Resolver = SQLConf.get.resolver
+
+  /**
+   * Returns an updated list of StructFields and Expressions that will 
ultimately be used
+   * as the fields argument for [[StructType]] and as the children argument for
+   * [[CreateNamedStruct]] respectively inside of [[UpdateFields]].
+   */
+  def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, 
Expression)]
+}
+
+/**
+ * Add or replace a field by name.
+ *
+ * We extend [[Unevaluable]] here to ensure that [[UpdateFields]] can include 
it as part of its
+ * children, and thereby enable the analyzer to resolve and transform valExpr 
as necessary.
+ */
+case class WithField(name: String, valExpr: Expression)
+  extends 

[GitHub] [spark] cloud-fan commented on a change in pull request #29860: [SPARK-32984][TESTS][SQL] Improve showing the differences between approved and actual plans of PlanStabilitySuite

2020-09-25 Thread GitBox


cloud-fan commented on a change in pull request #29860:
URL: https://github.com/apache/spark/pull/29860#discussion_r494832113



##
File path: sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala
##
@@ -153,23 +154,93 @@ trait PlanStabilitySuite extends TPCDSBase with 
DisableAdaptiveExecutionSuite {
   // write out for debugging
   FileUtils.writeStringToFile(actualSimplifiedFile, actualSimplified, 
StandardCharsets.UTF_8)
   FileUtils.writeStringToFile(actualExplainFile, explain, 
StandardCharsets.UTF_8)
+  val (approvedSimplifiedWithHint, actualSimplifiedWithHint) =
+addDiffHint(approvedSimplified, actualSimplified)
 
   fail(
 s"""
   |Plans did not match:
   |last approved simplified plan: 
${approvedSimplifiedFile.getAbsolutePath}
   |last approved explain plan: ${approvedExplainFile.getAbsolutePath}
   |
-  |$approvedSimplified
+  |$approvedSimplifiedWithHint
   |
   |actual simplified plan: ${actualSimplifiedFile.getAbsolutePath}
   |actual explain plan: ${actualExplainFile.getAbsolutePath}
   |
-  |$actualSimplified
+  |$actualSimplifiedWithHint
 """.stripMargin)
 }
   }
 
+  /**
+   * Add the hint to the simplified plans where they first become different.
+   */
+  private def addDiffHint(approvedSimplified: String, actualSimplified: String)
+: (String, String) = {
+// reverse the plan so we can compare the node from the bottom to top

Review comment:
   One hard problem is how to match the lines from both sides. It's 
possible that the left side has one more node in the middle, so simply matching 
lines bottom-up may not work. It's like git diff: we should do the match w.r.t. 
the content, which can be very complicated.
   
   Maybe we should just recommend some online text diff tools in the comment 
and ask people to use them.
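
To make the matching problem concrete, here is a small content-based line diff built on the classic longest-common-subsequence table. It is a sketch only: `LineDiffSketch` is a hypothetical name, this is not what the PR implements, and real diff tools use more refined algorithms. It does show how an extra node in the middle of one plan is reported as a single removed line rather than shifting every later comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical LCS-based line diff; quadratic in time and memory.
final class LineDiffSketch {

  /** Prefixes: "  " common line, "- " only in left, "+ " only in right. */
  static List<String> diff(List<String> left, List<String> right) {
    int n = left.size();
    int m = right.size();
    // lcs[i][j] = LCS length of left[i..] and right[j..]
    int[][] lcs = new int[n + 1][m + 1];
    for (int i = n - 1; i >= 0; i--) {
      for (int j = m - 1; j >= 0; j--) {
        lcs[i][j] = left.get(i).equals(right.get(j))
            ? lcs[i + 1][j + 1] + 1
            : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
      }
    }
    List<String> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < n && j < m) {
      if (left.get(i).equals(right.get(j))) {
        out.add("  " + left.get(i));
        i++;
        j++;
      } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
        out.add("- " + left.get(i)); // dropping the left line keeps the LCS
        i++;
      } else {
        out.add("+ " + right.get(j));
        j++;
      }
    }
    while (i < n) out.add("- " + left.get(i++));
    while (j < m) out.add("+ " + right.get(j++));
    return out;
  }

  public static void main(String[] args) {
    List<String> approved = List.of("Project", "Filter", "Scan");
    List<String> actual = List.of("Project", "Scan");
    diff(approved, actual).forEach(System.out::println);
  }
}
```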








[GitHub] [spark] gatorsmile commented on a change in pull request #29056: [SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs

2020-09-25 Thread GitBox


gatorsmile commented on a change in pull request #29056:
URL: https://github.com/apache/spark/pull/29056#discussion_r494808052



##
File path: docs/sql-ref-syntax-ddl-create-table-hiveformat.md
##
@@ -36,6 +36,14 @@ CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
 [ LOCATION path ]

Review comment:
   The bucketSpec is still missing in CREATE HIVE FORMAT table, right?
   ```
   [ CLUSTERED BY ( col_name3, col_name4, ... ) 
   [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] 
   INTO num_buckets BUCKETS ]
   ```

##
File path: docs/sql-ref-syntax-ddl-create-table-hiveformat.md
##
@@ -36,6 +36,14 @@ CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
 [ LOCATION path ]

Review comment:
   Any reason we did not add it? @huaxingao @GuoPhilipse 








[GitHub] [spark] SparkQA commented on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-25 Thread GitBox


SparkQA commented on pull request #29797:
URL: https://github.com/apache/spark/pull/29797#issuecomment-698906945










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29797:
URL: https://github.com/apache/spark/pull/29797#issuecomment-698907651










[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API

2020-09-25 Thread GitBox


gaborgsomogyi commented on a change in pull request #29024:
URL: https://github.com/apache/spark/pull/29024#discussion_r494903467



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
##
@@ -23,12 +23,15 @@ import java.util.{Locale, Properties}
 import org.apache.commons.io.FilenameUtils
 
 import org.apache.spark.SparkFiles
+import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
 
 /**
+ * ::DeveloperApi::
  * Options for the JDBC data source.
  */
+@DeveloperApi

Review comment:
   @HyukjinKwon thanks for having a look!
   
   I agree that `JDBCOptions` mustn't be exposed. Let me change the code to 
show `option 1`. As said, passing only `keytab: String, principal: String` is 
not enough because some of the providers need further 
configuration. I've started to work on this change (unless anybody has a 
better option).








[GitHub] [spark] viirya commented on pull request #29831: [SPARK-32351][SQL] Show partially pushed down partition filters in explain()

2020-09-25 Thread GitBox


viirya commented on pull request #29831:
URL: https://github.com/apache/spark/pull/29831#issuecomment-699265145


   cc @maropu too






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid argumen

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699272519


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid argumen

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699272523


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129127/
   Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments numbe

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699272519










[GitHub] [spark] SparkQA removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments num

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699231076


   **[Test build #129127 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129127/testReport)**
 for PR 29054 at commit 
[`766c931`](https://github.com/apache/spark/commit/766c931975821781b91e49013caa3c39a35f2cb2).






[GitHub] [spark] SparkQA commented on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number erro

2020-09-25 Thread GitBox


SparkQA commented on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699272432


   **[Test build #129127 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129127/testReport)**
 for PR 29054 at commit 
[`766c931`](https://github.com/apache/spark/commit/766c931975821781b91e49013caa3c39a35f2cb2).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number erro

2020-09-25 Thread GitBox


SparkQA commented on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699287405


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33746/
   






[GitHub] [spark] SparkQA commented on pull request #29872: [SPARK-32996][Web-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer

2020-09-25 Thread GitBox


SparkQA commented on pull request #29872:
URL: https://github.com/apache/spark/pull/29872#issuecomment-699266269


   **[Test build #129125 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129125/testReport)**
 for PR 29872 at commit 
[`c27a699`](https://github.com/apache/spark/commit/c27a6994be6f580e331d49aeedfab2ca4c427e30).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #29872: [SPARK-32996][Web-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29872:
URL: https://github.com/apache/spark/pull/29872#issuecomment-699197271


   **[Test build #129125 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129125/testReport)**
 for PR 29872 at commit 
[`c27a699`](https://github.com/apache/spark/commit/c27a6994be6f580e331d49aeedfab2ca4c427e30).






[GitHub] [spark] SparkQA removed a comment on pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29855:
URL: https://github.com/apache/spark/pull/29855#issuecomment-699207624


   **[Test build #129126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129126/testReport)** for PR 29855 at commit [`85b0de8`](https://github.com/apache/spark/commit/85b0de8f48c8f998a41e794cf0a32c8bea35f237).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29855:
URL: https://github.com/apache/spark/pull/29855#issuecomment-699271839










[GitHub] [spark] AmplabJenkins commented on pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29855:
URL: https://github.com/apache/spark/pull/29855#issuecomment-699271839










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid argumen

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699257263


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33745/
   Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments numbe

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699257257










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid argumen

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699257257


   Merged build finished. Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699168581


   **[Test build #129124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129124/testReport)** for PR 29875 at commit [`3f14f68`](https://github.com/apache/spark/commit/3f14f6842e04342297ac671bf9791a21ff7ec258).






[GitHub] [spark] SparkQA commented on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


SparkQA commented on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699274191


   **[Test build #129124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129124/testReport)** for PR 29875 at commit [`3f14f68`](https://github.com/apache/spark/commit/3f14f6842e04342297ac671bf9791a21ff7ec258).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


SparkQA commented on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699296378


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33747/






[GitHub] [spark] HeartSaVioR commented on pull request #29729: [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API

2020-09-25 Thread GitBox


HeartSaVioR commented on pull request #29729:
URL: https://github.com/apache/spark/pull/29729#issuecomment-699269552


   Worth noting that this issue does not occur only in theory: I've seen it 
multiple times in community reports, from customers, etc. We would probably do 
well to document the change from a security viewpoint (in the release notes as 
well?) to notify end users, but I hope the change in security requirements 
doesn't block resolving a "real world" issue.






[GitHub] [spark] AmplabJenkins commented on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699274719










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699274719










[GitHub] [spark] SparkQA commented on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number erro

2020-09-25 Thread GitBox


SparkQA commented on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699281169


   **[Test build #129130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129130/testReport)** for PR 29054 at commit [`95cfebe`](https://github.com/apache/spark/commit/95cfebeff7b0eb2b696e9882d8040ff635aeb68b).






[GitHub] [spark] SparkQA commented on pull request #29875: [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode

2020-09-25 Thread GitBox


SparkQA commented on pull request #29875:
URL: https://github.com/apache/spark/pull/29875#issuecomment-699281142


   **[Test build #129129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129129/testReport)** for PR 29875 at commit [`d7aeded`](https://github.com/apache/spark/commit/d7aeded2141a45ac770fb2926a3ed1ef55420fec).






[GitHub] [spark] SparkQA commented on pull request #29872: [SPARK-32996][Web-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer

2020-09-25 Thread GitBox


SparkQA commented on pull request #29872:
URL: https://github.com/apache/spark/pull/29872#issuecomment-699402597


   **[Test build #129131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129131/testReport)** for PR 29872 at commit [`2967673`](https://github.com/apache/spark/commit/29676739bbb2ef6db17cd170da7fb1ed24ffa769).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29054: [SPARK-32243][SQL]HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid argumen

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29054:
URL: https://github.com/apache/spark/pull/29054#issuecomment-699400125


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129128/
   Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #29872: [SPARK-32996][Web-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer

2020-09-25 Thread GitBox


AmplabJenkins commented on pull request #29872:
URL: https://github.com/apache/spark/pull/29872#issuecomment-699266831










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29872: [SPARK-32996][Web-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29872:
URL: https://github.com/apache/spark/pull/29872#issuecomment-699266831










[GitHub] [spark] cloud-fan closed pull request #29798: [SPARK-32931][SQL] Unevaluable Expressions are not Foldable

2020-09-25 Thread GitBox


cloud-fan closed pull request #29798:
URL: https://github.com/apache/spark/pull/29798


   






[GitHub] [spark] fhoering commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-25 Thread GitBox


fhoering commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-698877710


   It would indeed be nice to have K8s here, but I have never deployed to K8s. 
So I will only make the small changes from above and let you open another JIRA 
ticket for someone else to write about K8s.






[GitHub] [spark] SparkQA removed a comment on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29797:
URL: https://github.com/apache/spark/pull/29797#issuecomment-698906945


   **[Test build #129110 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129110/testReport)** for PR 29797 at commit [`7e0d766`](https://github.com/apache/spark/commit/7e0d766b424cdcac27f4bb3b08e325886daf92b2).






[GitHub] [spark] SparkQA commented on pull request #29857: [SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13

2020-09-25 Thread GitBox


SparkQA commented on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698928101










[GitHub] [spark] wangyum commented on a change in pull request #29790: [SPARK-32914][SQL] Avoid calling dataType multiple times for each expression

2020-09-25 Thread GitBox


wangyum commented on a change in pull request #29790:
URL: https://github.com/apache/spark/pull/29790#discussion_r494810795



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##
@@ -3498,13 +3500,15 @@ object ArrayUnion {
   since = "2.4.0")
 case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBinaryLike
   with ComplexTypeMergingExpression {
-  override def dataType: DataType = {
-    dataTypeCheck

Review comment:
   Do you mean add it back?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29857: [SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13

2020-09-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698937917










[GitHub] [spark] SparkQA removed a comment on pull request #29857: [SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13

2020-09-25 Thread GitBox


SparkQA removed a comment on pull request #29857:
URL: https://github.com/apache/spark/pull/29857#issuecomment-698906903


   **[Test build #129109 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129109/testReport)** for PR 29857 at commit [`f2a26c5`](https://github.com/apache/spark/commit/f2a26c571b37b6f8c3ad169c27e73a38a67160f2).





