[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...

2018-11-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
LGTM


---




[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...

2018-11-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22754
  
Thanks! merging to master


---




spark git commit: [SPARK-25776][CORE]The disk write buffer size must be greater than 12

2018-11-04 Thread kiszk
Repository: spark
Updated Branches:
  refs/heads/master 463a67668 -> 6c9e5ac9d


[SPARK-25776][CORE]The disk write buffer size must be greater than 12

## What changes were proposed in this pull request?

 In `UnsafeSorterSpillWriter.java`, when we write a record to a spill file with
`void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)`,
`recordLength` and `keyPrefix` are written to the disk write buffer first. Together
they take 12 bytes, so the disk write buffer size must be greater than 12.
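
As a minimal sketch of why 12 bytes is the lower bound (it uses `java.nio.ByteBuffer` for illustration only; this is not the actual `UnsafeSorterSpillWriter` code):

```
import java.nio.ByteBuffer

// The spill writer first stores the 4-byte record length and the 8-byte key
// prefix in its write buffer, so the buffer must hold at least 4 + 8 = 12 bytes.
val buffer = ByteBuffer.allocate(16)
buffer.putInt(42)   // recordLength: 4 bytes
buffer.putLong(0L)  // keyPrefix:    8 bytes
assert(buffer.position() == 12)

// With a 10-byte buffer the same two writes overflow, which is analogous to
// the ArrayIndexOutOfBoundsException shown below.
val tooSmall = ByteBuffer.allocate(10)
tooSmall.putInt(42)
// tooSmall.putLong(0L)  // would throw java.nio.BufferOverflowException
```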

 If `diskWriteBufferSize` is 10, the following exception is thrown:

java.lang.ArrayIndexOutOfBoundsException: 10
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer(UnsafeSorterSpillWriter.java:91)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)
    at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)

## How was this patch tested?
Existing UT in `UnsafeExternalSorterSuite`

Closes #22754 from 10110346/diskWriteBufferSize.

Authored-by: liuxian 
Signed-off-by: Kazuaki Ishizaki 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c9e5ac9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c9e5ac9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c9e5ac9

Branch: refs/heads/master
Commit: 6c9e5ac9de3d0ae5ea86b768608b42b5feb46df4
Parents: 463a676
Author: liuxian 
Authored: Mon Nov 5 01:55:13 2018 +0900
Committer: Kazuaki Ishizaki 
Committed: Mon Nov 5 01:55:13 2018 +0900

--
 .../util/collection/unsafe/sort/UnsafeSorterSpillWriter.java   | 5 -
 .../main/scala/org/apache/spark/internal/config/package.scala  | 6 --
 2 files changed, 8 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6c9e5ac9/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
--
diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
index 9399024..c1d71a2 100644
--- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
+++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
@@ -42,7 +42,10 @@ public final class UnsafeSorterSpillWriter {
 
   private final SparkConf conf = new SparkConf();
 
-  /** The buffer size to use when writing the sorted records to an on-disk file */
+  /**
+   * The buffer size to use when writing the sorted records to an on-disk file, and
+   * this space used by prefix + len + recordLength must be greater than 4 + 8 bytes.
+   */
   private final int diskWriteBufferSize =
 (int) (long) conf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6c9e5ac9/core/src/main/scala/org/apache/spark/internal/config/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index 034e5eb..c8993e1 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -21,6 +21,7 @@ import java.util.concurrent.TimeUnit
 
 import org.apache.spark.launcher.SparkLauncher
 import org.apache.spark.network.util.ByteUnit
+import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.util.Utils
 
 package object config {
@@ -504,8 +505,9 @@ package object config {
 ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
   .doc("The buffer size, in bytes, to use when writing the sorted records to an on-disk file.")
   .bytesConf(ByteUnit.BYTE)
-  .checkValue(v => v > 0 && v <= Int.MaxValue,
-s"The buffer size must be greater than 0 and less than ${Int.MaxValue}.")
+  .checkValue(v => v > 12 && v <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
+s"The buffer size must be greater than 12 and less than or equal to " +
+  s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
   .createWithDefault(1024 * 1024)
 
   private[spark] val UNROLL_MEMORY_CHECK_PERIOD =
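
As a usage illustration (a sketch, not part of the commit): the buffer size is supplied through `spark.shuffle.spill.diskWriteBufferSize`, and after this change any value of 12 bytes or less, or above `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`, fails the `checkValue` above.

```
import org.apache.spark.SparkConf

// Sketch only: a value comfortably above the 12-byte minimum (64 KiB).
val conf = new SparkConf()
  .setAppName("disk-write-buffer-demo")
  .set("spark.shuffle.spill.diskWriteBufferSize", "65536")
```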



[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...

2018-11-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
retest this please


---




[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...

2018-11-03 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
retest this please


---




[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...

2018-11-03 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22754
  
retest this please


---




[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...

2018-11-03 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22754
  
LGTM


---




[GitHub] spark issue #22898: [SPARK-25746][SQL][followup] do not add unnecessary If e...

2018-10-31 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22898
  
retest this please


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229577559
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,18 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source-code splitting in the codegen. When the 
number of characters " +
+  "in a single JAVA function (without comment) exceeds the threshold, 
the function will be " +
--- End diff --

nit: `JAVA` -> `Java`


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229577345
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

1000 is conservative. But there is no single recommendation, since the bytecode 
size depends on the content (e.g. loading the constant `0` compiles to a 1-byte 
`iconst_0`, while `9` needs a 2-byte `bipush`).
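
To make this concrete, a hypothetical sketch (it assumes the configuration lands under the name proposed in this PR and that `spark` is an active `SparkSession`):

```
// Hypothetical sketch: lower the split threshold so generated Java functions
// are split into smaller methods sooner. The value is a source-code length in
// characters, not a bytecode size, because the final bytecode size depends on
// the code content and cannot be known in advance.
spark.conf.set("spark.sql.codegen.methodSplitThreshold", "1024")
```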


---




[GitHub] spark issue #22891: SPARK-25881

2018-10-30 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22891
  
Thank you for your contribution.
Could you please write an appropriate title and description based on 
http://spark.apache.org/contributing.html ?


---




[GitHub] spark pull request #22881: [SPARK-25855][CORE] Don't use erasure coding for ...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22881#discussion_r229155491
  
--- Diff: docs/configuration.md ---
@@ -761,6 +761,17 @@ Apart from these, the following properties are also 
available, and may be useful
 Compression will use spark.io.compression.codec.
   
 
+
+  spark.eventLog.allowErasureCoding
+  false
+  
+Whether to allow event logs to use erasure coding, or turn erasure 
coding off, regardless of
+filesystem defaults.  On HDFS, erasure coded files will not update as 
quickly as regular
+replicated files, so they application updates will take longer to 
appear in the History Server.
+Note that even if this is true, spark will still not force the file to 
erasure coding, it will
--- End diff --

nit: `to erasure coding` -> `to use erasure coding`?


---




[GitHub] spark pull request #22881: [SPARK-25855][CORE] Don't use erasure coding for ...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22881#discussion_r229154733
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -471,4 +473,42 @@ object SparkHadoopUtil {
   hadoopConf.set(key.substring("spark.hadoop.".length), value)
 }
   }
+
+
+  lazy val builderReflection: Option[(Class[_], Method, Method)] = Try {
+val cls = Utils.classForName(
+  
"org.apache.hadoop.hdfs.DistributedFileSystem$HdfsDataOutputStreamBuilder")
+(cls, cls.getMethod("replicate"), cls.getMethod("build"))
+  }.toOption
+
+  // scalastyle:off line.size.limit
+  /**
+   * Create a path that uses replication instead of erasure coding, 
regardless of the default
+   * configuration in hdfs for the given path.  This can be helpful as 
hdfs ec doesn't support
--- End diff --

nit: `ec` -> `erasure coding`


---




[GitHub] spark pull request #22877: [MINOR][SQL] Avoid hardcoded configuration keys i...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22877#discussion_r229148363
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -408,15 +408,16 @@ object SQLConf {
 
   val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = 
buildConf("spark.sql.parquet.filterPushdown.date")
 .doc("If true, enables Parquet filter push-down optimization for Date. 
" +
-  "This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is enabled.")
+  s"This configuration only has an effect when 
'${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " +
--- End diff --

Got it, thanks


---




[GitHub] spark issue #22879: [SPARK-25872][SQL][TEST] Add an optimizer tracker for TP...

2018-10-29 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22879
  
cc @maropu


---




[GitHub] spark pull request #22877: [MINOR][SQL] Avoid hardcoded configuration keys i...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22877#discussion_r229034778
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -408,15 +408,16 @@ object SQLConf {
 
   val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = 
buildConf("spark.sql.parquet.filterPushdown.date")
 .doc("If true, enables Parquet filter push-down optimization for Date. 
" +
-  "This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is enabled.")
+  s"This configuration only has an effect when 
'${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " +
--- End diff --

nit: Can we apply the same policy to `spark.sql.parquet.compression.codec` 
at L397?


---




[GitHub] spark issue #22755: [SPARK-25755][SQL][Test] Supplementation of non-CodeGen ...

2018-10-28 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22755
  
Is it better to apply this util method to others (e.g. 
`DataFrameRangeSuite.scala` and `DataFrameAggregateSuite.scala`)?


---




[GitHub] spark issue #22818: [SPARK-25827][CORE] Allocate arrays smaller than Int.Max...

2018-10-28 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
Since this PR is not a blocker for 2.4, I think it would be good to 
address these issues where possible.


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-28 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r228738630
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
 ---
@@ -42,7 +42,9 @@
 
   private final SparkConf conf = new SparkConf();
 
-  /** The buffer size to use when writing the sorted records to an on-disk 
file */
+  /** The buffer size to use when writing the sorted records to an on-disk 
file, and
--- End diff --

nit: For a multi-line comment, the starting `/**` line should not contain any text.


---




[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2018-10-26 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19601
  
Sure, let me close this


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2018-10-26 Thread kiszk
Github user kiszk closed the pull request at:

https://github.com/apache/spark/pull/19601


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r228601036
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
 ---
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
   int fileBufferSize,
   ShuffleWriteMetrics writeMetrics,
   int numRecordsToWrite) throws IOException {
+// Space used by prefix + len + recordLength is more than 4 + 8 bytes
+assert (diskWriteBufferSize > 12);
--- End diff --

Where is the best place for this comment? I am neutral on this.


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r228598058
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The maximum source code length of a single Java function by 
codegen. When the " +
+  "generated Java function source code exceeds this threshold, it will 
be split into " +
+  "multiple small functions, each function length is 
spark.sql.codegen.methodSplitThreshold." +
--- End diff --

IMHO, `spark.sql.codegen.methodSplitThreshold` can be used, but the 
description should be changed. For example,
`The threshold of source code length without comment of a single Java 
function by codegen to be split. When the generated Java function source code 
exceeds this threshold, it will be split into multiple small functions. ...`


---




[GitHub] spark issue #22818: [SPARK-25827][CORE] Allocate arrays smaller than Int.Max...

2018-10-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
Thanks, would it also be possible to double-check `Integer.MAX_VALUE` if 
you have not already?


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-24 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227780781
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
 ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
   .doc("The buffer size, in bytes, to use when writing the sorted 
records to an on-disk file.")
   .bytesConf(ByteUnit.BYTE)
-  .checkValue(v => v > 0 && v <= Int.MaxValue,
-s"The buffer size must be greater than 0 and less than 
${Int.MaxValue}.")
+  .checkValue(v => v > 12 && v <= Int.MaxValue,
+s"The buffer size must be greater than 12 and less than 
${Int.MaxValue}.")
--- End diff --

How about using [this 
value](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L52)?
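
For reference, a small sketch of reading that bound from Scala (the constant name is taken from the linked file; its exact value is left to the source):

```
import org.apache.spark.unsafe.array.ByteArrayMethods

// Spark's conservative upper bound for array lengths; it sits slightly below
// Int.MaxValue because some JVMs cannot allocate arrays of exactly that size.
val upperBound: Int = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
println(s"max rounded array length: $upperBound")
```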


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-24 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227729436
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
 ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
   .doc("The buffer size, in bytes, to use when writing the sorted 
records to an on-disk file.")
   .bytesConf(ByteUnit.BYTE)
-  .checkValue(v => v > 0 && v <= Int.MaxValue,
-s"The buffer size must be greater than 0 and less than 
${Int.MaxValue}.")
+  .checkValue(v => v > 12 && v <= Int.MaxValue,
+s"The buffer size must be greater than 12 and less than 
${Int.MaxValue}.")
--- End diff --

Thanks, I imagine this is due to a recent issue in GitHub.


---




[GitHub] spark issue #22798: [SPARK-25803] Fix docker-image-tool.sh -n option

2018-10-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22798
  
Based on bash syntax, this change makes sense. I would like to wait for 
@vanzin 's comment.


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-23 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227363331
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
 ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
   .doc("The buffer size, in bytes, to use when writing the sorted 
records to an on-disk file.")
   .bytesConf(ByteUnit.BYTE)
-  .checkValue(v => v > 0 && v <= Int.MaxValue,
-s"The buffer size must be greater than 0 and less than 
${Int.MaxValue}.")
+  .checkValue(v => v > 12 && v <= Int.MaxValue,
+s"The buffer size must be greater than 12 and less than 
${Int.MaxValue}.")
--- End diff --

Sorry for bothering you. Can we handle this in this PR, too?


---




[GitHub] spark issue #22803: change jsr305 version from 1.3.9 to 3.0.0

2018-10-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22803
  
ok to test


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22784
  
Sorry for my mistake. The '4' key on my keyboard sometimes has trouble.
> I think, INT_MAX is 2147483647, so n ~= sqrt(2*2147483647) = 65536.
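
A quick worked check of that bound (a sketch; `n*(n+1)/2` is the array size quoted from the PR description):

```
// The upper-triangular array passed to Breeze has n*(n+1)/2 entries, which must
// not exceed Int.MaxValue (2147483647), so the largest n that fits is 65535.
def entries(n: Long): Long = n * (n + 1) / 2

assert(entries(65535L) <= Int.MaxValue.toLong)  // 2147450880 -> fits
assert(entries(65536L) > Int.MaxValue.toLong)   // 2147516416 -> too large
```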


---




[GitHub] spark issue #22800: [SPARK-24499][SQL][DOC][follow-up] Fix spelling in doc

2018-10-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22800
  
cc @cloud-fan @gatorsmile @HyukjinKwon @xuanyuanking


---




[GitHub] spark pull request #22800: [SPARK-24499][SQL][DOC][follow-up] Fix spelling i...

2018-10-22 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/22800

[SPARK-24499][SQL][DOC][follow-up] Fix spelling in doc

## What changes were proposed in this pull request?

This PR replaces `turing` with `tuning` in files and a file name. 
Currently, in the left side menu, `Turing` is shown.

![image](https://user-images.githubusercontent.com/1315079/47332714-20a96180-d6bb-11e8-9a5a-0a8dad292626.png)

## How was this patch tested?

`grep -rin turing docs` && `find docs -name "*turing*"`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-24499-follow

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22800.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22800


commit a13493b2e3ede38129de6d32ea19d6886cc13b80
Author: Kazuaki Ishizaki 
Date:   2018-10-23T02:56:18Z

turing -> tuning




---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22784
  
One question: after this PR, what is the maximum number of columns that we can accept?


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22784
  
Can I clarify the description?
> Because we are passing an array of size n*(n+1)/2 to the breeze library 
and the size cannot be more than INT_MAX. so, the maximum column size we can 
give is 65,500.

If n > 20726, `n*(n+1)/2` > 214783647 (= INT_MAX). Where does this limitation `65,500` come from?




---




[GitHub] spark issue #22789: [SPARK-25767][SQL] fix inputVars preparation if outputVa...

2018-10-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22789
  
Thank you for submitting a PR. Would it be possible to add a test case, too?


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...

2018-10-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r226872137
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
 ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
   .doc("The buffer size, in bytes, to use when writing the sorted 
records to an on-disk file.")
   .bytesConf(ByteUnit.BYTE)
-  .checkValue(v => v > 0 && v <= Int.MaxValue,
-s"The buffer size must be greater than 0 and less than 
${Int.MaxValue}.")
+  .checkValue(v => v > 12 && v <= Int.MaxValue,
+s"The buffer size must be greater than 12 and less than 
${Int.MaxValue}.")
--- End diff --

As recently fixed in #22705, we cannot allocate a Java array with 
`Int.MaxValue`. Could you please fix this, too?



---




[GitHub] spark issue #22754: [SPARK-25776][CORE][MINOR]The disk write buffer size mus...

2018-10-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22754
  
Thank you for your clarification.


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...

2018-10-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r226871894
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
 ---
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
   int fileBufferSize,
   ShuffleWriteMetrics writeMetrics,
   int numRecordsToWrite) throws IOException {
+// space used by prefix + len is (4 + 8) bytes
--- End diff --

Can we refine this comment to explain that more than 12 bytes are required?
For example, `space used by prefix + len + recordLength is more than 4 + 8 bytes`?


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22784
  
retest this please


---




[GitHub] spark issue #22765: [SPARK-25757][Build] Upgrade netty-all from 4.1.17.Final...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22765
  
LGTM, pending Jenkins


---




[GitHub] spark issue #22765: [SPARK-25757][Build] Upgrade netty-all from 4.1.17.Final...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22765
  
Retest this please


---




[GitHub] spark issue #22782: [HOTFIX] Fix PySpark pip packaging tests by non-ascii co...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22782
  
I will come back to this tomorrow morning, Japan time.


---




[GitHub] spark issue #22782: [HOTFIX] Fix PySpark pip packaging tests by non-ascii co...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22782
  
LGTM, pending Jenkins


---




[GitHub] spark issue #22782: [HOTFIX] Fix PySpark pip packaging tests by non-ascii co...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22782
  
Thank you for this hot fix. I found `0xc2` after `#` in `docker-image-tool.sh`, at the place @HyukjinKwon fixed.

```
> git log | head -1
commit fc9ba9dcc6ad47fbd05f093b94e7e1358d5f
[ishizaki@nx13] /home/ishizaki/Spark/PR/tmp/spark > od -c -t x1 
bin/docker-image-tool.sh  | grep -A 4 -B 4 c2
 75  61  6c  6c  79  20  62  65  65  6e  20  62  75  69  6c  74
0005200   /   i   s   a   r   u   n   n   a   b   l   e   d
 2f  69  73  20  61  20  72  75  6e  6e  61  62  6c  65  20  64
0005220   i   s   t   r   i   b   u   t   i   o   n  \n   # 302
 69  73  74  72  69  62  75  74  69  6f  6e  0a  20  20  23  c2
0005240 240   i   .   e   .   t   h   e   S   p   a   r   k
 a0  69  2e  65  2e  20  74  68  65  20  53  70  61  72  6b  20
0005260   J   A   R   s   t   h   a   t   t   h   e   D   o
 4a  41  52  73  20  74  68  61  74  20  74  68  65  20  44  6f
```
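
For reference, a small standalone sketch (not part of the Spark build scripts) that locates bytes outside the 7-bit ASCII range, such as the `0xc2 0xa0` non-breaking space above:

```
import java.nio.file.{Files, Paths}

// Print the offset and value of every non-ASCII byte in the file.
val bytes = Files.readAllBytes(Paths.get("bin/docker-image-tool.sh"))
bytes.zipWithIndex.foreach { case (b, i) =>
  val v = b & 0xff
  if (v > 0x7f) println(f"offset $i: 0x$v%02x")
}
```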


```
Installing collected packages: py4j, pyspark
  Running setup.py develop for pyspark
Complete output from command /tmp/tmp.EWtmCOYUBn/3.5/bin/python -c 
"import setuptools, 
tokenize;__file__='/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py';f=getattr(tokenize,
 'open', open)(__file__);code=f.read().replace('\r\n', 
'\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
running develop
running egg_info
writing dependency_links to pyspark.egg-info/dependency_links.txt
writing pyspark.egg-info/PKG-INFO
writing requirements to pyspark.egg-info/requires.txt
writing top-level names to pyspark.egg-info/top_level.txt
Could not import pypandoc - required to package PySpark
package init file 'deps/bin/__init__.py' not found (or not a regular 
file)
package init file 'deps/jars/__init__.py' not found (or not a regular 
file)
package init file 'pyspark/python/pyspark/__init__.py' not found (or 
not a regular file)
package init file 'lib/__init__.py' not found (or not a regular file)
package init file 'deps/data/__init__.py' not found (or not a regular 
file)
package init file 'deps/licenses/__init__.py' not found (or not a 
regular file)
package init file 'deps/examples/__init__.py' not found (or not a 
regular file)
reading manifest file 'pyspark.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found 
anywhere in distribution
warning: no previously-included files matching '__pycache__' found 
anywhere in distribution
warning: no previously-included files matching '.DS_Store' found 
anywhere in distribution
writing manifest file 'pyspark.egg-info/SOURCES.txt'
running build_ext
Creating 
/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/pyspark.egg-link (link to .)
Adding pyspark 3.0.0.dev0 to easy-install.pth file
Installing load-spark-env.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-class.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing beeline.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing find-spark-home.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing run-example script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-shell2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing pyspark script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing sparkR script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-sql script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-shell script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing beeline script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing find-spark-home script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing sparkR.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing run-example.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing sparkR2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-shell.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-sql.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-class2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py", line 224, in 

'Programming Language :: Python :: Implementation :: PyPy']
  File 
"/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-pa

[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833103
  
--- Diff: dev/run-tests.py ---
@@ -551,7 +551,8 @@ def main():
 if not changed_files or any(f.endswith(".scala")
 or f.endswith("scalastyle-config.xml")
 for f in changed_files):
-run_scala_style_checks()
+# run_scala_style_checks()
--- End diff --

Got it


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833042
  
--- Diff: python/pyspark/__init__.py ---
@@ -16,7 +16,7 @@
 #
 
 """
-PySpark is the Python API for Spark.
+PySpark is the Python API for Spark
--- End diff --

Is this intentional change?


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833028
  
--- Diff: dev/run-tests.py ---
@@ -551,7 +551,8 @@ def main():
 if not changed_files or any(f.endswith(".scala")
 or f.endswith("scalastyle-config.xml")
 for f in changed_files):
-run_scala_style_checks()
+# run_scala_style_checks()
--- End diff --

Is this change necessary? Or just tentative workaround?


---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22501
  
Thanks, I found `0xc2` in `docker-image-tool.sh`. I will put my findings 
into #22782.


---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22501
  
Is [this](https://github.com/apache/spark/pull/22748#issuecomment-431512558) the oldest test failure of this type?


---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22501
  
Thanks, when it was successful, this is a part of log from 
[this](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97378/consoleText)
```
copying pyspark/streaming/util.py -> pyspark-3.0.0.dev0/pyspark/streaming
Writing pyspark-3.0.0.dev0/setup.cfg
Creating tar archive
removing 'pyspark-3.0.0.dev0' (and everything under it)
Installing dist into virtual env
Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder/python
Collecting py4j==0.10.7 (from pyspark==3.0.0.dev0)
  Downloading 
https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl
 (197kB)
mkl-random 1.0.1 requires cython, which is not installed.
Installing collected packages: py4j, pyspark
  Running setup.py develop for pyspark
Successfully installed py4j-0.10.7 pyspark
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Run basic sanity check on pip installed version with spark-submit
```

Now, we are seeing the following
```
copying pyspark/streaming/util.py -> pyspark-3.0.0.dev0/pyspark/streaming
Writing pyspark-3.0.0.dev0/setup.cfg
Creating tar archive
removing 'pyspark-3.0.0.dev0' (and everything under it)
Installing dist into virtual env
Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder/python
Collecting py4j==0.10.7 (from pyspark==3.0.0.dev0)
  Downloading 
https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl
 (197kB)
mkl-random 1.0.1 requires cython, which is not installed.
Installing collected packages: py4j, pyspark
  Running setup.py develop for pyspark
Complete output from command /tmp/tmp.EWtmCOYUBn/3.5/bin/python -c 
"import setuptools, 
tokenize;__file__='/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py';f=getattr(tokenize,
 'open', open)(__file__);code=f.read().replace('\r\n', 
'\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
running develop
running egg_info
writing dependency_links to pyspark.egg-info/dependency_links.txt
writing pyspark.egg-info/PKG-INFO
writing requirements to pyspark.egg-info/requires.txt
writing top-level names to pyspark.egg-info/top_level.txt
Could not import pypandoc - required to package PySpark
package init file 'deps/bin/__init__.py' not found (or not a regular 
file)
package init file 'deps/jars/__init__.py' not found (or not a regular 
file)
package init file 'pyspark/python/pyspark/__init__.py' not found (or 
not a regular file)
package init file 'lib/__init__.py' not found (or not a regular file)
package init file 'deps/data/__init__.py' not found (or not a regular 
file)
package init file 'deps/licenses/__init__.py' not found (or not a 
regular file)
package init file 'deps/examples/__init__.py' not found (or not a 
regular file)
reading manifest file 'pyspark.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found 
anywhere in distribution
warning: no previously-included files matching '__pycache__' found 
anywhere in distribution
warning: no previously-included files matching '.DS_Store' found 
anywhere in distribution
writing manifest file 'pyspark.egg-info/SOURCES.txt'
running build_ext
Creating 
/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/pyspark.egg-link (link to .)
Adding pyspark 3.0.0.dev0 to easy-install.pth file
Installing load-spark-env.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-class.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing beeline.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing find-spark-home.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing run-example script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-shell2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing pyspark script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing sparkR script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-sql script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-shell script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing beeline script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing spark-submit2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
Installing find-spark-home script to /tmp/tmp.EW

[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22501
  
I am looking at each commit, from the latest to the oldest, at 
https://github.com/apache/spark/commits/master 


---




[GitHub] spark issue #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeR...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22750
  
You are right. Sorry, nvm
> DataSourceScanExec does not have needsUnsafeRowConversion


---




[GitHub] spark issue #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeR...

2018-10-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22750
  
retest this please


---




[GitHub] spark issue #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeR...

2018-10-19 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22750
  
I thought about the last line
> This PR removes ColumnarBatchScan.needsUnsafeRowConversion, and keep this 
flag only in FileSourceScanExec


---




[GitHub] spark issue #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeR...

2018-10-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22750
  
Do we need to update the description? For example, 
`needsUnsafeRowConversion` exists in `DataSourceScanExec` now.


---




[GitHub] spark issue #22754: [MINOR][CORE]The disk write buffer size must be greater ...

2018-10-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22754
  
Good catch.
One question: can we set this property to `12`?


---




[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226263066
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -0,0 +1,520 @@
+---
+layout: global
+title: Spark SQL Upgrading Guide
+displayTitle: Spark SQL Upgrading Guide
+---
+
+* Table of contents
+{:toc}
+
+## Upgrading From Spark SQL 2.4 to 3.0
+
+  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
+
+## Upgrading From Spark SQL 2.3 to 2.4
+
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
some change in behavior and are illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, left and right parameters are  promoted to 
array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), '1');
+  
+  
+true
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
+  
+
+
+  
+SELECT  array_contains(array(1), 'anystring');
+  
+  
+null
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
--- End diff --

`explict` -> `explicit`


---




[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226262995
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -0,0 +1,520 @@
+---
+layout: global
+title: Spark SQL Upgrading Guide
+displayTitle: Spark SQL Upgrading Guide
+---
+
+* Table of contents
+{:toc}
+
+## Upgrading From Spark SQL 2.4 to 3.0
+
+  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
+
+## Upgrading From Spark SQL 2.3 to 2.4
+
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
some change in behavior and are illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, left and right parameters are  promoted to 
array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), '1');
+  
+  
+true
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
--- End diff --

`explict` -> `explicit`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226250306
  
--- Diff: docs/sql-performance-turing.md ---
@@ -0,0 +1,151 @@
+---
+layout: global
+title: Performance Tuning
+displayTitle: Performance Tuning
+---
+
+* Table of contents
+{:toc}
+
+For some workloads, it is possible to improve performance by either 
caching data in memory, or by
+turning on some experimental options.
+
+## Caching Data In Memory
+
+Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
+Then Spark SQL will scan only required columns and will automatically tune 
compression to minimize
+memory usage and GC pressure. You can call 
`spark.catalog.uncacheTable("tableName")` to remove the table from memory.
+
+Configuration of in-memory caching can be done using the `setConf` method 
on `SparkSession` or by running
+`SET key=value` commands using SQL.
+
+
+Property Name | Default | Meaning
+
+  spark.sql.inMemoryColumnarStorage.compressed
+  true
+  
+When set to true Spark SQL will automatically select a compression 
codec for each column based
+on statistics of the data.
+  
+
+
+  spark.sql.inMemoryColumnarStorage.batchSize
+  10000
+  
+Controls the size of batches for columnar caching. Larger batch sizes 
can improve memory utilization
+and compression, but risk OOMs when caching data.
+  
+
+
+
+
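For illustration, a minimal PySpark sketch of caching and uncaching a table (the table name and data are placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1000).createOrReplaceTempView("myTable")  # placeholder table

# Cache the table in the in-memory columnar format, then release it.
spark.catalog.cacheTable("myTable")
spark.table("myTable").count()           # the first action materializes the cache
spark.catalog.uncacheTable("myTable")

# The properties above can also be set at runtime, e.g.:
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
{% endhighlight %}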
+## Other Configuration Options
+
+The following options can also be used to tune the performance of query 
execution. It is possible
+that these options will be deprecated in future release as more 
optimizations are performed automatically.
+
+
+  Property Name | Default | Meaning
+  
+spark.sql.files.maxPartitionBytes
+134217728 (128 MB)
+
+  The maximum number of bytes to pack into a single partition when 
reading files.
+
+  
+  
+spark.sql.files.openCostInBytes
+4194304 (4 MB)
+
+  The estimated cost to open a file, measured by the number of bytes 
could be scanned in the same
+  time. This is used when putting multiple files into a partition. It 
is better to over estimated,
--- End diff --

nit: `It is better to over estimated` -> ` It is better to over-estimate`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226247607
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -0,0 +1,520 @@
+---
+layout: global
+title: Spark SQL Upgrading Guide
+displayTitle: Spark SQL Upgrading Guide
+---
+
+* Table of contents
+{:toc}
+
+## Upgrading From Spark SQL 2.4 to 3.0
+
+  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
+
+## Upgrading From Spark SQL 2.3 to 2.4
+
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
a change in behavior, which is illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, left and right parameters are  promoted to 
array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), '1');
+  
+  
+true
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
+  
+
+
+  
+SELECT  array_contains(array(1), 'anystring');
+  
+  
+null
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
+  
+
+  
+
+  - Since Spark 2.4, when there is a struct field in front of the IN 
operator before a subquery, the inner query must contain a struct field as 
well. In previous versions, instead, the fields of the struct were compared to 
the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in 
Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, 
while `a in (select 1, 'a' from range(1))` is not. In previous version it was 
the opposite.
+  - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to 
true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly 
became case-sensitive and would resolve to columns (unless typed in lower 
case). In Spark 2.4 this has been fixed and the functions are no longer 
case-sensitive.
+  - Since Spark 2.4, Spark will evaluate the set operations referenced in 
a query by following a precedence rule as per the SQL standard. If the order is 
not specified by parentheses, set operations are performed from left to right 
with the exception that all INTERSECT operations are performed before any 
UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence 
to all the set operations are preserved under a newly added configuration 
`spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from 
left to right as they appear in the query given no explicit ordering is 
enforced by usage of parenthesis.
+  - Since Spark 2.4, Spark will display table description column Last 
Access value as UNKNOWN when the value was Jan 01 1970.
+  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
+  - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization

[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226246375
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -0,0 +1,520 @@
+---
+layout: global
+title: Spark SQL Upgrading Guide
+displayTitle: Spark SQL Upgrading Guide
+---
+
+* Table of contents
+{:toc}
+
+## Upgrading From Spark SQL 2.4 to 3.0
+
+  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
+
+## Upgrading From Spark SQL 2.3 to 2.4
+
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
a change in behavior, which is illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, left and right parameters are  promoted to 
array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), '1');
+  
+  
+true
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
+  
+
+
+  
+SELECT  array_contains(array(1), 'anystring');
+  
+  
+null
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
+  
+  
+Users can use explict cast
+  
+
+  
+
+  - Since Spark 2.4, when there is a struct field in front of the IN 
operator before a subquery, the inner query must contain a struct field as 
well. In previous versions, instead, the fields of the struct were compared to 
the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in 
Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, 
while `a in (select 1, 'a' from range(1))` is not. In previous version it was 
the opposite.
+  - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to 
true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly 
became case-sensitive and would resolve to columns (unless typed in lower 
case). In Spark 2.4 this has been fixed and the functions are no longer 
case-sensitive.
+  - Since Spark 2.4, Spark will evaluate the set operations referenced in 
a query by following a precedence rule as per the SQL standard. If the order is 
not specified by parentheses, set operations are performed from left to right 
with the exception that all INTERSECT operations are performed before any 
UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence 
to all the set operations are preserved under a newly added configuration 
`spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from 
left to right as they appear in the query given no explicit ordering is 
enforced by usage of parenthesis.
+  - Since Spark 2.4, Spark will display table description column Last 
Access value as UNKNOWN when the value was Jan 01 1970.
+  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
+  - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization

[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226245945
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -0,0 +1,520 @@
+---
+layout: global
+title: Spark SQL Upgrading Guide
+displayTitle: Spark SQL Upgrading Guide
+---
+
+* Table of contents
+{:toc}
+
+## Upgrading From Spark SQL 2.4 to 3.0
+
+  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
--- End diff --

`the builder come` -> `the builder comes`?
cc @ueshin


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226241683
  
--- Diff: docs/sql-distributed-sql-engine.md ---
@@ -0,0 +1,85 @@
+---
+layout: global
+title: Distributed SQL Engine
+displayTitle: Distributed SQL Engine
+---
+
+* Table of contents
+{:toc}
+
+Spark SQL can also act as a distributed query engine using its JDBC/ODBC 
or command-line interface.
+In this mode, end-users or applications can interact with Spark SQL 
directly to run SQL queries,
+without the need to write any code.
+
+## Running the Thrift JDBC/ODBC server
+
+The Thrift JDBC/ODBC server implemented here corresponds to the 
[`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
+in Hive 1.2.1. You can test the JDBC server with the beeline script that 
comes with either Spark or Hive 1.2.1.
+
+To start the JDBC/ODBC server, run the following in the Spark directory:
+
+./sbin/start-thriftserver.sh
+
+This script accepts all `bin/spark-submit` command line options, plus a 
`--hiveconf` option to
+specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` 
for a complete list of
all available options. By default, the server listens on localhost:10000. 
You may override this
+behaviour via either environment variables, i.e.:
+
+{% highlight bash %}
+export HIVE_SERVER2_THRIFT_PORT=<listening-port>
+export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
+./sbin/start-thriftserver.sh \
+  --master <master-uri> \
+  ...
+{% endhighlight %}
+
+or system properties:
+
+{% highlight bash %}
+./sbin/start-thriftserver.sh \
+  --hiveconf hive.server2.thrift.port=<listening-port> \
+  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
+  --master <master-uri>
+  ...
+{% endhighlight %}
+
+Now you can use beeline to test the Thrift JDBC/ODBC server:
+
+./bin/beeline
+
+Connect to the JDBC/ODBC server in beeline with:
+
+beeline> !connect jdbc:hive2://localhost:10000
+
+Beeline will ask you for a username and password. In non-secure mode, 
simply enter the username on
+your machine and a blank password. For secure mode, please follow the 
instructions given in the
+[beeline 
documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
+
+Configuration of Hive is done by placing your `hive-site.xml`, 
`core-site.xml` and `hdfs-site.xml` files in `conf/`.
+
+You may also use the beeline script that comes with Hive.
+
+Thrift JDBC server also supports sending thrift RPC messages over HTTP 
transport.
+Use the following setting to enable HTTP mode as system property or in 
`hive-site.xml` file in `conf/`:
+
+hive.server2.transport.mode - Set this to value: http
+hive.server2.thrift.http.port - HTTP port number to listen on; default 
is 10001
+hive.server2.http.endpoint - HTTP endpoint; default is cliservice
+
+To test, use beeline to connect to the JDBC/ODBC server in http mode with:
+
+beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
+
+
+## Running the Spark SQL CLI
+
+The Spark SQL CLI is a convenient tool to run the Hive metastore service 
in local mode and execute
+queries input from the command line. Note that the Spark SQL CLI cannot 
talk to the Thrift JDBC server.
+
+To start the Spark SQL CLI, run the following in the Spark directory:
+
+./bin/spark-sql
+
+Configuration of Hive is done by placing your `hive-site.xml`, 
`core-site.xml` and `hdfs-site.xml` files in `conf/`.
+You may run `./bin/spark-sql --help` for a complete list of all available
+options.
--- End diff --

super nit: this line can be concatenated with the previous line.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226239048
  
--- Diff: docs/sql-data-sources-parquet.md ---
@@ -0,0 +1,321 @@
+---
+layout: global
+title: Parquet Files
+displayTitle: Parquet Files
+---
+
+* Table of contents
+{:toc}
+
+[Parquet](http://parquet.io) is a columnar format that is supported by 
many other data processing systems.
+Spark SQL provides support for both reading and writing Parquet files that 
automatically preserves the schema
+of the original data. When writing Parquet files, all columns are 
automatically converted to be nullable for
+compatibility reasons.
+
+### Loading Data Programmatically
+
+Using the data from the above example:
+
+
+
+
+{% include_example basic_parquet_example 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example basic_parquet_example 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+
+{% include_example basic_parquet_example python/sql/datasource.py %}
+
+
+
+
+{% include_example basic_parquet_example r/RSparkSQLExample.R %}
+
+
+
+
+
+{% highlight sql %}
+
+CREATE TEMPORARY VIEW parquetTable
+USING org.apache.spark.sql.parquet
+OPTIONS (
+  path "examples/src/main/resources/people.parquet"
+)
+
+SELECT * FROM parquetTable
+
+{% endhighlight %}
+
+
+
+
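Because the language-specific snippets are only referenced above, here is a minimal PySpark sketch of writing and reading Parquet (the path and data are placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
peopleDF = spark.createDataFrame([("Alice", 29), ("Bob", 31)], ["name", "age"])

# Write the DataFrame out as Parquet (the schema is preserved), then read it back.
peopleDF.write.parquet("people.parquet")             # placeholder path
parquetFileDF = spark.read.parquet("people.parquet")

parquetFileDF.createOrReplaceTempView("parquetFile")
spark.sql("SELECT name FROM parquetFile WHERE age > 30").show()
{% endhighlight %}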
+
+### Partition Discovery
+
+Table partitioning is a common optimization approach used in systems like 
Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning 
column values encoded in
+the path of each partition directory. All built-in file sources (including 
Text/CSV/JSON/ORC/Parquet)
+are able to discover and infer partitioning information automatically.
+For example, we can store all our previously used
+population data into a partitioned table using the following directory 
structure, with two extra
+columns, `gender` and `country` as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+└── table
+├── gender=male
+│   ├── ...
+│   │
+│   ├── country=US
+│   │   └── data.parquet
+│   ├── country=CN
+│   │   └── data.parquet
+│   └── ...
+└── gender=female
+    ├── ...
+    │
+    ├── country=US
+    │   └── data.parquet
+    ├── country=CN
+    │   └── data.parquet
+    └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SparkSession.read.parquet` or 
`SparkSession.read.load`, Spark SQL
+will automatically extract the partitioning information from the paths.
+Now the schema of the returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- gender: string (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
+
+Notice that the data types of the partitioning columns are automatically 
inferred. Currently,
+numeric data types, date, timestamp and string type are supported. 
Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For 
these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default 
to `true`. When type
+inference is disabled, string type will be used for the partitioning 
columns.
+
+Starting from Spark 1.6.0, partition discovery only finds partitions under 
the given paths
+by default. For the above example, if users pass 
`path/to/table/gender=male` to either
+`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not 
be considered as a
+partitioning column. If users need to specify the base path that partition 
discovery
+should start with, they can set `basePath` in the data source options. For 
example,
+when `path/to/table/gender=male` is the path of the data and
+users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
+
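For illustration, a minimal PySpark sketch of the `basePath` option described above (the paths are placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a single partition directory, but pass basePath so that `gender`
# is still recognized as a partitioning column.
df = (spark.read
      .option("basePath", "path/to/table/")         # placeholder base directory
      .parquet("path/to/table/gender=male"))
df.printSchema()  # the schema includes the inferred `gender` column
{% endhighlight %}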
+### Schema Merging
+
+Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema 
evolution. Users can start with
+a simple schema, and gradually add more columns to the schema as needed. 
In this way, users may end
+up with multiple Parquet files with different but mutually compatible 
schemas. T

[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226237047
  
--- Diff: docs/sql-data-sources-parquet.md ---
@@ -0,0 +1,321 @@
+---
+layout: global
+title: Parquet Files
+displayTitle: Parquet Files
+---
+
+* Table of contents
+{:toc}
+
+[Parquet](http://parquet.io) is a columnar format that is supported by 
many other data processing systems.
+Spark SQL provides support for both reading and writing Parquet files that 
automatically preserves the schema
+of the original data. When writing Parquet files, all columns are 
automatically converted to be nullable for
+compatibility reasons.
+
+### Loading Data Programmatically
+
+Using the data from the above example:
+
+
+
+
+{% include_example basic_parquet_example 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example basic_parquet_example 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+
+{% include_example basic_parquet_example python/sql/datasource.py %}
+
+
+
+
+{% include_example basic_parquet_example r/RSparkSQLExample.R %}
+
+
+
+
+
+{% highlight sql %}
+
+CREATE TEMPORARY VIEW parquetTable
+USING org.apache.spark.sql.parquet
+OPTIONS (
+  path "examples/src/main/resources/people.parquet"
+)
+
+SELECT * FROM parquetTable
+
+{% endhighlight %}
+
+
+
+
+
+### Partition Discovery
+
+Table partitioning is a common optimization approach used in systems like 
Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning 
column values encoded in
+the path of each partition directory. All built-in file sources (including 
Text/CSV/JSON/ORC/Parquet)
+are able to discover and infer partitioning information automatically.
+For example, we can store all our previously used
+population data into a partitioned table using the following directory 
structure, with two extra
+columns, `gender` and `country` as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+└── table
+├── gender=male
+│   ├── ...
+│   │
+│   ├── country=US
+│   │   └── data.parquet
+│   ├── country=CN
+│   │   └── data.parquet
+│   └── ...
+└── gender=female
+    ├── ...
+    │
+    ├── country=US
+    │   └── data.parquet
+    ├── country=CN
+    │   └── data.parquet
+    └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SparkSession.read.parquet` or 
`SparkSession.read.load`, Spark SQL
+will automatically extract the partitioning information from the paths.
+Now the schema of the returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- gender: string (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
+
+Notice that the data types of the partitioning columns are automatically 
inferred. Currently,
+numeric data types, date, timestamp and string type are supported. 
Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For 
these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default 
to `true`. When type
+inference is disabled, string type will be used for the partitioning 
columns.
+
+Starting from Spark 1.6.0, partition discovery only finds partitions under 
the given paths
+by default. For the above example, if users pass 
`path/to/table/gender=male` to either
+`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not 
be considered as a
+partitioning column. If users need to specify the base path that partition 
discovery
+should start with, they can set `basePath` in the data source options. For 
example,
+when `path/to/table/gender=male` is the path of the data and
+users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
+
+### Schema Merging
+
+Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema 
evolution. Users can start with
--- End diff --

`ProtocolBuffer` -> `Protocol Buffers`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226235672
  
--- Diff: docs/sql-data-sources-load-save-functions.md ---
@@ -0,0 +1,283 @@
+---
+layout: global
+title: Generic Load/Save Functions
+displayTitle: Generic Load/Save Functions
+---
+
+* Table of contents
+{:toc}
+
+
+In the simplest form, the default data source (`parquet` unless otherwise 
configured by
+`spark.sql.sources.default`) will be used for all operations.
+
+
+
+
+{% include_example generic_load_save_functions 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example generic_load_save_functions 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+
+{% include_example generic_load_save_functions python/sql/datasource.py %}
+
+
+
+
+{% include_example generic_load_save_functions r/RSparkSQLExample.R %}
+
+
+
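As a minimal PySpark sketch of using the default data source (the file and column names follow the standard Spark example data and are otherwise placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load()/save() without an explicit format use spark.sql.sources.default (parquet by default).
df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}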
+
+### Manually Specifying Options
+
+You can also manually specify the data source that will be used along with 
any extra options
+that you would like to pass to the data source. Data sources are specified 
by their fully qualified
+name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you 
can also use their short
+names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). 
DataFrames loaded from any data
+source type can be converted into other types using this syntax.
+
+To load a JSON file you can use:
+
+
+
+{% include_example manual_load_options 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example manual_load_options 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+{% include_example manual_load_options python/sql/datasource.py %}
+
+
+
+{% include_example manual_load_options r/RSparkSQLExample.R %}
+
+
+
+To load a CSV file you can use:
+
+
+
+{% include_example manual_load_options_csv 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example manual_load_options_csv 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+{% include_example manual_load_options_csv python/sql/datasource.py %}
+
+
+
+{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
+
+
+
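A minimal PySpark sketch of manually specifying the format and options (the file paths are placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON with an explicit format name.
jsonDF = spark.read.format("json").load("examples/src/main/resources/people.json")

# CSV with a few common options.
csvDF = (spark.read.format("csv")
         .option("sep", ";")
         .option("inferSchema", "true")
         .option("header", "true")
         .load("examples/src/main/resources/people.csv"))
{% endhighlight %}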
+
+### Run SQL on files directly
+
+Instead of using read API to load a file into DataFrame and query it, you 
can also query that
+file directly with SQL.
+
+
+
+{% include_example direct_sql 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example direct_sql 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+{% include_example direct_sql python/sql/datasource.py %}
+
+
+
+{% include_example direct_sql r/RSparkSQLExample.R %}
+
+
+
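A minimal PySpark sketch of querying a file directly (the path is a placeholder):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The data source name qualifies the backticked file path in the FROM clause.
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
df.show()
{% endhighlight %}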
+
+### Save Modes
+
+Save operations can optionally take a `SaveMode`, that specifies how to 
handle existing data if
+present. It is important to realize that these save modes do not utilize 
any locking and are not
+atomic. Additionally, when performing an `Overwrite`, the data will be 
deleted before writing out the
+new data.
+
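For example, a minimal PySpark sketch of passing a save mode (the output path is a placeholder):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# In Python the string form of the mode is passed to mode();
# Scala/Java can use the SaveMode constants instead.
df.write.mode("overwrite").parquet("/tmp/save-mode-demo")   # replaces existing data
df.write.mode("ignore").parquet("/tmp/save-mode-demo")      # no-op because data already exists
{% endhighlight %}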
+
+Scala/Java | Any Language | Meaning
+
+  SaveMode.ErrorIfExists (default)
+  "error" or "errorifexists" (default)
+  
+When saving a DataFrame to a data source, if data already exists,
+an exception is expected to be thrown.
+  
+
+
+  SaveMode.Append
+  "append"
+  
+When saving a DataFrame to a data source, if data/table already exists,
+contents of the DataFrame are expected to be appended to existing data.
+  
+
+
+  SaveMode.Overwrite
+  "overwrite"
+  
+Overwrite mode means that when saving a DataFrame to a data source,
+if data/table already exists, existing data is expected to be 
overwritten by the contents of
+the DataFrame.
+  
+
+
+  SaveMode.Ignore
+  "ignore"
+  
+Ignore mode means that when saving a DataFrame to a data source, if 
data already exists,
+the save operation is expected to not save the contents of the 
DataFrame and to not
--- End diff --

nit: `expected to not ... to not ...` -> `expected not to ... not to ...`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226231876
  
--- Diff: docs/sql-data-sources-jdbc.md ---
@@ -0,0 +1,223 @@
+---
+layout: global
+title: JDBC To Other Databases
+displayTitle: JDBC To Other Databases
+---
+
+* Table of contents
+{:toc}
+
+Spark SQL also includes a data source that can read data from other 
databases using JDBC. This
+functionality should be preferred over using 
[JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD).
+This is because the results are returned
+as a DataFrame and they can easily be processed in Spark SQL or joined 
with other data sources.
+The JDBC data source is also easier to use from Java or Python as it does 
not require the user to
+provide a ClassTag.
+(Note that this is different than the Spark SQL JDBC server, which allows 
other applications to
+run queries using Spark SQL).
+
+To get started you will need to include the JDBC driver for your 
particular database on the
+spark classpath. For example, to connect to postgres from the Spark Shell 
you would run the
+following command:
+
+{% highlight bash %}
+bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars 
postgresql-9.4.1207.jar
+{% endhighlight %}
+
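For illustration, a minimal PySpark sketch of such a read (the URL, table name, and credentials are placeholders):

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbcDF = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://localhost/test")  # placeholder URL
          .option("dbtable", "schema.tablename")              # placeholder table
          .option("user", "username")
          .option("password", "password")
          .load())
{% endhighlight %}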
+Tables from the remote database can be loaded as a DataFrame or Spark SQL 
temporary view using
+the Data Sources API. Users can specify the JDBC connection properties in 
the data source options.
+user and password are normally provided as 
connection properties for
+logging into the data sources. In addition to the connection properties, 
Spark also supports
+the following case-insensitive options:
+
+
+  Property Name | Meaning
+  
+url
+
+  The JDBC URL to connect to. The source-specific connection 
properties may be specified in the URL. e.g., 
jdbc:postgresql://localhost/test?user=fred&password=secret
+
+  
+
+  
+dbtable
+
+  The JDBC table that should be read from or written into. Note that 
when using it in the read
+  path anything that is valid in a FROM clause of a SQL 
query can be used.
+  For example, instead of a full table you could also use a subquery 
in parentheses. It is not
+  allowed to specify `dbtable` and `query` options at the same time.
+
+  
+  
+query
+
+  A query that will be used to read data into Spark. The specified 
query will be parenthesized and used
+  as a subquery in the FROM clause. Spark will also 
assign an alias to the subquery clause.
+  As an example, spark will issue a query of the following form to the 
JDBC Source.
+   SELECT columns FROM (user_specified_query) 
spark_gen_alias
+  Below are a couple of restrictions while using this option.
+  
+  It is not allowed to specify `dbtable` and `query` options 
at the same time. 
+  It is not allowed to spcify `query` and `partitionColumn` 
options at the same time. When specifying
--- End diff --

`spcify` -> `specify`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

2018-10-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22746#discussion_r226227872
  
--- Diff: docs/sql-data-sources.md ---
@@ -0,0 +1,42 @@
+---
+layout: global
+title: Data Sources
+displayTitle: Data Sources
+---
+
+
+Spark SQL supports operating on a variety of data sources through the 
DataFrame interface.
+A DataFrame can be operated on using relational transformations and can 
also be used to create a temporary view.
+Registering a DataFrame as a temporary view allows you to run SQL queries 
over its data. This section
+describes the general methods for loading and saving data using the Spark 
Data Sources and then
+goes into specific options that are available for the built-in data 
sources.
+
+
+* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
+  * [Manually Sepcifying 
Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options)
--- End diff --

`sepcifying` -> `specifying`. In other places, too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22755: [SPARK-25755][SQL][Test] Supplementation of non-CodeGen ...

2018-10-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22755
  
cc @maropu



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22617: [SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsa...

2018-10-17 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22617
  
Retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22729: [SPARK-25737][CORE] Remove JavaSparkContextVarargsWorkar...

2018-10-17 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22729
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22705: [SPARK-25704][CORE] Allocate a bit less than Int.MaxValu...

2018-10-17 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22705
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22705: [SPARK-25704][CORE] Allocate a bit less than Int.MaxValu...

2018-10-16 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22705
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22708: [SPARK-21402][SQL] Fix java array of structs deserializa...

2018-10-16 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22708
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22708: [SPARK-21402] Fix java array/map of structs deser...

2018-10-16 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22708#discussion_r225440522
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -30,6 +30,7 @@ import org.apache.spark.serializer._
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, 
ScalaReflection}
 import org.apache.spark.sql.catalyst.ScalaReflection.universe.TermName
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
--- End diff --

nit: it does not seem to be necessary.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22662: [SPARK-25627][TEST] Reduce test time for Continuo...

2018-10-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22662#discussion_r225000849
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala
 ---
@@ -259,10 +259,10 @@ class ContinuousStressSuite extends 
ContinuousSuiteBase {
 testStream(df, useV2Sink = true)(
   StartStream(Trigger.Continuous(2012)),
--- End diff --

Got it, thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-10-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21537
  
@HyukjinKwon sorry for being late. I was swampped with several things. I 
have just submitted it. Looking forward to seeing feedback.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22662: [SPARK-25627][TEST] Reduce test time for Continuo...

2018-10-13 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22662#discussion_r224977443
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala
 ---
@@ -259,10 +259,10 @@ class ContinuousStressSuite extends 
ContinuousSuiteBase {
 testStream(df, useV2Sink = true)(
   StartStream(Trigger.Continuous(2012)),
--- End diff --

just curios: Is this value still `2012`? Can a reduction to `1012` reduce 
time?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22705: [SPARK-25704][CORE][WIP] Allocate a bit less than...

2018-10-13 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22705#discussion_r224977082
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -195,7 +196,11 @@ object ChunkedByteBuffer {
 val is = new FileInputStream(file)
 ByteStreams.skipFully(is, offset)
 val in = new LimitedInputStream(is, length)
-val chunkSize = math.min(maxChunkSize, length).toInt
+// Though in theory you should be able to index into an array of size 
Int.MaxValue, in practice
+// jvms don't let you go up to limit.  It seems you may only need - 2, 
but we leave a little
+// extra room.
+val maxArraySize = Int.MaxValue - 512
--- End diff --

We already had seen the similar problem in other places. Can we use [this 
value](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L52)
 here, too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22711: [SPARK-25714][SQL][followup] improve the comment inside ...

2018-10-13 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22711
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22698: [SPARK-25710][SQL] range should report metrics correctly

2018-10-12 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22698
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22678: [SPARK-25685][BUILD] Allow running tests in Jenkins in e...

2018-10-12 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22678
  
LGTM, pending Jenkins


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22678: [SPARK-25685][BUILD] Allow running tests in Jenkins in e...

2018-10-12 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22678
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22678: [SPARK-25685][BUILD] Allow running tests in Jenkins in e...

2018-10-12 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22678
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...

2018-10-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22678#discussion_r224549100
  
--- Diff: docs/building-spark.md ---
@@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 
2.12.6):
 ./build/sbt -Dscala.version=2.12.6
 
 Otherwise, the sbt-pom-reader plugin will use the `scala.version` 
specified in the spark-parent pom.
+
+## Running Jenkins tests with Github Enterprise
+
+To run tests with Jenkins:
+
+./dev/run-tests-jenkins
+
+If use an individual repository or an GitHub Enterprise, export below 
environment variables before running above command.
+
+### Related environment variables
+
+
+Variable Name | Default | Meaning
+
+  GITHUB_API_BASE
+  https://api.github.com/repos/apache/spark
+  
+The GitHub server API URL. It could be pointed to an GitHub Enterprise.
--- End diff --

nit: `an` -> `a`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...

2018-10-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22678#discussion_r224549012
  
--- Diff: docs/building-spark.md ---
@@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 
2.12.6):
 ./build/sbt -Dscala.version=2.12.6
 
 Otherwise, the sbt-pom-reader plugin will use the `scala.version` 
specified in the spark-parent pom.
+
+## Running Jenkins tests with Github Enterprise
+
+To run tests with Jenkins:
+
+./dev/run-tests-jenkins
+
+If use an individual repository or an GitHub Enterprise, export below 
environment variables before running above command.
--- End diff --

nit: `an` -> `a`
In addition, how about `an GitHub Enterprise` -> `a repository on GitHub 
Enterprise`?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22657: [SPARK-25670][TEST] Reduce number of tested timezones in...

2018-10-11 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22657
  
Sorry for bothering you again. Do we need to apply the same reduction to 
`CastSuite`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22657: [SPARK-25670][TEST] Reduce number of tested timezones in...

2018-10-11 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22657
  
How about defining this subset in `object DateTimeTestUtils` like 
`ALL_TIMEZONES`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


