[GitHub] spark issue #22160: Revert "[SPARK-24418][BUILD] Upgrade Scala to 2.11.12 an...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22160
  
@dbtsai Have you tried running it with Scala 2.12? We can still do the upgrade 
after the Apache 2.4 RC. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
@mallman Thanks!


---




[GitHub] spark issue #22160: Revert "[SPARK-24418][BUILD] Upgrade Scala to 2.11.12 an...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22160
  
Thank you, @dbtsai! Let's see whether this can pass all the tests.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
Try this when `spark.sql.nestedSchemaPruning.enabled` is on?
```scala
withTable("t1") {
  spark.sql(
    """
      |CREATE TABLE t1 (`id` INT, `CoL1` STRING,
      |`coL2` STRUCT<`a`: TIMESTAMP, `b`: INT, `c`: ARRAY>,
      |`col3` INT,
      |`col4` INT,
      |`Col5` STRUCT<`a`: STRING, `b`: STRUCT<`a`: INT, `b`: ARRAY, `c`: INT>, `c`: INT>)
      |USING parquet
    """.stripMargin)
  spark.sql("SELECT id FROM t1 WHERE col2.b < 7").count()
}
```
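For readers following along: nested-schema pruning means the reader keeps only the struct fields a query actually touches (here `col2.b`) instead of materializing whole structs. A rough, framework-independent sketch of the idea in plain Python (the function name and dotted-path convention are illustrative, not Spark APIs):

```python
def prune_nested(record, required_paths):
    """Keep only the fields of `record` named by dotted paths like 'col2.b'."""
    pruned = {}
    for path in required_paths:
        src, dst = record, pruned
        parts = path.split(".")
        for i, part in enumerate(parts):
            if part not in src:
                break  # field absent in this record: skip the path
            if i == len(parts) - 1:
                dst[part] = src[part]  # leaf: copy the value
            else:
                src = src[part]                 # descend into the nested struct
                dst = dst.setdefault(part, {})  # mirror the nesting in the output
    return pruned

row = {"id": 1, "col2": {"a": "ts", "b": 5, "c": [1, 2]}, "col3": 9}
print(prune_nested(row, ["id", "col2.b"]))  # {'id': 1, 'col2': {'b': 5}}
```

A columnar format like Parquet can do this pruning at read time, which is why the test above checks that only `id` and `col2.b` are fetched.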




---




[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211287978
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
--- End diff --

We also need a test case checking that `enforceSchema` works well when column pruning is on.


---




[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211283824
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -227,10 +210,9 @@ object TextInputCSVDataSource extends CSVDataSource {
   // Note: if there are only comments in the first block, the header would probably
   // be not extracted.
   CSVUtils.extractHeader(lines, parser.options).foreach { header =>
-CSVDataSource.checkHeader(
-  header,
-  parser.tokenizer,
-  dataSchema,
+CSVDataSource.checkHeaderColumnNames(
+  if (columnPruning) requiredSchema else dataSchema,
+  parser.tokenizer.parseLine(header),
--- End diff --

Nit: the following code style is preferred:
```scala
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
...
```
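For context, the call being discussed validates a CSV header against the schema that will actually be read (the pruned `requiredSchema` when column pruning is on). A minimal sketch of such a check in plain Python; the function name, comparison rule, and message are illustrative, not Spark's actual implementation:

```python
def check_header_column_names(schema_fields, header_columns, enforce_schema=False):
    """Return None if the header matches the expected fields, else an error message.

    When enforce_schema is set, mismatches are tolerated (the schema wins),
    in the spirit of the CSV reader's enforceSchema option.
    """
    if enforce_schema:
        return None
    # Compare case-insensitively, ignoring surrounding whitespace in the header.
    if [c.strip().lower() for c in header_columns] != [f.lower() for f in schema_fields]:
        return ("CSV header does not conform to the schema: "
                f"expected {schema_fields}, found {header_columns}")
    return None

# With column pruning on, only the required (projected) fields are checked.
required_schema = ["id", "name"]
print(check_header_column_names(required_schema, ["id", "name"]))  # None (header matches)
print(check_header_column_names(required_schema, ["id", "age"]))   # error message
```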


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21909
  
LGTM.

Thanks for patiently addressing all the comments! Merged to master. 


---




[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...

2018-08-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22123
  
cc @MaxGekk 


---




[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20226
  
@maropu Could you take this over?


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r211045699
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala ---
@@ -223,7 +224,8 @@ object MultiLineJsonDataSource extends JsonDataSource {
   input => parser.parse[InputStream](input, streamParser, partitionedFileString),
   parser.options.parseMode,
   schema,
-  parser.options.columnNameOfCorruptRecord)
+  parser.options.columnNameOfCorruptRecord,
+  optimizeEmptySchema = false)
--- End diff --

Could we rename `optimizeEmptySchema` to `isMultiLine`?


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r211045061
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1492,6 +1492,15 @@ object SQLConf {
 "This usually speeds up commands that need to list many directories.")
   .booleanConf
   .createWithDefault(true)
+
+  val BYPASS_PARSER_FOR_EMPTY_SCHEMA =
+buildConf("spark.sql.legacy.bypassParserForEmptySchema")
--- End diff --

If there is no behavior change, do we still need this conf?


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22121
  
We also need to document the extra enhancements added in this release 
compared with the databricks/spark-avro package. 


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22121
  
We should do the same thing for the other native sources. 


---




[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r210970616
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,267 @@
+---
+layout: global
+title: Avro Data Source Guide
+---
+
Since the Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
--- End diff --

`support` -> `built-in support`


---




[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22134
  
Do we need it at the current stage? Regarding UX, it looks complex to end 
users. I can't remember the names myself, and it is very easy to provide a 
wrong class name. 
```
"spark.sql.datasource.map.org.apache.spark.sql.avro" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.csv" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.avro" -> testDataSource.getCanonicalName)
```


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r210959550
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -637,6 +638,17 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive support enabled. " +
 "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
 "'native'")
+} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
+  provider1 == "com.databricks.spark.avro" ||
+  provider1 == "org.apache.spark.sql.avro") {
+  throw new AnalysisException(
+s"Failed to find data source: $provider1. Avro is built-in data source since " +
+"Spark 2.4. Please deploy the application as per " +
+s"$latestDocsURL/avro-data-source-guide.html#deploying")
--- End diff --

We can update the message later. No need to be blocked by that. 


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
@kiszk Please create a JIRA and we can post more ideas there. Thanks!


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22121
  
@gengliangwang Could you also post the screenshot in your PR description?



---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
@kiszk The initial prototype or proof of concept can live in any personal 
branch. When we merge it to the master branch, we still need to keep it 
separate from the current codegen and make it configurable. After the release, 
users can choose which one to use. Once the new IR is stable, we can then 
consider deprecating the current one. This is mainly for product stability; we 
need to follow a similar principle for any big project. 

@viirya @mgaido91 Let us first focus on the new IR design and prototype. 


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r210767018
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
 checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
   }
 
-
   test("SPARK-23723: specified encoding is not matched to actual encoding") {
-val fileName = "test-data/utf16LE.json"
-val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
-val exception = intercept[SparkException] {
-  spark.read.schema(schema)
-.option("mode", "FAILFAST")
-.option("multiline", "true")
-.options(Map("encoding" -> "UTF-16BE"))
-.json(testFile(fileName))
-.count()
+def doCount(bypassParser: Boolean, multiLine: Boolean): Long = {
+  var result: Long = -1
+  withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) {
+val fileName = "test-data/utf16LE.json"
+val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
+result = spark.read.schema(schema)
+  .option("mode", "FAILFAST")
--- End diff --

This sounds good! Let us enable it only when PERMISSIVE mode is on. As you know, 
our default mode is PERMISSIVE, so this should benefit most users. 


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r210765672
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1894,6 +1894,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
   - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+  - Since Spark 2.4, text-based datasources like CSV and JSON don't parse input lines if the required schema pushed down to the datasources is empty. The schema can be empty in the case of the count() action. For example, Spark 2.3 and earlier versions failed on JSON files with invalid encoding but Spark 2.4 returns total number of lines in the file. To restore the previous behavior when the underlying parser is always invoked even for the empty schema, set `true` to `spark.sql.legacy.bypassParserForEmptySchema`. This option will be removed in Spark 3.0.
--- End diff --

Is it right based on what you said in 
https://github.com/apache/spark/pull/21909#discussion_r210704902?
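The quoted migration note says that when the pushed-down schema is empty (as it is for `count()`), Spark 2.4 skips the parser entirely and just counts input lines. A toy illustration, in plain Python rather than Spark code, of why the two paths can disagree on malformed input:

```python
import json

def count_with_parser(lines):
    """Pre-2.4 style: every line goes through the parser; bad lines raise."""
    n = 0
    for line in lines:
        json.loads(line)  # raises ValueError (JSONDecodeError) on malformed input
        n += 1
    return n

def count_bypassing_parser(lines):
    """2.4 style for an empty required schema: no parsing, just count the lines."""
    return sum(1 for _ in lines)

lines = ['{"firstName": "Chris"}', "not valid json", '{"firstName": "Pat"}']
print(count_bypassing_parser(lines))  # 3 - the malformed line is never inspected
try:
    count_with_parser(lines)
except ValueError:
    print("parser path fails on the malformed line")
```

This is the behavior difference the legacy flag is meant to let users opt out of.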


---




[GitHub] spark issue #22108: [SPARK-25092][SQL][FOLLOWUP] Add RewriteCorrelatedScalar...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22108
  
Thanks! Merged to master.


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r210693829
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
 checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
   }
 
-
   test("SPARK-23723: specified encoding is not matched to actual encoding") {
-val fileName = "test-data/utf16LE.json"
-val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
-val exception = intercept[SparkException] {
-  spark.read.schema(schema)
-.option("mode", "FAILFAST")
-.option("multiline", "true")
-.options(Map("encoding" -> "UTF-16BE"))
-.json(testFile(fileName))
-.count()
+def doCount(bypassParser: Boolean, multiLine: Boolean): Long = {
+  var result: Long = -1
+  withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) {
+val fileName = "test-data/utf16LE.json"
+val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
+result = spark.read.schema(schema)
+  .option("mode", "FAILFAST")
--- End diff --

Does the mode matter? What happened when users used `DROPMALFORMED` before 
this PR? 


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r210666117
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1492,6 +1492,15 @@ object SQLConf {
 "This usually speeds up commands that need to list many directories.")
   .booleanConf
   .createWithDefault(true)
+
+  val BYPASS_PARSER_FOR_EMPTY_SCHEMA =
+buildConf("spark.sql.legacy.bypassParserForEmptySchema")
+  .doc("If required schema passed to a text datasource is empty, the parameter controls " +
+"invocation of underlying parser. For example, if it is set to false, uniVocity parser " +
+"is invoke by CSV datasource or Jackson parser by JSON datasource. By default, it is set " +
--- End diff --

`invoke` -> `invoked`


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
@HyukjinKwon I am worried about the design being a mixture of the s"" and 
code"" representations. When the design is not good, it is hard to maintain and 
to add new code on top of it. Let @cloud-fan decide whether we should revert it 
or not.

@mgaido91 The doc you posted is one of the problems the IR can resolve. 
However, we need high-level and low-level design docs for building a new IR in 
codegen. 

@kiszk Thanks for leading this effort! You are the best candidate to lead 
this in the community. 

@viirya Thanks for your professional reply! Have a good rest. 

A general suggestion: to avoid introducing regressions, how about 
implementing a new one without changing the existing one? We can easily switch 
to the new one when it becomes stable. Thanks all for making our codegen 
better! 


---




[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...

2018-08-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21868#discussion_r210494497
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -425,12 +426,44 @@ case class FileSourceScanExec(
   fsRelation: HadoopFsRelation): RDD[InternalRow] = {
 val defaultMaxSplitBytes =
   fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
-val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
+var openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
 val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
 val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
 val bytesPerCore = totalBytes / defaultParallelism
 
-val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+var maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+
+if (fsRelation.sparkSession.sessionState.conf.isParquetSizeAdaptiveEnabled &&
+  (fsRelation.fileFormat.isInstanceOf[ParquetSource] ||
+fsRelation.fileFormat.isInstanceOf[OrcFileFormat])) {
+  if (relation.dataSchema.map(_.dataType).forall(dataType =>
+dataType.isInstanceOf[CalendarIntervalType] || dataType.isInstanceOf[StructType]
+  || dataType.isInstanceOf[MapType] || dataType.isInstanceOf[NullType]
+  || dataType.isInstanceOf[AtomicType] || dataType.isInstanceOf[ArrayType])) {
+
+def getTypeLength(dataType: DataType): Int = {
+  if (dataType.isInstanceOf[StructType]) {
+fsRelation.sparkSession.sessionState.conf.parquetStructTypeLength
+  } else if (dataType.isInstanceOf[ArrayType]) {
+fsRelation.sparkSession.sessionState.conf.parquetArrayTypeLength
+  } else if (dataType.isInstanceOf[MapType]) {
+fsRelation.sparkSession.sessionState.conf.parquetMapTypeLength
+  } else {
+dataType.defaultSize
+  }
+}
+
+val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_))
+  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
+val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
+  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
--- End diff --

The type-based estimation is very rough. This still makes it hard for end 
users to decide the initial size. 


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22119
  
For details, see the discussion in the JIRA 
https://issues.apache.org/jira/browse/SPARK-24924


---




[GitHub] spark issue #22111: [SPARK-25123][SQL] Use Block to track code in SimpleExpr...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22111
  
Let us hold these codegen PRs until we see the design doc for building an IR 
for the codegen.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
We are fully swamped by the hotfixes and regressions of the 2.3 release and the 
new features targeting 2.4. We should have posted some comments in this PR 
earlier. 

Designing an IR for our codegen is the right thing to do. [If you do 
not agree on this, we can discuss it.] How to design an IR is a 
challenging task. The whole community is welcome to submit designs and PRs. 
Everyone can show their ideas; the best idea will win. @HyukjinKwon If you have 
the bandwidth, please also have a try.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
To Spark users, introducing AnalysisBarrier was a disaster. To the developers 
of Spark internals, however, it is just a bug. If you served customers who are 
heavily using Spark, you would understand what I am talking about. It is even 
hard to debug when the Spark jobs are very complex. 

Normally, we never commit/merge any PR that is useless, especially when the 
PR changes are not tiny. Reverting these PRs is also very painful. That is why 
Reynold took a few days to finish it. It was not a fun job for him to rewrite it.

Based on the current work, I can expect hundreds of PRs will be submitted 
for changing the codegen templates and polishing the current code. The reason 
why we did not merge this PR is that we doubt this is the right thing to do. 
@rednaxelafx 

I am not saying @viirya and @mgaido91 did a bad job in submitting many PRs to 
improve the existing one. However, we need to think about the fundamental 
problems we are solving in the codegen. Instead of reinventing a compiler, how 
about letting a compiler-internals expert (in our community, we have @kiszk) 
lead the effort and offer a design for this? Coding and designing are different 
issues. If possible, we need to find the best person to drive it. If @viirya 
and @mgaido91 think they are familiar with compiler internals, I am also glad 
to see their designs.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
I am fine with not reverting it since it is too late; so many related PRs have 
been merged. But we need to seriously consider writing and reviewing design 
docs before changing the code generation framework of Spark SQL.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
AnalysisBarrier does not introduce a behavior change. However, it requires 
that our analyzer rules be idempotent. The most recent correctness bug also 
shows another big potential hole: 
https://issues.apache.org/jira/browse/SPARK-25051. It could break analyzer 
rules without notice, since AnalysisBarrier is a leaf node. Basically, this was 
a disaster for the Spark 2.3 release; you can count how many regressions we 
found after the release. This is bad. Unfortunately, I made the merge without 
enough consideration. If possible, I would revert that PR, but it is too late. 

Now, we are facing a similar issue again. We should stop continuing in this 
direction without a proper design and review. Designing/enhancing the codegen 
framework requires more input from the experts in this area. I do not think 
the current design has been well discussed, even if I saw some discussion in 
the initial PR. I am not an expert in this area; that is why I pinged @kiszk to 
drive this. 

BTW, we can change the internal APIs easily, but the design needs a review, 
especially for the core components of Spark SQL. 


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
This is not about credit. I think we need to introduce an IR, like what a 
compiler does, instead of continuously improving the existing one, which is 
already very hacky and hard to debug and maintain. We are wasting effort and 
resources if we follow the current direction. 


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
This is bad. The design needs to be carefully reviewed before implementing 
it. Basically, we are breaking the basic principles of software engineering. It 
is very strange to write the design doc after introducing such a fundamental 
API change. 

I think we need more expertise from existing compiler-internals experts. 
That is why I proposed letting @kiszk lead this. We are not inventing 
anything new in our codegen framework. 


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
@kiszk You are a JVM expert with a very strong IR background. Could you 
lead the efforts and drive the IR design?


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
Another example is AnalysisBarrier, which became a disaster for the Spark 
2.3 release. Many blockers, correctness bugs, and performance regressions were 
caused by it. Thus, I think we should revert this PR until we have a 
high-level design doc and roadmap for the codegen, for example, introducing an IR. 


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
Based on this PR, many changes will be made in the codegen. The codegen 
is very fundamental to Spark SQL. I do not think we should merge this PR at 
this stage. To be more disciplined, we need a design doc for these API-level 
changes in the codegen.


---




[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22007
  
The bump is fine, but this is not for making it congruent with Stocator, 
which is just an external connector. 


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17185
  
Let us discuss it in the JIRA 
https://issues.apache.org/jira/browse/SPARK-25121


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17185
  
@skambha @dilipbiswal How about the hint resolution after supporting 
multi-part names?


---




[GitHub] spark issue #17400: [SPARK-19981][SQL] Update output partitioning info. when...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17400
  
@maropu We should fix it. How about doing it after the code freeze? So far, 
everyone is swamped with other tasks. 


---




[GitHub] spark issue #22107: [SPARK-25117] Add EXEPT ALL and INTERSECT ALL support in...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22107
  
cc @felixcheung 


---




[GitHub] spark pull request #22102: [SPARK-25051][SQL] FixNullability should not stop...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22102#discussion_r210053264
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1704,6 +1704,7 @@ class Analyzer(
 
 def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case p if !p.resolved => p // Skip unresolved nodes.
+  case ab: AnalysisBarrier => apply(ab.child)
--- End diff --

AnalysisBarrier is a leaf node, so we still need to apply this rule to its 
child. A better fix would be to fix the rule itself; however, for backporting 
purposes, the risk of this fix is low.
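A toy sketch of the situation being discussed (plain Python with hypothetical class names, not Spark's actual Analyzer): because the barrier is a leaf, a bottom-up transform never reaches the plan it wraps, and the one-line fix is to re-apply the rule to the barrier's child.

```python
# Minimal model of a bottom-up tree rewrite ("transformUp") where a leaf
# barrier node hides its child from the rule unless the rule recurses into it.
class Node:
    def __init__(self, name, children=(), resolved=True):
        self.name, self.children, self.resolved = name, list(children), resolved

class AnalysisBarrier(Node):
    def __init__(self, child):
        super().__init__("AnalysisBarrier", [])  # leaf: child is not in `children`
        self.child = child

def fix_nullability(plan):
    # transformUp analogue: rewrite children first, then the node itself
    plan.children = [fix_nullability(c) for c in plan.children]
    if not plan.resolved:
        return plan                               # skip unresolved nodes
    if isinstance(plan, AnalysisBarrier):
        return fix_nullability(plan.child)        # the fix: recurse past the barrier
    if plan.name == "Join":
        plan.name = "Join(fixedNullability)"      # stand-in for the real rewrite
    return plan

inner = Node("Join")
fixed = fix_nullability(AnalysisBarrier(inner))
print(fixed.name)  # Join(fixedNullability)
```

Without the `isinstance(plan, AnalysisBarrier)` case, the rule would return the barrier untouched and the inner join would never be rewritten.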


---




[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21123
  
@HyukjinKwon @rdblue V2 data source APIs should not break FileFormat V1 
compatibility, IMO. Based on my experience, it is not the right thing to force 
users to change their Spark applications in order to use data sources through 
the V2 APIs. 


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22104
  
@icexelloss Do we face the same issue for DataSourceStrategy?


---




[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22104#discussion_r210044089
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3367,6 +3367,24 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+def test_datasource_with_udf_filter_lit_input(self):
--- End diff --

Add another test case for arrow-based pandas udf?


---




[GitHub] spark pull request #22101: [SPARK-25114][Core] Fix RecordBinaryComparator wh...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22101#discussion_r210043412
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
 ---
@@ -27,7 +27,6 @@
   public int compare(
--- End diff --

Let us add a test suite for this function, as you mentioned on line 25. 
: )
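A minimal illustration (plain Python, not the actual Java test suite) of the failure mode such a suite should cover: a comparator built on 64-bit subtraction can report the wrong ordering when the difference overflows, while a direct comparison cannot.

```python
# Toy illustration of why subtraction-based binary comparators are unsafe:
# interpreting the sign bit of the 64-bit difference flips the answer when
# the (unsigned) operands are more than 2**63 apart.
MASK = (1 << 64) - 1  # emulate 64-bit wraparound

def compare_by_subtraction(a, b):
    d = (a - b) & MASK
    return 0 if d == 0 else (-1 if d >> 63 else 1)  # sign bit of the difference

def compare_directly(a, b):
    return (a > b) - (a < b)  # a, b are non-negative 64-bit words

big, small = (1 << 63) + 1, 1
assert compare_directly(big, small) == 1
assert compare_by_subtraction(big, small) == -1  # wrong: overflow flipped the sign
```

A test suite for the real comparator would want cases exactly like `big`/`small` above, where the unsigned difference exceeds `2**63`.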


---




[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22102
  
@mgaido91 The PR is merged. Could you close it?


---




[GitHub] spark pull request #22102: [SPARK-25051][SQL] FixNullability should not stop...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22102#discussion_r210036342
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2300,4 +2300,10 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   checkAnswer(aggPlusFilter1, aggPlusFilter2.collect())
 }
   }
+
+  test("SPARK-25051: fix nullabilities of outer join attributes doesn't 
stop on AnalysisBarrier") {
+val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name")
+val df2 = spark.range(3).selectExpr("id")
+assert(df1.join(df2, Seq("id"), 
"left_outer").where(df2("id").isNull).collect().length == 1)
--- End diff --

Just general suggestion. We should compare the results


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22102
  
Let me confirm it and then will merge it to 2.3 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...

2018-08-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22102
  
This might not be the last one. Let us backport the fix of @maryannxue ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs

2018-08-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22082
  
@dbtsai Nope. I did not hit any issue. :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
@BryanCutler @shaneknapp Thanks for your work!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-08-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21439
  
LGTM 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21439: [SPARK-24391][SQL] Support arrays of any types by...

2018-08-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21439#discussion_r209461516
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql 
---
@@ -39,3 +39,8 @@ select from_json('{"a":1, "b":"2"}', 
'struct');
 -- infer schema of json literal
 select schema_of_json('{"c1":0, "c2":[1]}');
 select from_json('{"c1":[1, 2, 3]}', schema_of_json('{"c1":[0]}'));
+
+-- from_json - array type
+select from_json('[1, 2, 3]', 'array');
--- End diff --

Add more cases ?
select from_json('[3, null, 4]', 'array')
select from_json('[3, "str", 4]', 'array')


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21439: [SPARK-24391][SQL] Support arrays of any types by...

2018-08-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21439#discussion_r209461334
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -101,6 +102,21 @@ class JacksonParser(
 }
   }
 
+  private def makeArrayRootConverter(at: ArrayType): JsonParser => 
Seq[InternalRow] = {
+val elemConverter = makeConverter(at.elementType)
+(parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) {
+  case START_ARRAY => Seq(InternalRow(convertArray(parser, 
elemConverter)))
+  case START_OBJECT if at.elementType.isInstanceOf[StructType] =>
+// This handles the case when an input JSON object is a structure 
but
+// the specified schema is an array of structures. In that case, 
the input JSON is
--- End diff --

Could you add an example here, like what we did in `makeStructRootConverter 
`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22082
  
cc @dbtsai @srowen 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs

2018-08-11 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/22082

[SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs

## What changes were proposed in this pull request?
Use ASM 6 APIs after we upgrading it to ASM6.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark asm6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22082.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22082


commit 266650006ed1f5d19d6eaf24d7058ca341457039
Author: Xiao Li 
Date:   2018-08-12T04:13:20Z

use asm6 apis




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22079
  
The original fix 
https://github.com/apache/spark/pull/22079/commits/efccc028bce64bf4754ce81ee16533c19b4384b2
 has been merged to Spark 2.3. After 5+ months, we have not received any 
correctness regression that is caused by this fix, although this fix definitely 
will introduce a performance regression. I think we should merge it. The 
general risk is not very high and it resolves a serious correctness bug. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22079
  
cc @jiangxb1987 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22072
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22072
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on mult...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22077
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on mult...

2018-08-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22077
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21977
  
Why not using `resource.RLIMIT_RSS`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21889
  
retest this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
retest this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r209077023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
+.doc("The maximum number of buckets allowed. Defaults to 10")
+.intConf
--- End diff --

`.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets 
must be larger than 0")`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r209076282
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
--- End diff --

Make it consistent with `spark.sql.sources.bucketing.enabled`? rename it to 
`spark.sql.sources.bucketing.maxBuckets`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22060: [DO NOT MERGE][TEST ONLY] Add once-policy rule check

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22060
  
`AliasViewChild` is the only rule? Can we whitelist it first? It sounds 
like many tests are skipped. 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21889
  
retest this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
retest this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21608
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21889
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22049: [SPARK-25063][SQL] Rename class KnowNotNull to KnownNotN...

2018-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22049
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21994: [SPARK-24529][Build][test-maven][follow-up] Add s...

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21994#discussion_r20881
  
--- Diff: pom.xml ---
@@ -2609,6 +2609,28 @@
   
 
   
+  
+com.github.spotbugs
+spotbugs-maven-plugin
--- End diff --

Adding spotbugs slows down my local build a lot. Can we hold this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
Really thank you for your help! @shaneknapp 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21889
  
I hit the following error in my local environment. 
```
sbt.ForkMain$ForkError: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 220.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 220.0 (TID 465, localhost, executor driver): 
java.lang.IllegalArgumentException: Length -67059888 and offset 
140049531604288must be non-negative
at 
org.apache.spark.unsafe.memory.MemoryBlock.(MemoryBlock.java:64)
at 
org.apache.spark.unsafe.memory.OffHeapMemoryBlock.(OffHeapMemoryBlock.java:26)
at 
org.apache.spark.sql.execution.vectorized.OffHeapColumnVector.getBytesAsUTF8String(OffHeapColumnVector.java:221)
at 
org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:382)
at 
org.apache.spark.sql.vectorized.ColumnarArray.getUTF8String(ColumnarArray.java:127)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:130)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:406)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```

Could you turn on the flag in the PR? I want to trigger the tests multiple 
times in the PR? @ajacques 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
@shaneknapp That is great! I think making Jenkins in a quite mode looks 
fine, as long as we send out a note to the dev list. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  

https://lists.apache.org/thread.html/5b0836e44f9386fae2f99deed0a01441c699040c991d833faf520357@%3Cdev.arrow.apache.org%3E

Arrow 0.10.0 release is officially announced. @shaneknapp Could you help 
pull the trigger on the update to arrow?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
We need to upgrade it to 0.10.0 in this release, if possible. It resolves 
some bugs, e.g., https://issues.apache.org/jira/browse/ARROW-1973. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20611#discussion_r208412882
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1976,6 +1976,49 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def makeQualified(defaultUri: URI, workingDir: Path, path: Path): Path = 
{
--- End diff --

This is a utility function. We normally need to add function comments and 
param description. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21596
  
@jerryshao This is for 3.0


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21608: [SPARK-24626] [SQL] Improve location size calcula...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21608#discussion_r208408858
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -49,4 +51,11 @@ object DataSourceUtils {
   }
 }
   }
+
+  // SPARK-15895: Metadata files (e.g. Parquet summary files) and 
temporary files should not be
+  // counted as data files, so that they shouldn't participate partition 
discovery.
+  private[sql] def isDataPath(path: Path): Boolean = {
+val name = path.getName
+!((name.startsWith("_") && !name.contains("=")) || 
name.startsWith("."))
--- End diff --

Not sure what is your earlier impl. I would prefer to keeping unchanged the 
original code in `PartitioningAwareFileIndex.scala`. Just add a utility 
function `isDataPath ` in CommandUtils.scala. Does this sound good to you?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22028: [SPARK-25046][SQL] Fix Alter View can excute sql like "A...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22028
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21977
  
cc @jiangxb1987 @cloud-fan @jerryshao @vanzin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21977
  
@rdblue Is this for YARN only?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21977
  
test cases?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22006: [SPARK-25031][SQL] Fix MapType schema print

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22006
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22028: [SPARK-25046][SQL] Fix Alter View can excute sql like "A...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22028
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22030: [SPARK-25048][SQL] Pivoting by multiple columns i...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22030#discussion_r208326382
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql](
*
* {{{
*   // Compute the sum of earnings for each year by course with each 
course as a separate column
-   *   df.groupBy($"year").pivot($"course", Seq("dotNET", 
"Java")).sum($"earnings")
+   *   df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), 
lit("Java"))).sum($"earnings")
+   * }}}
+   *
+   * For pivoting by multiple columns, use the `struct` function to 
combine the columns and values:
+   *
+   * {{{
+   *   df
+   * .groupBy($"year")
+   * .pivot(struct($"course", $"training"), Seq(struct(lit("java"), 
lit("Experts"
+   * .agg(sum($"earnings"))
* }}}
*
* @param pivotColumn the column to pivot.
* @param values List of values that will be translated to columns in 
the output DataFrame.
* @since 2.4.0
*/
-  def pivot(pivotColumn: Column, values: Seq[Any]): 
RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Column]): 
RelationalGroupedDataset = {
--- End diff --

@HyukjinKwon I think this change is better than what 
https://github.com/apache/spark/pull/21699 did. 




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21608: [SPARK-24626] [SQL] Improve location size calcula...

2018-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21608#discussion_r208269963
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -49,4 +51,11 @@ object DataSourceUtils {
   }
 }
   }
+
+  // SPARK-15895: Metadata files (e.g. Parquet summary files) and 
temporary files should not be
+  // counted as data files, so that they shouldn't participate partition 
discovery.
+  private[sql] def isDataPath(path: Path): Boolean = {
+val name = path.getName
+!((name.startsWith("_") && !name.contains("=")) || 
name.startsWith("."))
--- End diff --

Let us do not use the same one with `PartitioningAwareFileIndex.scala`. In 
this case, the data file is not related to the following condition 
`name.contains("=")`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21970: [SPARK-24996][SQL] Use DSL in DeclarativeAggregate

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21970
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22012: [SPARK-25036][SQL] Should compare ExprValue.isNull with ...

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22012
  
Thanks! Merged to master. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
It sounds like the vote can pass soon. 
https://lists.apache.org/thread.html/9900da1540be5aafce27691fd40395bb53f465302db29979c154d99a@%3Cdev.arrow.apache.org%3E


---




[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
To get this in, we might need to delay the code freeze. Can you reply to the dev list email http://apache-spark-developers-list.1001551.n3.nabble.com/code-freeze-and-branch-cut-for-Apache-Spark-2-4-td24365.html ?


---




[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21939
  
After the code freeze, dependency changes are not allowed. Hopefully we can get this in before then.


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r207850329
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2225,19 +2225,21 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
 
   test("SPARK-23723: specified encoding is not matched to actual encoding") {
-    val fileName = "test-data/utf16LE.json"
-    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
-    val exception = intercept[SparkException] {
-      spark.read.schema(schema)
-        .option("mode", "FAILFAST")
-        .option("multiline", "true")
-        .options(Map("encoding" -> "UTF-16BE"))
-        .json(testFile(fileName))
-        .count()
-    }
-    val errMsg = exception.getMessage
+    withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> "false") {
--- End diff --

How about CSV? Could you add the same test there too?

Also, we need to add verification logic for when the conf is true:
```
Seq(true, false).foreach { optimizeEmptySchema =>
  withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key ->
      optimizeEmptySchema.toString) {
    ...
  }
}
```
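The pattern suggested above (running the same test body under both values of a boolean conf, with the conf restored afterwards) can be sketched self-contained. Here `withConf` is a stand-in for Spark's `withSQLConf`, and the key name is hypothetical:

```scala
// Sketch of the "exercise both conf values" test pattern. `withConf` sets a
// key in a plain mutable map, runs the body, and restores the previous value,
// approximating what Spark's withSQLConf does for SQLConf.
def withConf[T](conf: collection.mutable.Map[String, String])
               (key: String, value: String)(body: => T): T = {
  val old = conf.get(key)
  conf(key) = value
  try body finally old match {
    case Some(v) => conf(key) = v
    case None    => conf -= key
  }
}

val conf = collection.mutable.Map.empty[String, String]
val seen = collection.mutable.ListBuffer.empty[String]
Seq(true, false).foreach { flag =>
  withConf(conf)("bypassParserForEmptySchema", flag.toString) {
    seen += conf("bypassParserForEmptySchema") // body runs under each setting
  }
}
// Both settings were exercised, and the conf is restored afterwards:
assert(seen.toList == List("true", "false"))
assert(!conf.contains("bypassParserForEmptySchema"))
```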


---




[GitHub] spark issue #21970: [SPARK-24996][SQL] Use DSL in DeclarativeAggregate

2018-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21970
  
LGTM pending Jenkins


---



