[GitHub] spark issue #22160: Revert "[SPARK-24418][BUILD] Upgrade Scala to 2.11.12 an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22160 @dbtsai Have you tried running it with Scala 2.12? We can still do the upgrade after the Apache 2.4 RC. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 @mallman Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22160: Revert "[SPARK-24418][BUILD] Upgrade Scala to 2.11.12 an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22160 Thank you, @dbtsai! Let us see whether this can pass all the tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 Try this when `spark.sql.nestedSchemaPruning.enabled` is on:
```SQL
withTable("t1") {
  spark.sql(
    """
      |Create table t1 (`id` INT, `CoL1` STRING,
      |`coL2` STRUCT<`a`: TIMESTAMP, `b`: INT, `c`: ARRAY<STRING>>,
      |`col3` INT,
      |`col4` INT,
      |`Col5` STRUCT<`a`: STRING, `b`: STRUCT<`a`: INT, `b`: ARRAY<STRING>, `c`: INT>, `c`: INT>)
      |USING parquet
    """.stripMargin)
  spark.sql("SELECT id from t1 where col2.b < 7").count()
}
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
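[Editor's note] A quick way to confirm the pruning actually happened, as a sketch; inspecting `FileSourceScanExec.requiredSchema` and the expected pruned schema are my assumptions, not part of the review comment:
```scala
// Collect the read schema of the Parquet scan from the executed plan; with
// nested schema pruning on, only coL2.b should survive for this query.
import org.apache.spark.sql.execution.FileSourceScanExec

val df = spark.sql("SELECT id from t1 where col2.b < 7")
val readSchemas = df.queryExecution.executedPlan.collect {
  case scan: FileSourceScanExec => scan.requiredSchema.catalogString
}
// Expect something like struct<id:int,coL2:struct<b:int>>, not the full schema.
readSchemas.foreach(println)
```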
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211287978 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) } + test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { --- End diff -- We also need a test case checking that `enforceSchema` works correctly when column pruning is on. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
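[Editor's note] A rough sketch of such a test, assuming the usual `CSVSuite` fixtures (`withTempPath`, the standard type imports) and that a header/schema mismatch with `enforceSchema=false` surfaces as an exception; the exact exception type is an assumption:
```scala
test("enforceSchema=false with column pruning (sketch)") {
  withTempPath { path =>
    // Header "columnA,columnB" does not match the column names in the schema below.
    java.nio.file.Files.write(path.toPath, "columnA,columnB\n1.0,a".getBytes)
    val schema = new StructType().add("colA", DoubleType).add("colB", StringType)
    val e = intercept[Exception] {
      spark.read
        .schema(schema)
        .option("header", "true")
        .option("enforceSchema", "false")
        .csv(path.getCanonicalPath)
        .select("colA") // the projection triggers column pruning
        .collect()
    }
    assert(e.getMessage.contains("CSV header does not conform to the schema"))
  }
}
```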
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211283824 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -227,10 +210,9 @@ object TextInputCSVDataSource extends CSVDataSource { // Note: if there are only comments in the first block, the header would probably // be not extracted. CSVUtils.extractHeader(lines, parser.options).foreach { header => -CSVDataSource.checkHeader( - header, - parser.tokenizer, - dataSchema, +CSVDataSource.checkHeaderColumnNames( + if (columnPruning) requiredSchema else dataSchema, + parser.tokenizer.parseLine(header), --- End diff -- Nit: the following code style is preferred.
```
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
...
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
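[Editor's note] Applied to the quoted call site, the nit would look roughly like this; the remaining arguments of `checkHeaderColumnNames` are not shown in the diff, so they are left as a comment:
```scala
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
CSVDataSource.checkHeaderColumnNames(
  schema,
  columnNames /* , remaining arguments unchanged */)
```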
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21909 LGTM. Thanks for patiently addressing all the comments! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22123 cc @MaxGekk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20226 @maropu Could you take this over? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211045699 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -223,7 +224,8 @@ object MultiLineJsonDataSource extends JsonDataSource { input => parser.parse[InputStream](input, streamParser, partitionedFileString), parser.options.parseMode, schema, - parser.options.columnNameOfCorruptRecord) + parser.options.columnNameOfCorruptRecord, + optimizeEmptySchema = false) --- End diff -- Could we rename `optimizeEmptySchema ` to `isMultiLine`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211045061 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1492,6 +1492,15 @@ object SQLConf { "This usually speeds up commands that need to list many directories.") .booleanConf .createWithDefault(true) + + val BYPASS_PARSER_FOR_EMPTY_SCHEMA = +buildConf("spark.sql.legacy.bypassParserForEmptySchema") --- End diff -- If no behavior change, do we still need this conf? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22121 We also need to document the extra enhancements that were added in this release, compared with the databricks/spark-avro package. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22121 We should do the same thing for the other native sources. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22121#discussion_r210970616 --- Diff: docs/avro-data-source-guide.md --- @@ -0,0 +1,267 @@ +--- +layout: global +title: Avro Data Source Guide +--- + +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data. --- End diff -- `support` -> `built-in support` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22134 Do we need this at the current stage? Regarding UX, it looks complex to end users: even I cannot remember these names, and it is very easy to provide a wrong class name.
```
"spark.sql.datasource.map.org.apache.spark.sql.avro" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.csv" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.avro" -> testDataSource.getCanonicalName)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
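[Editor's note] For context, a sketch of how the proposed mapping conf would be used; the key prefix comes from the quoted test, while the target class name here is only a plausible placeholder:
```scala
// Redirect reads that still use the old package name to the built-in source.
spark.conf.set(
  "spark.sql.datasource.map.com.databricks.spark.avro",
  "org.apache.spark.sql.avro.AvroFileFormat") // hypothetical target class

// With the mapping in place, the legacy name would resolve transparently.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/avro")
```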
[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22133#discussion_r210959550 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -637,6 +638,17 @@ object DataSource extends Logging { "Hive built-in ORC data source must be used with Hive support enabled. " + "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " + "'native'") +} else if (provider1.toLowerCase(Locale.ROOT) == "avro" || + provider1 == "com.databricks.spark.avro" || + provider1 == "org.apache.spark.sql.avro") { + throw new AnalysisException( +s"Failed to find data source: $provider1. Avro is built-in data source since " + +"Spark 2.4. Please deploy the application as per " + + s"$latestDocsURL/avro-data-source-guide.html#deploying") --- End diff -- We can update the message later. No need to be blocked by that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
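[Editor's note] For reference, the deployment story the quoted error message points to, as a sketch; the artifact coordinates are what I would expect for a Scala 2.11 build of Spark 2.4:
```scala
// Launch with the external avro module on the classpath, e.g.:
//   spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
// Then the built-in short name works:
val users = spark.read.format("avro").load("examples/src/main/resources/users.avro")
users.write.format("avro").save("users_copy")
```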
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 @kiszk Please create a JIRA and we can post more ideas there. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22121 @gengliangwang Could you also post the screenshot in your PR description? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 @kiszk The initial prototype or proof of concept can live in any personal branch. When we merge it to the master branch, we still need to separate it from the current codegen and make it configurable; after the release, users can choose which one to use. When the new IR is stable, we can then consider deprecating the current one. This is mainly for product stability. We need to follow a similar principle for any big project. @viirya @mgaido91 Let us first focus on the new IR design and prototype. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r210767018 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { checkAnswer(jsonDF, Seq(Row("Chris", "Baird"))) } - test("SPARK-23723: specified encoding is not matched to actual encoding") { -val fileName = "test-data/utf16LE.json" -val schema = new StructType().add("firstName", StringType).add("lastName", StringType) -val exception = intercept[SparkException] { - spark.read.schema(schema) -.option("mode", "FAILFAST") -.option("multiline", "true") -.options(Map("encoding" -> "UTF-16BE")) -.json(testFile(fileName)) -.count() +def doCount(bypassParser: Boolean, multiLine: Boolean): Long = { + var result: Long = -1 + withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) { +val fileName = "test-data/utf16LE.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +result = spark.read.schema(schema) + .option("mode", "FAILFAST") --- End diff -- This sounds good! Let us enable it only when PERMISSIVE is on. You know, our default mode is PERMISSIVE. This should benefit most users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
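[Editor's note] A sketch of the suggested PERMISSIVE-only guard; the option/field names are assumptions based on the surrounding diff:
```scala
import org.apache.spark.sql.catalyst.util.PermissiveMode

// Skip invoking the underlying parser only when nothing is projected and the
// mode is PERMISSIVE: in that mode no row can be dropped (DROPMALFORMED) or
// turned into an error (FAILFAST), so count() results are unchanged.
val bypassParser =
  requiredSchema.isEmpty && parser.options.parseMode == PermissiveMode
```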
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r210765672 --- Diff: docs/sql-programming-guide.md --- @@ -1894,6 +1894,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. + - Since Spark 2.4, text-based datasources like CSV and JSON don't parse input lines if the required schema pushed down to the datasources is empty. The schema can be empty in the case of the count() action. For example, Spark 2.3 and earlier versions failed on JSON files with invalid encoding but Spark 2.4 returns total number of lines in the file. To restore the previous behavior when the underlying parser is always invoked even for the empty schema, set `true` to `spark.sql.legacy.bypassParserForEmptySchema`. This option will be removed in Spark 3.0. --- End diff -- Is it right based on what you said https://github.com/apache/spark/pull/21909#discussion_r210704902? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22108: [SPARK-25092][SQL][FOLLOWUP] Add RewriteCorrelatedScalar...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22108 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r210693829 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { checkAnswer(jsonDF, Seq(Row("Chris", "Baird"))) } - test("SPARK-23723: specified encoding is not matched to actual encoding") { -val fileName = "test-data/utf16LE.json" -val schema = new StructType().add("firstName", StringType).add("lastName", StringType) -val exception = intercept[SparkException] { - spark.read.schema(schema) -.option("mode", "FAILFAST") -.option("multiline", "true") -.options(Map("encoding" -> "UTF-16BE")) -.json(testFile(fileName)) -.count() +def doCount(bypassParser: Boolean, multiLine: Boolean): Long = { + var result: Long = -1 + withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) { +val fileName = "test-data/utf16LE.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +result = spark.read.schema(schema) + .option("mode", "FAILFAST") --- End diff -- Does the mode matter? What happens if users use `DROPMALFORMED` before this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r210666117 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1492,6 +1492,15 @@ object SQLConf { "This usually speeds up commands that need to list many directories.") .booleanConf .createWithDefault(true) + + val BYPASS_PARSER_FOR_EMPTY_SCHEMA = +buildConf("spark.sql.legacy.bypassParserForEmptySchema") + .doc("If required schema passed to a text datasource is empty, the parameter controls " + +"invocation of underlying parser. For example, if it is set to false, uniVocity parser " + +"is invoke by CSV datasource or Jackson parser by JSON datasource. By default, it is set " + --- End diff -- `invoke` -> `invoked` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 @HyukjinKwon I am worried about the design mixing the s"" and code"" representations: when the design is not good, it is hard to maintain and to build new code on top of it. Let @cloud-fan decide whether we should revert it or not. @mgaido91 The doc you posted is one of the problems the IR can resolve. However, we need high-level and low-level design docs for building a new IR in codegen. @kiszk Thanks for leading this effort! You are the best candidate in the community to lead this. @viirya Thanks for your professional reply! Have a good rest. A general suggestion: to avoid introducing regressions, how about implementing a new one without changing the existing one? We can easily switch to the new one when it becomes stable. Thanks all for making our codegen better! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210494497 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -425,12 +426,44 @@ case class FileSourceScanExec( fsRelation: HadoopFsRelation): RDD[InternalRow] = { val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes -val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes +var openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum val bytesPerCore = totalBytes / defaultParallelism -val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) +var maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) + + if(fsRelation.sparkSession.sessionState.conf.isParquetSizeAdaptiveEnabled && + (fsRelation.fileFormat.isInstanceOf[ParquetSource] || +fsRelation.fileFormat.isInstanceOf[OrcFileFormat])) { + if (relation.dataSchema.map(_.dataType).forall(dataType => +dataType.isInstanceOf[CalendarIntervalType] || dataType.isInstanceOf[StructType] + || dataType.isInstanceOf[MapType] || dataType.isInstanceOf[NullType] + || dataType.isInstanceOf[AtomicType] || dataType.isInstanceOf[ArrayType])) { + +def getTypeLength(dataType: DataType): Int = { + if (dataType.isInstanceOf[StructType]) { + fsRelation.sparkSession.sessionState.conf.parquetStructTypeLength + } else if (dataType.isInstanceOf[ArrayType]) { + fsRelation.sparkSession.sessionState.conf.parquetArrayTypeLength + } else if (dataType.isInstanceOf[MapType]) { +fsRelation.sparkSession.sessionState.conf.parquetMapTypeLength + } else { +dataType.defaultSize + } +} + +val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_)) + .reduceOption(_ + _).getOrElse(StringType.defaultSize) +val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_)) + .reduceOption(_ + _).getOrElse(StringType.defaultSize) --- End diff -- The type-based estimation is very rough, and it is still hard for end users to decide the initial size. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
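[Editor's note] Tangentially, the quoted helper reads more idiomatically as a pattern match; a behavior-preserving sketch, with `conf` standing for `fsRelation.sparkSession.sessionState.conf`:
```scala
def getTypeLength(dataType: DataType): Int = dataType match {
  case _: StructType => conf.parquetStructTypeLength
  case _: ArrayType  => conf.parquetArrayTypeLength
  case _: MapType    => conf.parquetMapTypeLength
  case other         => other.defaultSize
}
```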
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22119 For details, see the discussion in the JIRA https://issues.apache.org/jira/browse/SPARK-24924 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22111: [SPARK-25123][SQL] Use Block to track code in SimpleExpr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22111 Let us hold these codegen PRs until we see the design doc for building an IR for codegen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 We are fully swamped by the hotfixes and regressions of the 2.3 release and the new features targeting 2.4; we should have posted comments on this PR earlier. Designing an IR for our codegen is the right thing to do. [If you do not agree, we can discuss it.] How to design an IR is a challenging task. The whole community is welcome to submit designs and PRs; everyone can show their ideas, and the best idea will win. @HyukjinKwon If you have bandwidth, please also have a try --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 To Spark users, introducing AnalysisBarrier was a disaster; to the developers of Spark internals, however, it is just a bug. If you served customers who are heavily using Spark, you would understand what I am talking about. It is even harder to debug when the Spark jobs are very complex. Normally, we never commit/merge any PR that is useless, especially when the changes are not tiny. Reverting these PRs is also very painful; that is why Reynold took a few days to finish it. It was not a fun job for him to rewrite it. Based on the current work, I can expect hundreds of PRs to be submitted for changing the codegen templates and polishing the current code. The reason we did not merge this PR is that we doubt this is the right thing to do. @rednaxelafx I am not saying @viirya and @mgaido91 did a bad job submitting many PRs to improve the existing one. However, we need to think about the fundamental problems we are solving in codegen. Instead of reinventing a compiler, how about letting the compiler-internals expert in our community (we have @kiszk) lead the effort and offer a design for this? Coding and designing are different issues. If possible, we need to find the best person to drive it. If @viirya and @mgaido91 think they are familiar with compiler internals, I am also glad to see their designs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 I am fine with not reverting it, since it is too late: so many related PRs have been merged. However, we need to seriously consider writing and reviewing design docs before changing the code generation framework of Spark SQL. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 AnalysisBarrier does not introduce a behavior change. However, it requires that our analyzer rules be idempotent. The most recent correctness bug also shows another big potential hole https://issues.apache.org/jira/browse/SPARK-25051. It could break Analyzer rules without notice, since AnalysisBarrier is a leaf node. Basically, this was a disaster for the Spark 2.3 release; you can count how many regressions we found after the release. This is bad. Unfortunately, I made the merge without enough consideration. If possible, I would revert that PR, but it is too late. Now, we are facing a similar issue again. We should stop continuing in this direction without a proper design and review. Designing/enhancing the codegen framework requires more input from experts in this area. I do not think the current design was well discussed, even though I saw some discussions in the initial PR. I am not an expert in this area; that is why I pinged @kiszk to drive this. BTW, we can change internal APIs easily, but the design needs a review, especially for the core components of Spark SQL. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 This is not about credit. I think we need to introduce an IR, like what compilers do, instead of continuously improving the existing framework, which is already very hacky and hard to debug and maintain. We are wasting effort and resources if we follow the current direction. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 This is bad. The design needs to be carefully reviewed before implementing it. Basically, we are breaking basic principles of software engineering; it is very strange to write the design doc after introducing such a fundamental API change. I think we need more expertise from existing compiler-internals experts, which is why I proposed letting @kiszk lead this. We are not inventing anything new in our codegen framework. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 @kiszk You are a JVM expert with a very strong IR background. Could you lead the efforts and drive the IR design? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 Another example is AnalysisBarrier, which became a disaster for the Spark 2.3 release: many blockers, correctness bugs, and performance regressions were caused by it. Thus, I think we should revert this PR until we have a high-level design doc and a roadmap for codegen, for example, introducing an IR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 Based on this PR, many more changes will be made in codegen, and codegen is fundamental to Spark SQL. I do not think we should merge this PR at this stage. To be more disciplined, we need a design doc for these API-level changes in codegen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22007 The bump is fine, but it should not be for making Spark congruent with Stocator, which is just an external connector. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17185 Let us discuss it in the JIRA https://issues.apache.org/jira/browse/SPARK-25121 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17185 @skambha @dilipbiswal How about hint resolution after supporting multi-part names? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17400: [SPARK-19981][SQL] Update output partitioning info. when...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17400 @maropu We should fix it. How about doing it after the code freeze? So far, everyone is swamped with different tasks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22107: [SPARK-25117] Add EXEPT ALL and INTERSECT ALL support in...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22107 cc @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22102: [SPARK-25051][SQL] FixNullability should not stop...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22102#discussion_r210053264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1704,6 +1704,7 @@ class Analyzer( def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case p if !p.resolved => p // Skip unresolved nodes. + case ab: AnalysisBarrier => apply(ab.child) --- End diff -- AnalysisBarrier is a leaf node, so we still need to apply this rule to its child explicitly. A better fix would be to fix the rule itself; however, for backporting, the risk of this fix is low. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
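[Editor's note] Why the extra case is needed, as a sketch: `AnalysisBarrier` is a leaf wrapper, so tree transformations do not descend into the plan it hides. A simplified shape (the real class lives in the analysis package):
```scala
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode {
  override def output: Seq[Attribute] = child.output
}

// plan.transformUp { rule } visits the barrier itself but never `child`,
// because LeafNode reports no children; hence the explicit recursion:
//   case ab: AnalysisBarrier => apply(ab.child)
```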
[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21123 @HyukjinKwon @rdblue V2 data source APIs should not break FileFormat V1 compatibility, IMO. Based on my experience, it is not right to force users to change their Spark applications in order to use data sources written against the V2 APIs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22104 @icexelloss Do we face the same issue for DataSourceStrategy? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22104#discussion_r210044089 --- Diff: python/pyspark/sql/tests.py --- @@ -3367,6 +3367,24 @@ def test_ignore_column_of_all_nulls(self): finally: shutil.rmtree(path) +def test_datasource_with_udf_filter_lit_input(self): --- End diff -- Add another test case for arrow-based pandas udf? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22101: [SPARK-25114][Core] Fix RecordBinaryComparator wh...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r210043412 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -27,7 +27,6 @@ public int compare( --- End diff -- Let us add a test suite for this function, as you mentioned in line 25. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22102 @mgaido91 The PR is merged. Could you close it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22102: [SPARK-25051][SQL] FixNullability should not stop...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22102#discussion_r210036342 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2300,4 +2300,10 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { checkAnswer(aggPlusFilter1, aggPlusFilter2.collect()) } } + + test("SPARK-25051: fix nullabilities of outer join attributes doesn't stop on AnalysisBarrier") { +val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name") +val df2 = spark.range(3).selectExpr("id") +assert(df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).collect().length == 1) --- End diff -- Just a general suggestion: we should compare the actual results, not only the number of rows. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
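[Editor's note] What comparing the results could look like here, as a sketch assuming `QueryTest.checkAnswer` and `org.apache.spark.sql.Row`; the expected row follows from the two ranges (id 3 exists only in `df1`, with name "3"):
```scala
checkAnswer(
  df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull),
  Row(3, "3") :: Nil)
```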
[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22102 Let me confirm it, and then I will merge it to 2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22102: [SPARK-25051][SQL] FixNullability should not stop on Ana...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22102 This might not be the last one. Shall we backport @maryannxue's fix as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22082 @dbtsai Nope. I did not hit any issue. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 @BryanCutler @shaneknapp Thanks for your work! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21439 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21439: [SPARK-24391][SQL] Support arrays of any types by...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21439#discussion_r209461516 --- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql --- @@ -39,3 +39,8 @@ select from_json('{"a":1, "b":"2"}', 'struct'); -- infer schema of json literal select schema_of_json('{"c1":0, "c2":[1]}'); select from_json('{"c1":[1, 2, 3]}', schema_of_json('{"c1":[0]}')); + +-- from_json - array type +select from_json('[1, 2, 3]', 'array'); --- End diff -- Add more cases ? select from_json('[3, null, 4]', 'array') select from_json('[3, "str", 4]', 'array') --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
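[Editor's note] The same cases through the DataFrame API, as a sketch; the expected outcome for the mixed-type array is my reading of the default PERMISSIVE behavior, which is exactly what the extra tests would pin down:
```scala
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType}

val df = Seq("[3, null, 4]", """[3, "str", 4]""").toDF("json")
df.select(from_json($"json", ArrayType(IntegerType))).show()
// [3, null, 4]  -> an array with a null element
// [3, "str", 4] -> presumably null, since "str" cannot be read as an int
```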
[GitHub] spark pull request #21439: [SPARK-24391][SQL] Support arrays of any types by...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21439#discussion_r209461334 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -101,6 +102,21 @@ class JacksonParser( } } + private def makeArrayRootConverter(at: ArrayType): JsonParser => Seq[InternalRow] = { +val elemConverter = makeConverter(at.elementType) +(parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) { + case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter))) + case START_OBJECT if at.elementType.isInstanceOf[StructType] => +// This handles the case when an input JSON object is a structure but +// the specified schema is an array of structures. In that case, the input JSON is --- End diff -- Could you add an example here, like what we did in `makeStructRootConverter `? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22082 cc @dbtsai @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22082 [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs

## What changes were proposed in this pull request?

Use the ASM 6 APIs after upgrading to ASM 6.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark asm6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22082.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22082 commit 266650006ed1f5d19d6eaf24d7058ca341457039 Author: Xiao Li Date: 2018-08-12T04:13:20Z use asm6 apis --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22079 The original fix https://github.com/apache/spark/pull/22079/commits/efccc028bce64bf4754ce81ee16533c19b4384b2 was merged to Spark 2.3. After 5+ months, we have not received any report of a correctness regression caused by this fix, although it definitely introduces a performance regression. I think we should merge it: the overall risk is not very high, and it resolves a serious correctness bug. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22079 cc @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22072 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22072 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on mult...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22077 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on mult...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22077 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21977 Why not use `resource.RLIMIT_RSS`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21889 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r209077023 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -580,6 +580,11 @@ object SQLConf { .booleanConf .createWithDefault(true) + val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets") +.doc("The maximum number of buckets allowed. Defaults to 10") +.intConf --- End diff -- `.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r209076282 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -580,6 +580,11 @@ object SQLConf { .booleanConf .createWithDefault(true) + val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets") --- End diff -- Make it consistent with `spark.sql.sources.bucketing.enabled`? rename it to `spark.sql.sources.bucketing.maxBuckets`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
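[Editor's note] Both review suggestions combined, as a sketch; the default value is a placeholder, since the quoted doc string appears truncated:
```scala
val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.sources.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed.")
  .intConf
  .checkValue(_ > 0,
    "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
  .createWithDefault(100000) // placeholder default
```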
[GitHub] spark issue #22060: [DO NOT MERGE][TEST ONLY] Add once-policy rule check
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22060 Is `AliasViewChild` the only rule? Can we whitelist it first? It sounds like many tests are being skipped. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
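[Editor's note] The kind of check this test-only PR is after, as a minimal sketch: a rule that runs in a `Once` batch must be idempotent, i.e. applying it a second time must be a no-op:
```scala
def assertIdempotent(rule: Rule[LogicalPlan], plan: LogicalPlan): Unit = {
  val once  = rule(plan)
  val twice = rule(once)
  assert(twice == once, s"Once-policy rule ${rule.ruleName} is not idempotent")
}
```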
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21889 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21608 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21889 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22049: [SPARK-25063][SQL] Rename class KnowNotNull to KnownNotN...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22049 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21994: [SPARK-24529][Build][test-maven][follow-up] Add s...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21994#discussion_r20881 --- Diff: pom.xml --- @@ -2609,6 +2609,28 @@ + +com.github.spotbugs +spotbugs-maven-plugin --- End diff -- Adding spotbugs slows down my local build a lot. Can we hold this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 Really thank you for your help! @shaneknapp --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21889 I hit the following error in my local environment.
```
sbt.ForkMain$ForkError: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 220.0 failed 1 times, most recent failure: Lost task 0.0 in stage 220.0 (TID 465, localhost, executor driver): java.lang.IllegalArgumentException: Length -67059888 and offset 140049531604288 must be non-negative
  at org.apache.spark.unsafe.memory.MemoryBlock.<init>(MemoryBlock.java:64)
  at org.apache.spark.unsafe.memory.OffHeapMemoryBlock.<init>(OffHeapMemoryBlock.java:26)
  at org.apache.spark.sql.execution.vectorized.OffHeapColumnVector.getBytesAsUTF8String(OffHeapColumnVector.java:221)
  at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:382)
  at org.apache.spark.sql.vectorized.ColumnarArray.getUTF8String(ColumnarArray.java:127)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:130)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:406)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```
Could you turn on the flag in this PR? I want to trigger the tests multiple times. @ajacques --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 @shaneknapp That is great! I think putting Jenkins in a quiet mode looks fine, as long as we send out a note to the dev list. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 https://lists.apache.org/thread.html/5b0836e44f9386fae2f99deed0a01441c699040c991d833faf520357@%3Cdev.arrow.apache.org%3E Arrow 0.10.0 release is officially announced. @shaneknapp Could you help pull the trigger on the update to arrow? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 We need to upgrade it to 0.10.0 in this release, if possible. It resolves some bugs, e.g., https://issues.apache.org/jira/browse/ARROW-1973. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r208412882 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -1976,6 +1976,49 @@ private[spark] object Utils extends Logging { } } + def makeQualified(defaultUri: URI, workingDir: Path, path: Path): Path = { --- End diff -- This is a utility function. We normally need to add function comments and param description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
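[Editor's note] A sketch of the requested documentation (the wording is mine, not from the PR); Hadoop's own `Path.makeQualified(URI, Path)` has similar semantics and stands in for the body here:
```scala
import java.net.URI
import org.apache.hadoop.fs.Path

/**
 * Returns a qualified version of `path`: relative paths are resolved against
 * `workingDir`, and a missing scheme/authority is filled in from `defaultUri`.
 *
 * @param defaultUri default filesystem URI supplying the scheme and authority
 * @param workingDir directory against which relative paths are resolved
 * @param path the path to qualify
 */
def makeQualified(defaultUri: URI, workingDir: Path, path: Path): Path = {
  path.makeQualified(defaultUri, workingDir) // stand-in body; the PR has its own
}
```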
[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21596 @jerryshao This is for 3.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21608: [SPARK-24626] [SQL] Improve location size calcula...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21608#discussion_r208408858 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -49,4 +51,11 @@ object DataSourceUtils { } } } + + // SPARK-15895: Metadata files (e.g. Parquet summary files) and temporary files should not be + // counted as data files, so that they shouldn't participate partition discovery. + private[sql] def isDataPath(path: Path): Boolean = { +val name = path.getName +!((name.startsWith("_") && !name.contains("=")) || name.startsWith(".")) --- End diff -- I am not sure what your earlier implementation was. I would prefer to keep the original code in `PartitioningAwareFileIndex.scala` unchanged and just add a utility function `isDataPath` in CommandUtils.scala. Does this sound good to you? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22028: [SPARK-25046][SQL] Fix Alter View can excute sql like "A...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22028 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21977 cc @jiangxb1987 @cloud-fan @jerryshao @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21977 @rdblue Is this for YARN only? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21977 test cases? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22006: [SPARK-25031][SQL] Fix MapType schema print
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22006 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22028: [SPARK-25046][SQL] Fix Alter View can excute sql like "A...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22028 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22030: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22030#discussion_r208326382 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql]( * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") + * df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings") + * }}} + * + * For pivoting by multiple columns, use the `struct` function to combine the columns and values: + * + * {{{ + * df + * .groupBy($"year") + * .pivot(struct($"course", $"training"), Seq(struct(lit("java"), lit("Experts")))) + * .agg(sum($"earnings")) + * }}} * * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. * @since 2.4.0 */ - def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = { --- End diff -- @HyukjinKwon I think this change is better than what https://github.com/apache/spark/pull/21699 did. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21608: [SPARK-24626] [SQL] Improve location size calcula...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21608#discussion_r208269963 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -49,4 +51,11 @@ object DataSourceUtils { } } } + + // SPARK-15895: Metadata files (e.g. Parquet summary files) and temporary files should not be + // counted as data files, so that they shouldn't participate partition discovery. + private[sql] def isDataPath(path: Path): Boolean = { +val name = path.getName +!((name.startsWith("_") && !name.contains("=")) || name.startsWith(".")) --- End diff -- Let us not use the same predicate as `PartitioningAwareFileIndex.scala`; in this case, whether a file is a data file is not related to the condition `name.contains("=")` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21970: [SPARK-24996][SQL] Use DSL in DeclarativeAggregate
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21970 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22012: [SPARK-25036][SQL] Should compare ExprValue.isNull with ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22012 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 It sounds like the vote can pass soon. https://lists.apache.org/thread.html/9900da1540be5aafce27691fd40395bb53f465302db29979c154d99a@%3Cdev.arrow.apache.org%3E --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 To get this in, we might need to delay the code freeze. Can you reply to the dev list email http://apache-spark-developers-list.1001551.n3.nabble.com/code-freeze-and-branch-cut-for-Apache-Spark-2-4-td24365.html ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21939 After the code freeze, dependency changes are not allowed. Hopefully, we can make it before then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r207850329 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2225,19 +2225,21 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { test("SPARK-23723: specified encoding is not matched to actual encoding") { -val fileName = "test-data/utf16LE.json" -val schema = new StructType().add("firstName", StringType).add("lastName", StringType) -val exception = intercept[SparkException] { - spark.read.schema(schema) -.option("mode", "FAILFAST") -.option("multiline", "true") -.options(Map("encoding" -> "UTF-16BE")) -.json(testFile(fileName)) -.count() -} -val errMsg = exception.getMessage +withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> "false") { --- End diff -- How about CSV? Could you add the same test for CSV too? Also, we need to add the verification logic when the conf is true:
```
Seq(true, false).foreach { optimizeEmptySchema =>
  withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> optimizeEmptySchema.toString) {
    ...
  }
}
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21970: [SPARK-24996][SQL] Use DSL in DeclarativeAggregate
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21970 LGTM pending Jenkins --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org