[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTLR from 4.7 to 4.7.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23005 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22962 cc @jiangxb1987 @MrBago
[GitHub] spark pull request #23005: [SPARK-26005] [SQL] Upgrade ANTLR from 4.7 to 4.7...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/23005

[SPARK-26005] [SQL] Upgrade ANTLR from 4.7 to 4.7.1

## What changes were proposed in this pull request?

Based on the release notes of ANTLR 4.7.1 (https://github.com/antlr/antlr4/releases), let us upgrade our parser to 4.7.1.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark upgradeAntlr4.7

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23005.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23005

commit 71b45f0a5abfd37dfa9fb1ab20e79ce8e06017f6
Author: gatorsmile
Date: 2018-11-11T08:48:29Z

    fix
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21766 @wangtao605 Do you mind documenting our behavior in our Spark SQL doc?
[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17648 @ptkool Thanks for your contribution! This feature will be available in the next release, Spark 3.0.
[GitHub] spark issue #21679: [SPARK-24695] [SQL]: To add support to return Calendar i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21679 For your information, @cloud-fan and I are discussing how we expose calendar interval data types. Will post the proposal later.
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22932 LGTM. Thanks! Merged to master.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232443802

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +37,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
--- End diff --

This is basically copied from getRecordWriter.
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r232441937

--- Diff: docs/sql-programming-guide.md ---
@@ -706,7 +706,7 @@ data across a fixed number of buckets and can be used when a number of unique va
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
 Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
-of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
+of the original data. When reading Parquet files, all columns are automatically converted to be nullable for
--- End diff --

This file has been reorganized. Could you merge the latest master?
[GitHub] spark issue #22759: [MINOR][SQL][DOC] Correct parquet nullability documentat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22759 LGTM. Could you do us a favor and add test cases ensuring that the generated Parquet files have the correct nullability value?
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232424657

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

The same comment here.
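The diff above reaches into Hive's writer through Java reflection. That pattern can be shown with a self-contained sketch; `LegacyWriter` below is a made-up stand-in for Hive 1.2.1's ORC RecordWriter, not the real class.

```scala
// Sketch of the reflection pattern in the diff above: read a private field
// via getDeclaredField/setAccessible. `LegacyWriter` is a hypothetical
// stand-in, invented only for this demo.
class LegacyWriter {
  private val writer: StringBuilder = new StringBuilder("orc-data")
}

def privateField[T](obj: AnyRef, name: String): T = {
  val field = obj.getClass.getDeclaredField(name)
  field.setAccessible(true) // bypass the `private` modifier
  field.get(obj).asInstanceOf[T]
}

val legacy = new LegacyWriter
val handle = privateField[StringBuilder](legacy, "writer")
handle.append("+metadata") // mutate the object through the private handle
println(handle.toString)
```

As in the PR, a real caller should wrap this in a `try`/`catch` with `NonFatal`, because the private field name is an implementation detail that can change between Hive versions.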
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232424626

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +41,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
+    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
--- End diff --

Could we create a separate function for adding this metadata?
[GitHub] spark issue #22987: [SPARK-25979][SQL] Window function: allow parentheses ar...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22987 LGTM. Thanks! Merged to master. I accidentally merged it to 2.4 as well; feel free to revert it if anybody has a concern about this fix.
[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22990 Thanks! Merged to master/2.4
[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22990 cc @cloud-fan
[GitHub] spark pull request #22990: [SPARK-25988] [SQL] Keep names unchanged when ded...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22990

[SPARK-25988] [SQL] Keep names unchanged when deduplicating the column names in Analyzer

## What changes were proposed in this pull request?

When queries do not use consistent casing for column names, users might hit various errors. Below is a typical test failure they can hit.

```
Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15));
org.apache.spark.sql.AnalysisException: Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15));
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146)
	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
```

## How was this patch tested?

Added two test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark fix1283

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22990.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22990

commit 5e9f6f345b93d3370906c7b2d73ede15f4089c29
Author: gatorsmile
Date: 2018-11-09T05:27:37Z

    fix

commit 17b725c79ad602df20c44cacb92e7c6abd84cdda
Author: gatorsmile
Date: 2018-11-09T05:33:58Z

    fix
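The idea in the PR title can be sketched in a few lines: deduplicate column names case-insensitively while keeping each name's original spelling. This is illustrative only; the `_N` suffix scheme is invented for the demo and is not what the Analyzer actually does.

```scala
// Illustrative only: deduplicate names case-insensitively, but keep the
// original case of each name. The numeric suffix scheme is hypothetical.
def dedupPreservingCase(names: Seq[String]): Seq[String] = {
  val seen = scala.collection.mutable.Map.empty[String, Int]
  names.map { name =>
    val count = seen.getOrElse(name.toLowerCase, 0)
    seen(name.toLowerCase) = count + 1
    if (count == 0) name else s"${name}_$count" // spelling stays unchanged
  }
}

println(dedupPreservingCase(Seq("tdate", "TDate", "val"))) // → List(tdate, TDate_1, val)
```

The point is that `tdate` and `TDate` collide under case-insensitive resolution, yet neither name is rewritten to a lowercased form.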
[GitHub] spark issue #22987: [SPARK-25979][SQL] Window function: allow parentheses ar...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22987 LGTM
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22987#discussion_r232136609

--- Diff: sql/core/src/test/resources/sql-tests/inputs/window.sql ---
@@ -109,3 +109,9 @@
 last_value(false, false) OVER w AS last_value_contain_null
 FROM testData
 WINDOW w AS ()
 ORDER BY cate, val;
+
+-- parentheses around window reference
+SELECT cate, sum(val) OVER (w)
+FROM testData
+WHERE val is not null
+WINDOW w AS (PARTITION BY cate ORDER BY val);
--- End diff --

+1
[GitHub] spark pull request #22969: [SPARK-22827][SQL][FOLLOW-UP] Throw `SparkOutOfMe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22969#discussion_r232112326

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
@@ -787,7 +789,7 @@ case class HashAggregateExec(
        |$unsafeRowKeys, ${hashEval.value});
        | if ($unsafeRowBuffer == null) {
        |   // failed to allocate the first page
-       |   throw new OutOfMemoryError("No enough memory for aggregation");
--- End diff --

Opened a JIRA for banning this via a new lint rule: https://issues.apache.org/jira/browse/SPARK-25986
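For context, the follow-up replaces the raw `new OutOfMemoryError(...)` with Spark's own `SparkOutOfMemoryError`, so a failed page allocation can be handled as a task failure rather than a fatal JVM error. A minimal sketch of the subclassing idea, using a made-up class name:

```scala
// Sketch only: a dedicated OutOfMemoryError subclass lets callers distinguish
// "the engine failed to allocate a page" from true JVM heap exhaustion.
// `DemoOutOfMemoryError` is a stand-in, not Spark's SparkOutOfMemoryError.
class DemoOutOfMemoryError(msg: String) extends OutOfMemoryError(msg)

def allocateFirstPage(succeeds: Boolean): Unit =
  if (!succeeds) throw new DemoOutOfMemoryError("No enough memory for aggregation")

val outcome =
  try { allocateFirstPage(succeeds = false); "allocated" }
  catch { case _: DemoOutOfMemoryError => "handled as task failure" }

println(outcome)
```

A lint rule banning bare `new OutOfMemoryError` then forces every throw site through the subclass, which the task runner can catch without tearing down the executor.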
[GitHub] spark pull request #21602: [SPARK-24613][SQL] Cache with UDF could not be ma...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21602#discussion_r232107563

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -142,7 +142,7 @@ class CacheManager extends Logging {
       // Remove the cache entry before we create a new one, so that we can have a different
       // physical plan.
       it.remove()
-      val plan = spark.sessionState.executePlan(cd.plan).executedPlan
+      val plan = spark.sessionState.executePlan(AnalysisBarrier(cd.plan)).executedPlan
--- End diff --

We need a test case to cover this change.
[GitHub] spark issue #22949: [minor] update known_translations
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22949 Thanks! Merged to master.
[GitHub] spark pull request #22947: [SPARK-24913][SQL] Make AssertNotNull and AssertT...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22947#discussion_r231230211

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -66,6 +66,8 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCa
   override def nullable: Boolean = true

+  override lazy val deterministic: Boolean = false
--- End diff --

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAssertTrue.java#L45

Hive also marks it as non-deterministic. cc @maryannxue too
[GitHub] spark issue #21596: [SPARK-24601] Update Jackson to 2.9.6
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21596

It sounds like Jackson has an important behavior change that might impact us.

```Scala
import com.fasterxml.jackson.annotation.JsonInclude.Include
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

val mapper = new ObjectMapper with ScalaObjectMapper
mapper.setSerializationInclusion(Include.NON_NULL)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.registerModule(DefaultScalaModule)
println(mapper.writeValueAsString(Map("abc" -> null)))
```

Previously, this output `{"abc":null}`, but now it outputs `{}` unless we explicitly set the inclusion to `Include.ALWAYS`. The upgrade looks risky to me.
[GitHub] spark pull request #22949: [minor] update known_translations
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22949#discussion_r230858019

--- Diff: dev/create-release/known_translations ---
@@ -203,3 +203,61 @@ shenh062326 - Shen Hong
 aokolnychyi - Anton Okolnychyi
 linbojin - Linbo Jin
 lw-lin - Liwei Lin
+10110346 - Xian Liu
+Achuth17 - Achuth Narayan Rajagopal
+Adamyuanyuan - Adam Wang
+DylanGuedes - Dylan Guedes
+JiahuiJiang - Jiahui Jiang
+KevinZwx - Kevin Zhang
+LantaoJin - Lantao Jin
+Lemonjing - Rann Tao
+LucaCanali - Luca Canali
+XD-DENG - Xiaodong Deng
+aai95 - Aleksei Izmalkin
+akonopko - Alexander Konopko
+ankuriitg - Ankur Gupta
+arucard21 - Riaas Mokiem
+attilapiros - Attila Zsolt Piros
+bravo-zhang - Bravo Zhang
+caneGuy - Kang Zhou
+chaoslawful - Xiaozhe Wang
+cluo512 - Chuan Luo
+codeatri - Neha Patil
+crafty-coder - Carlos Pena
+debugger87 - Chaozhong Yang
+e-dorigatti - Emilio Dorigatti
+eric-maynard - Eric Maynard
+felixalbani - Felix Albani
+fjh100456 - fjh100456
+guoxiaolongzte - Xiaolong Guo
+heary-cao - Xuewen Cao
+huangweizhe123 - Weizhe Huang
+ivoson - Tengfei Huang
+jinxing64 - Jin Xing
+liu-zhaokun - Zhaokun Liu
--- End diff --

Also add:

> KaiXinXiaoLei - LeiLei Hu
[GitHub] spark pull request #22949: [minor] update known_translations
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22949#discussion_r230856955

--- Diff: dev/create-release/known_translations ---
@@ -203,3 +203,61 @@ shenh062326 - Shen Hong
 aokolnychyi - Anton Okolnychyi
 linbojin - Linbo Jin
 lw-lin - Liwei Lin
+10110346 - Xian Liu
+Achuth17 - Achuth Narayan Rajagopal
+Adamyuanyuan - Adam Wang
+DylanGuedes - Dylan Guedes
+JiahuiJiang - Jiahui Jiang
+KevinZwx - Kevin Zhang
+LantaoJin - Lantao Jin
+Lemonjing - Rann Tao
+LucaCanali - Luca Canali
+XD-DENG - Xiaodong Deng
+aai95 - Aleksei Izmalkin
+akonopko - Alexander Konopko
+ankuriitg - Ankur Gupta
+arucard21 - Riaas Mokiem
+attilapiros - Attila Zsolt Piros
+bravo-zhang - Bravo Zhang
+caneGuy - Kang Zhou
+chaoslawful - Xiaozhe Wang
+cluo512 - Chuan Luo
+codeatri - Neha Patil
+crafty-coder - Carlos Pena
+debugger87 - Chaozhong Yang
+e-dorigatti - Emilio Dorigatti
+eric-maynard - Eric Maynard
+felixalbani - Felix Albani
+fjh100456 - fjh100456
--- End diff --

Jinhua Fu
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22928 Keeping them in separate source files is also fine to me. I think we should put them in the same package.
[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22889#discussion_r230614726

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
     join(right, Seq(usingColumn))
   }

+  /**
+   * Equi-join with another `DataFrame` using the given column.
+   *
+   * Different from other join functions, the join column will only appear once in the output,
+   * i.e. similar to SQL's `JOIN USING` syntax.
+   *
+   * {{{
+   *   // Left join of df1 and df2 using the column "user_id"
+   *   df1.join(df2, "user_id", "left")
+   * }}}
+   *
+   * @param right Right side of the join operation.
+   * @param usingColumn Name of the column to join on. This column must exist on both sides.
+   * @param joinType Type of join to perform. Default `inner`. Must be one of:
+   *                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
+   *                 `right`, `right_outer`, `left_semi`, `left_anti`.
+   * @note If you perform a self-join using this function without aliasing the input
+   *       `DataFrame`s, you will NOT be able to reference any columns after the join, since
+   *       there is no way to disambiguate which side of the join you would like to reference.
+   * @group untypedrel
+   */
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

Normally, we do not add such an API.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22913 Also cc @ueshin
[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22693#discussion_r230609824

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -115,26 +116,45 @@ class ResolveHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan] {
 class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case filterPlan @ Filter(_, SubqueryAlias(_, relation: HiveTableRelation)) =>
+      val predicates = PhysicalOperation.unapply(filterPlan).map(_._2).getOrElse(Nil)
+      computeTableStats(relation, predicates)
     case relation: HiveTableRelation
         if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
-      val table = relation.tableMeta
-      val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
-        try {
-          val hadoopConf = session.sessionState.newHadoopConf()
-          val tablePath = new Path(table.location)
-          val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
-          fs.getContentSummary(tablePath).getLength
-        } catch {
-          case e: IOException =>
-            logWarning("Failed to get table size from hdfs.", e)
-            session.sessionState.conf.defaultSizeInBytes
-        }
-      } else {
-        session.sessionState.conf.defaultSizeInBytes
+      computeTableStats(relation)
+  }
+
+  private def computeTableStats(
+      relation: HiveTableRelation,
+      predicates: Seq[Expression] = Nil): LogicalPlan = {
+    val table = relation.tableMeta
+    val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
+      try {
+        val hadoopConf = session.sessionState.newHadoopConf()
+        val tablePath = new Path(table.location)
+        val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
+        BigInt(fs.getContentSummary(tablePath).getLength)
+      } catch {
+        case e: IOException =>
+          logWarning("Failed to get table size from hdfs.", e)
+          getSizeInBytesFromTablePartitions(table.identifier, predicates)
       }
+    } else {
+      getSizeInBytesFromTablePartitions(table.identifier, predicates)
+    }
+    val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = sizeInBytes)))
+    relation.copy(tableMeta = withStats)
+  }

-      val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes
-      relation.copy(tableMeta = withStats)
+  private def getSizeInBytesFromTablePartitions(
+      tableIdentifier: TableIdentifier,
+      predicates: Seq[Expression] = Nil): BigInt = {
+    session.sessionState.catalog.listPartitionsByFilter(tableIdentifier, predicates) match {
--- End diff --

The perf will be pretty bad when the number of partitions is huge. Thus, I think we can close this PR.
[GitHub] spark pull request #19796: [SPARK-22581][SQL] Catalog api does not allow to ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19796#discussion_r230609716

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
@@ -411,7 +410,29 @@ abstract class Catalog {
       tableName: String,
       source: String,
       schema: StructType,
-      options: Map[String, String]): DataFrame
+      options: Map[String, String]): DataFrame = {
+    createTable(tableName, source, schema, options, Nil)
+  }
+
+  /**
+   * :: Experimental ::
+   * (Scala-specific)
+   * Create a table based on the dataset in a data source, a schema, a set of options and a set
+   * of partition columns. Then, returns the corresponding DataFrame.
+   *
+   * @param tableName is either a qualified or unqualified name that designates a table.
+   *                  If no database identifier is provided, it refers to a table in
+   *                  the current database.
+   * @since ???
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def createTable(
+      tableName: String,
+      source: String,
+      schema: StructType,
+      options: Map[String, String],
+      partitionColumnNames: Seq[String]): DataFrame
--- End diff --

I think we will not introduce a new API for partitioning columns at the current stage. Let us use SQL DDL for creating the table.
[GitHub] spark issue #22089: [SPARK-25098][SQL]‘Cast’ will return NULL when input...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22089 @wangyum Could you please take it over?
[GitHub] spark pull request #22089: [SPARK-25098][SQL]‘Cast’ will return NULL whe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22089#discussion_r230609486

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
@@ -98,6 +98,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
     c.set(Calendar.MILLISECOND, 0)
     checkEvaluation(Cast(Literal("2015-03-18"), DateType), new Date(c.getTimeInMillis))
     checkEvaluation(Cast(Literal("2015-03-18 "), DateType), new Date(c.getTimeInMillis))
+    checkEvaluation(Cast(Literal(" 2015-03-18"), DateType), new Date(c.getTimeInMillis))
--- End diff --

> SELECT CAST(' 22-OCT-1997' AS TIMESTAMP) FROM dual;

Oracle also trims the leading space.
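The behavior under discussion, trimming whitespace before casting a string to a date, can be sketched with the JDK's `java.time` alone. This mimics the intent of the new test case; it is not Spark's actual `Cast` implementation.

```scala
// Stdlib sketch: trim before parsing so " 2015-03-18" behaves like
// "2015-03-18". LocalDate.parse would reject the untrimmed string.
import java.time.LocalDate
import java.time.format.DateTimeParseException

def castToDate(s: String): Option[LocalDate] =
  try Some(LocalDate.parse(s.trim))
  catch { case _: DateTimeParseException => None } // invalid input casts to NULL

println(castToDate(" 2015-03-18")) // → Some(2015-03-18)
println(castToDate("not-a-date"))  // → None
```

Returning `None` on a parse failure mirrors SQL `CAST` semantics, where an unparseable string yields NULL rather than an error.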
[GitHub] spark issue #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceCommand does ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22941 I think this is not a bug, although the plan is confusing.
[GitHub] spark pull request #22930: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand'...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22930#discussion_r230609078

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala ---
@@ -37,13 +37,12 @@ case class SaveIntoDataSourceCommand(
     query: LogicalPlan,
     dataSource: CreatableRelationProvider,
     options: Map[String, String],
-    mode: SaveMode) extends RunnableCommand {
+    mode: SaveMode,
+    outputColumnNames: Seq[String]) extends DataWritingCommand {

-  override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query)
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
-    dataSource.createRelation(
-      sparkSession.sqlContext, mode, options, Dataset.ofRows(sparkSession, query))
--- End diff --

See what I commented in https://github.com/apache/spark/pull/22941
[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22941#discussion_r230609046

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala ---
@@ -589,4 +590,33 @@ class InsertSuite extends DataSourceTest with SharedSQLContext {
       sql("INSERT INTO TABLE test_table SELECT 2, null")
     }
   }
+
+  test("SPARK-25936 InsertIntoDataSourceCommand does not use Cached Data") {
--- End diff --

You can move this test to CachedTableSuite.scala and use the helper functions to verify whether the cache is used. See the example:

```
spark.range(2).createTempView("test_view")
spark.catalog.cacheTable("test_view")
val rddId = rddIdOf("test_view")
assert(!isMaterialized(rddId))
sql("INSERT INTO TABLE test_table SELECT * FROM test_view")
assert(isMaterialized(rddId))
```
[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22941#discussion_r230608937

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSourceCommand.scala ---
@@ -30,14 +30,13 @@ import org.apache.spark.sql.sources.InsertableRelation
 case class InsertIntoDataSourceCommand(
     logicalRelation: LogicalRelation,
     query: LogicalPlan,
-    overwrite: Boolean)
-  extends RunnableCommand {
+    overwrite: Boolean,
+    outputColumnNames: Seq[String])
+  extends DataWritingCommand {

-  override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query)
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
+  override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
     val relation = logicalRelation.relation.asInstanceOf[InsertableRelation]
-    val data = Dataset.ofRows(sparkSession, query)
--- End diff --

This will use the cached data, although the plan does not show that the cached data is used.
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22936 ok to test
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22932 Will take a look this week. Thanks for your work!
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230563752

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@
 Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Is this caused by adding `org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT`?
[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22255 Just to confirm: `created_by` is set to `parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)`?
[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22921 LGTM except a minor comment about the test case. Also we need to fix the PySpark test failure
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230283948 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -279,27 +254,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { structDf.select(hash($"a", $"record.*"))) } - test("Star Expansion - explode should fail with a meaningful message if it takes a star") { --- End diff -- Can we rewrite this test case using select(explode()), like what we did in the following test cases?
[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22255 Also cc @hvanhovell
[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22255 @dongjoon-hyun Do you want to take this over?
[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22255 https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L902 @rdblue Can we use created_by? ``` /** String for application that wrote this file. This should be in the format * <Application> version <App Version> (build <App Build Hash>). * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55) **/ 6: optional string created_by ```
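As an aside on the metadata format discussed above: the parquet.thrift convention `<Application> version <App Version> (build <App Build Hash>)` can be parsed mechanically. The sketch below is purely illustrative — the regex and the `parse_created_by` helper are my own, not part of Spark or parquet-mr:

```python
import re

# Pattern for the parquet.thrift created_by convention:
#   <Application> version <App Version> (build <App Build Hash>)
# The (build ...) suffix is optional to tolerate writers that omit it.
CREATED_BY_RE = re.compile(
    r"^(?P<app>.+?) version (?P<version>\S+)(?: \(build (?P<build>\S+)\))?$"
)

def parse_created_by(created_by):
    """Split a created_by string into (application, version, build hash)."""
    m = CREATED_BY_RE.match(created_by)
    if m is None:
        return None
    return m.group("app"), m.group("version"), m.group("build")

print(parse_created_by(
    "parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)"
))
# -> ('parquet-mr', '1.10.0', '031a6654009e3b82020012a18434c582bd74c73a')
```

The optional build group also accepts older writer strings without the `(build ...)` suffix.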
[GitHub] spark issue #15599: [SPARK-18022][SQL] java.lang.NullPointerException instea...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15599 @shatestest Your problem is different from the issue this PR tries to resolve. If you can provide a test case to reproduce it, feel free to open a JIRA
[GitHub] spark pull request #22927: [SPARK-25918][SQL] LOAD DATA LOCAL INPATH should ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22927#discussion_r230217147 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -393,7 +393,7 @@ object LoadDataCommand { throw new IllegalArgumentException(e) } } else { - path + new Path(pathUri) --- End diff -- Here the path is converted to a URI and then converted back to a Path. What is the goal, rather than directly building the path? ``` if (path.isAbsolute()) path else new Path(workingDir, path)
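The one-liner quoted in the review — resolve the path against the working directory only when it is relative — can be sketched without Hadoop's API. The Python below is an illustration only (the `qualify` name and the use of `pathlib` are my assumptions, not Spark's actual code):

```python
from pathlib import PurePosixPath

def qualify(path, working_dir):
    """Resolve a possibly-relative path against a working directory,
    mirroring the suggested rule:
        if (path.isAbsolute()) path else new Path(workingDir, path)
    (illustrative sketch only, not Spark code)."""
    p = PurePosixPath(path)
    if p.is_absolute():
        return str(p)          # absolute paths pass through unchanged
    return str(PurePosixPath(working_dir) / p)  # relative paths are joined

print(qualify("data/kv1.txt", "/user/hive"))  # -> /user/hive/data/kv1.txt
print(qualify("/tmp/kv1.txt", "/user/hive"))  # -> /tmp/kv1.txt
```

The round-trip through `URI` that the diff removes would only matter if the string needed scheme/authority normalization, which the direct construction above skips.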
[GitHub] spark pull request #22927: [SPARK-25918][SQL] LOAD DATA LOCAL INPATH should ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22927#discussion_r230216125 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -1987,6 +1987,13 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } + test("SPARK-25918: LOAD DATA LOCAL INPATH should handle a relative path") { --- End diff -- This does not belong in this test suite. `HiveCommandSuite.scala` is the best place, although this is not for the Hive module.
[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22924 I would be more conservative in these cases. The new release of Py4J might introduce new regressions. Not all the changes are clearly documented in their release notes. It is kind of aggressive to change the dependency in our maintenance releases.
[GitHub] spark issue #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type which r...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22905 @mallman Could you run the EXPLAIN with these new changes and post it in the PR description?
[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21688 cc @gengliangwang
[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22924 I am against this change in the 2.4 branch unless the bugs are critical for our end users.
[GitHub] spark issue #22788: [SPARK-25769][SQL]make UnresolvedAttribute.sql escape ne...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22788 This sounds like a safe change. cc @liancheng
[GitHub] spark issue #22910: [SPARK-25899][TESTS]Fix flaky CoarseGrainedSchedulerBack...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22910 Thanks! Merged to master/2.4
[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229879855 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -812,6 +812,17 @@ object SQLConf { .intConf .createWithDefault(65535) + val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold") +.internal() +.doc("The threshold of source code length without comment of a single Java function by " + + "codegen to be split. When the generated Java function source code exceeds this threshold" + + ", it will be split into multiple small functions. We can't know how many bytecode will " + + "be generated, so use the code length as metric. A function's bytecode should not go " + + "beyond 8KB, otherwise it will not be JITted; it also should not be too small, otherwise " + + "there will be many function calls.") +.intConf --- End diff -- Could you try some very long alias names or complex expressions? You will get a different number, right?
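For readers unfamiliar with the config under discussion: the splitter cannot know the eventual bytecode size (the ~8KB JIT limit mentioned in the doc string), so it uses source-code length as a proxy. A toy sketch of length-based splitting — not Spark's actual codegen, just the idea:

```python
def split_by_length(statements, threshold):
    """Group generated statements into chunks whose combined source length
    stays under `threshold` characters, mimicking (very loosely) how codegen
    splits one huge generated function into several smaller helpers."""
    chunks, current, size = [], [], 0
    for stmt in statements:
        # Start a new chunk once adding this statement would cross the limit.
        if current and size + len(stmt) > threshold:
            chunks.append(current)
            current, size = [], 0
        current.append(stmt)
        size += len(stmt)
    if current:
        chunks.append(current)
    return chunks

stmts = ["int a = 1;", "int b = a + 1;", "int c = b * 2;"]
print(split_by_length(stmts, threshold=25))
# -> [['int a = 1;', 'int b = a + 1;'], ['int c = b * 2;']]
```

This also makes the review question concrete: very long alias names inflate statement length, so the same logical plan crosses the character threshold earlier.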
[GitHub] spark issue #22778: [SPARK-25784][SQL] Infer filters from constraints after ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22778 Let us hold this PR and try to fix https://github.com/apache/spark/pull/17520 instead.
[GitHub] spark issue #22890: [SPARK-25883][SQL][Minor] Override method `prettyName` i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22890 Thanks! Merged to master.
[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229572426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -812,6 +812,18 @@ object SQLConf { .intConf .createWithDefault(65535) + val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold") +.internal() +.doc("The threshold of source code length without comment of a single Java function by " + + "codegen to be split. When the generated Java function source code exceeds this threshold" + + ", it will be split into multiple small functions. We cannot know how many bytecode will " + --- End diff -- `The threshold of source-code splitting in the codegen. When the number of characters in a single JAVA function (without comment) exceeds the threshold, the function will be automatically split to multiple smaller ones.`
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22870 LGTM Thanks! Merged to master.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17520 @dilipbiswal Could you take this over?
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229361802 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- Sure. That also sounds good to me. @dilipbiswal Could you take the PR https://github.com/apache/spark/pull/17520 over?
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229178084 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- I do not have a good answer for this PR. Ideally, we should run the whole `operatorOptimizationBatch` batch. However, running the rules could be very time consuming. I would suggest adding a new parameter to introduce a time-bound limit for each batch. cc @maryannxue WDYT?
[GitHub] spark issue #22879: [SPARK-25872][SQL][TEST] Add an optimizer tracker for TP...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22879 For generating the golden files, I would not generate the whole plan. Instead, we can keep the node name. That should be good enough.
[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22857 Please be really careful with null handling. It could easily introduce correctness bugs like the ones we recently fixed.
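The caution about null handling stems from SQL's three-valued logic: rewriting `Literal(null, _)` to `FalseLiteral` is only safe in contexts (such as a top-level WHERE predicate) where NULL and FALSE behave identically. A small sketch of Kleene logic — illustrative Python, not Spark code — shows the asymmetry that makes such rewrites risky:

```python
# None models SQL NULL (unknown) in Kleene three-valued logic.
def kleene_and(a, b):
    if a is False or b is False:
        return False          # FALSE dominates regardless of NULL
    if a is None or b is None:
        return None           # otherwise NULL is contagious
    return True

def kleene_not(a):
    # NOT(NULL) stays NULL -- it does NOT become True.
    return None if a is None else (not a)

# In a WHERE clause only True keeps the row, so NULL acts like False there;
# but inside NOT(...) the two diverge, which is how null-to-false rewrites
# in the wrong context produce correctness bugs.
print(kleene_not(None), kleene_and(None, False))  # -> None False
```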
[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22878 cc @gengliangwang
[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22870#discussion_r228776577 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala --- @@ -206,7 +206,7 @@ case class SpecifiedWindowFrame( // Check combination (of expressions). (lower, upper) match { case (l: Expression, u: Expression) if !isValidFrameBoundary(l, u) => -TypeCheckFailure(s"Window frame upper bound '$upper' does not followes the lower bound " + +TypeCheckFailure(s"Window frame upper bound '$upper' does not follow the lower bound " + --- End diff -- We need to rerun org.apache.spark.sql.SQLQueryTestSuite
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22514 This requires more careful review. I do not think we can make it in RC5
[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22817 LGTM Thanks! Merged to master/2.4/2.3
[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22817 The fix looks fine to me. cc @cloud-fan @hvanhovell
[GitHub] spark pull request #22817: [SPARK-25816][SQL] Fix attribute resolution in ne...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22817#discussion_r228729920 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2578,4 +2578,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { Row ("abc", 1)) } } + + test("SPARK-25816 ResolveReferences works with nested extractors") { +val df0 = Seq((1, Map(1 -> "a")), (2, Map(2 -> "b"))).toDF("1", "2") +val df1 = df0.select($"1".as("2"), $"2".as("1")) +val df2 = df1.filter($"1"(map_keys($"1")(0)) > "a") --- End diff -- We are unable to resolve the expressions in `extraction` of `UnresolvedExtractValue`. We can simplify the expression in the `extraction`. For example, `df1.filter($"1"($"2") > "a")`.
[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22817 Will take a look at this tonight. Thanks for reporting this!
[GitHub] spark pull request #22827: [SPARK-25832][SQL][BRANCH-2.4] Revert newly added...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/22827
[GitHub] spark pull request #22827: [SPARK-25832][SQL] Revert newly added map related...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22827#discussion_r228267525 --- Diff: sql/core/src/test/resources/sql-tests/inputs/higher-order-functions.sql --- @@ -60,8 +60,3 @@ select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, -- Zip with array coalesce select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, y)) as v; - -create or replace temporary view nested as values - (1, map(1, 1, 2, 2, 3, 3)), - (2, map(4, 4, 5, 5, 6, 6)) - as t(x, ys); --- End diff -- That has been removed already
[GitHub] spark issue #22827: [SPARK-25832][SQL] Revert newly added map related functi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22827 @dongjoon-hyun I think that can be kept.
[GitHub] spark issue #22827: [SPARK-25832][SQL] Revert newly added map related functi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22827 @felixcheung @mengxr The last commit https://github.com/apache/spark/pull/22827/commits/f1f3d0b09527d28fd8a1f1a2c8e2358c10da175b is for the R-related revert. Please take a look
[GitHub] spark issue #22827: [SPARK-25832][SQL] Revert newly added map related functi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22827 @cloud-fan @dongjoon-hyun @rxin @felixcheung
[GitHub] spark pull request #22827: [SPARK-25832][SQL] Revert newly added map related...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22827 [SPARK-25832][SQL] Revert newly added map related functions ## What changes were proposed in this pull request? - Revert [SPARK-23935][SQL] Adding map_entries function: https://github.com/apache/spark/pull/21236 - Revert [SPARK-23937][SQL] Add map_filter SQL function: https://github.com/apache/spark/pull/21986 - Revert [SPARK-23940][SQL] Add transform_values SQL function: https://github.com/apache/spark/pull/22045 - Revert [SPARK-23939][SQL] Add transform_keys function: https://github.com/apache/spark/pull/22013 - Revert [SPARK-23932][SQL] Higher order function zip_with: https://github.com/apache/spark/pull/22031 - Revert [SPARK-23938][SQL] Add map_zip_with function: https://github.com/apache/spark/pull/22017 - Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: https://github.com/apache/spark/pull/21434/ ## How was this patch tested? The existing tests. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark revertMap2.4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22827.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22827 commit b8cab59abc45d9178785829f6a455e8a7a6a61fa Author: gatorsmile Date: 2018-10-25T16:05:20Z Revert [SPARK-23935][SQL] Adding map_entries function commit 25498322863ccad2a1481c939dcd0392862baf55 Author: gatorsmile Date: 2018-10-25T16:18:40Z Revert [SPARK-23937][SQL] Add map_filter SQL function commit 8623f864366ba2e7a38150ee4aa119e90b050155 Author: gatorsmile Date: 2018-10-25T16:23:17Z Revert [SPARK-23940][SQL] Add transform_values SQL function commit 666d6ab25f6d12f27510740badbf7501fd08168a Author: gatorsmile Date: 2018-10-25T16:27:03Z Revert [SPARK-23939][SQL] Add transform_keys function commit 59393872427866d032eefe2e04e7ac6d94e52219 Author: gatorsmile Date: 2018-10-25T16:34:19Z Revert [SPARK-23938][SQL] Add map_zip_with function commit f1f3d0b09527d28fd8a1f1a2c8e2358c10da175b Author: gatorsmile Date: 2018-10-25T16:40:55Z Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR
[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22821 I will submit a PR to revert all these changes. Thanks!
[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22773 Yes. The goal of the migration guide is to help end users upgrade their Spark.
[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22773 Anything that could break existing user applications should be documented in the migration guide. This will simplify the system upgrade for our end users.
[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22773 This is an external change. This needs a migration guide update.
[GitHub] spark issue #22817: [SPARK-25816][SQL] ResolveReferences should work bottom-...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22817 This would not work, since we need both bottom-up and top-down. You would hit the test failure.
[GitHub] spark pull request #22794: [SPARK-24570][SQL] Implement Spark own GetTablesO...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22794#discussion_r227872504 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetTablesOperation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive.thriftserver + +import java.util.{List => JList, Map => JMap, UUID} + +import scala.collection.JavaConverters.seqAsJavaListConverter + +import org.apache.hadoop.hive.ql.security.authorization.plugin.HiveOperationType +import org.apache.hadoop.hive.ql.security.authorization.plugin.HivePrivilegeObjectUtils +import org.apache.hive.service.cli._ +import org.apache.hive.service.cli.operation.GetTablesOperation +import org.apache.hive.service.cli.session.HiveSession + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.catalog.CatalogTableType._ +import org.apache.spark.sql.catalyst.catalog.SessionCatalog + +private[hive] class SparkGetTablesOperation( +parentSession: HiveSession, +catalogName: String, +schemaName: String, +tableName: String, +tableTypes: JList[String]) +(sqlContext: SQLContext, sessionToActivePool: JMap[SessionHandle, String]) --- End diff -- Just add sqlContext as a class parameter.
[GitHub] spark pull request #22794: [SPARK-24570][SQL] Implement Spark own GetTablesO...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22794#discussion_r227872346 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetTablesOperation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive.thriftserver + +import java.util.{List => JList, Map => JMap, UUID} + +import scala.collection.JavaConverters.seqAsJavaListConverter + +import org.apache.hadoop.hive.ql.security.authorization.plugin.HiveOperationType +import org.apache.hadoop.hive.ql.security.authorization.plugin.HivePrivilegeObjectUtils +import org.apache.hive.service.cli._ +import org.apache.hive.service.cli.operation.GetTablesOperation +import org.apache.hive.service.cli.session.HiveSession + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.catalog.CatalogTableType._ +import org.apache.spark.sql.catalyst.catalog.SessionCatalog + +private[hive] class SparkGetTablesOperation( +parentSession: HiveSession, +catalogName: String, +schemaName: String, +tableName: String, +tableTypes: JList[String]) +(sqlContext: SQLContext, sessionToActivePool: JMap[SessionHandle, String]) --- End diff -- Just wondering why we need `sessionToActivePool`?
[GitHub] spark pull request #22794: [SPARK-24570][SQL] Implement Spark own GetTablesO...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22794#discussion_r227869218 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -644,6 +644,47 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { assert(resultSet.getString(1) === "4.56") } } + + test("SPARK-24196: SQL client(DBeaver) can't show tables") { --- End diff -- Create a `SparkMetadataOperationSuite`?
[GitHub] spark issue #22794: [SPARK-24570][SQL] Implement Spark own GetTablesOperatio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22794 cc @srinathshankar
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22144 I think this is different from the blocker tickets we opened before. We should try our best to avoid accidentally dropping existing support. Please encourage more people in the community to try our RCs and find all the new regressions or bugs. During the RC stage, for new regressions, we can either fix them if the fix is very safe/tiny, or revert the PRs that introduced them. This is how we handle regressions during the RC stage. For this specific case, I do not think the root cause has been found. If we revert the previous PRs (https://issues.apache.org/jira/browse/SPARK-18186) that were merged in the 2.2 release, it could easily introduce new regressions. Thus, we normally do not revert PRs that were merged in previous releases. Please add it as a known issue in the release notes.
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22772 LGTM Let us merge this PR first. If we find more, feel free to open a new follow-up PR
[GitHub] spark issue #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22750 LGTM
[GitHub] spark pull request #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needs...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22750#discussion_r226818662 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -168,10 +168,11 @@ case class FileSourceScanExec( // Note that some vals referring the file-based relation are lazy intentionally // so that this plan can be canonicalized on executor side too. See SPARK-23731. - override lazy val supportsBatch: Boolean = relation.fileFormat.supportBatch( -relation.sparkSession, StructType.fromAttributes(output)) + override lazy val supportsBatch: Boolean = { +relation.fileFormat.supportBatch(relation.sparkSession, schema) + } - override lazy val needsUnsafeRowConversion: Boolean = { + private lazy val needsUnsafeRowConversion: Boolean = { if (relation.fileFormat.isInstanceOf[ParquetSource]) { --- End diff -- How about ORC? Why do we only need to check Parquet?
[GitHub] spark pull request #22750: [SPARK-25747][SQL] remove ColumnarBatchScan.needs...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22750#discussion_r226806646

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
@@ -164,12 +162,11 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
   val outputVars = output.zipWithIndex.map { case (a, i) =>
--- End diff --

Set outputVars to null?
[GitHub] spark issue #22756: [SPARK-25758][ML] Deprecate computeCost on BisectingKMea...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22756 Done
[GitHub] spark issue #22756: [SPARK-25758][ML] Deprecate computeCost on BisectingKMea...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22756 Let me revert it. Thanks!
[GitHub] spark issue #22756: [SPARK-25758][ML] Deprecate computeCost on BisectingKMea...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22756 cc @mengxr WDYT? It does not sound like a blocker to me.
[GitHub] spark issue #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor sig...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22732 retest this please
[GitHub] spark issue #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor sig...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22732 LGTM pending Jenkins
[GitHub] spark issue #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor sig...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22732 For Scala 2.11, we should not introduce any behavior change, and we also keep binary and source compatibility:

```
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes

scala> val f2 = udf({(x: Int) => x}, DataTypes.IntegerType)
f2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,None)

scala> spark.range(3).select(f2('id + null)).show()
+----------------+
|UDF((id + null))|
+----------------+
|            null|
|            null|
|            null|
+----------------+
```

For Scala 2.12, since we are unable to know the type nullability in a few APIs, we issue a warning message in those cases. Below is an example that generates a different answer:

```
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes

scala> val f2 = udf({(x: Int) => x}, DataTypes.IntegerType)
f2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction($Lambda$2801/26868055@5eb35a26,IntegerType,None)

scala> spark.range(3).select(f2('id + null)).show()
18/10/18 23:07:02 WARN ScalaReflection: Scala version 2.12.7 cannot get type nullability correctly via reflection, thus Spark cannot add proper input null check for UDF. To avoid this problem, use the typed UDF interfaces instead.
+----------------+
|UDF((id + null))|
+----------------+
|               0|
|               0|
|               0|
+----------------+
```
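The Scala 2.12 result above (0 instead of null) comes down to how a null reference is unboxed into a primitive slot when Spark cannot insert an input null check. A minimal plain-Scala sketch of that mechanism, no Spark required (`NullUnboxDemo` and `unbox` are made-up names for illustration):

```scala
object NullUnboxDemo {
  // asInstanceOf[Int] on a boxed value goes through Scala's
  // BoxesRunTime.unboxToInt, which maps a null reference to the primitive
  // default 0 rather than throwing. When no null check guards a
  // primitive-typed UDF input, this is why null inputs turn into 0.
  def unbox(boxed: Any): Int = boxed.asInstanceOf[Int]

  def main(args: Array[String]): Unit = {
    println(unbox(null)) // prints 0, mirroring the Scala 2.12 output above
    println(unbox(7))    // prints 7; non-null inputs are unaffected
  }
}
```

This is also why the warning recommends the typed UDF interfaces: when Spark can see the input type through `TypeTag`-based reflection, it knows the parameter is a non-nullable primitive and can add the null check itself.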
[GitHub] spark pull request #22769: [SPARK-24499][DOC][FOLLOW-UP] Split the page of s...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/22769
[GitHub] spark pull request #22769: [SPARK-24499][DOC][FOLLOW-UP] Split the page of s...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22769 [SPARK-24499][DOC][FOLLOW-UP] Split the page of sql-programming-guide.html to multiple separate pages

## What changes were proposed in this pull request?

Forgot to remove the link for `Upgrading From Spark SQL 2.4 to 3.0` when merging to 2.4.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark test2.4

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22769.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22769

commit 4027ba1097caa053a7d30926b0e407912bfd78f6
Author: gatorsmile
Date: 2018-10-18T20:14:17Z

    fix
[GitHub] spark issue #22769: [SPARK-24499][DOC][FOLLOW-UP] Split the page of sql-prog...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22769 cc @dongjoon-hyun @xuanyuanking