[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat

2018-04-16 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21073#discussion_r181943238
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -115,6 +116,62 @@ case class MapValues(child: Expression)
   override def prettyName: String = "map_values"
 }
 
+/**
+ * Returns the union of all the given maps.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(map, ...) - Returns the union of all the given maps",
+examples = """
+Examples:
+  > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
+   [[1 -> "a"], [2 -> "c"], [3 -> "d"]
+  """)
+case class MapConcat(children: Seq[Expression]) extends Expression
+  with CodegenFallback {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+// this check currently does not allow valueContainsNull to vary,
+// and unfortunately none of the MapType toString methods include
+// valueContainsNull for the error message
+if (children.exists(!_.dataType.isInstanceOf[MapType])) {
+  TypeCheckResult.TypeCheckFailure(
+s"The given input of function $prettyName should all be of type map, " +
+  "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+} else if (children.map(_.dataType).distinct.length > 1) {
+  TypeCheckResult.TypeCheckFailure(
+s"The given input maps of function $prettyName should all be the same type, " +
+  "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+} else {
+  TypeCheckResult.TypeCheckSuccess
+}
+  }
+  override def dataType: MapType = {
+children.headOption.map(_.dataType.asInstanceOf[MapType])
+  .getOrElse(MapType(keyType = StringType, valueType = StringType))
+  }
+
+  override def nullable: Boolean = false
--- End diff --

@henryr empty map:


scala> df.select(map_concat('map1, 'map2).as('newMap)).show
+------+
|newMap|
+------+
|    []|
|    []|
+------+

The Presto docs (from which the proposed spec comes) are quiet on the matter. 
Even after looking at the Presto code, I am still hard-pressed to say whether 
the result can ever be null.

I did divine from the Presto code that there should be at least two inputs 
(and I don't currently verify that).
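
A minimal arity check of the kind mentioned (purely hypothetical, not part of the patch; `checkArity` and its `Either`-based result are illustrative stand-ins for a `TypeCheckResult`-returning check) might look like:

```scala
// Hypothetical sketch of an at-least-two-inputs check for a variadic
// expression; `children` stands in for the Seq of input expressions.
def checkArity(children: Seq[Any], prettyName: String): Either[String, Unit] = {
  if (children.size < 2)
    Left(s"The function $prettyName requires at least two input maps, " +
      s"but got ${children.size}")
  else
    Right(())
}
```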


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat

2018-04-16 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21073#discussion_r181915827
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -115,6 +116,62 @@ case class MapValues(child: Expression)
   override def prettyName: String = "map_values"
 }
 
+/**
+ * Returns the union of all the given maps.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(map, ...) - Returns the union of all the given maps",
+examples = """
+Examples:
+  > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
+   [[1 -> "a"], [2 -> "c"], [3 -> "d"]
+  """)
+case class MapConcat(children: Seq[Expression]) extends Expression
+  with CodegenFallback {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+// this check currently does not allow valueContainsNull to vary,
+// and unfortunately none of the MapType toString methods include
+// valueContainsNull for the error message
+if (children.exists(!_.dataType.isInstanceOf[MapType])) {
+  TypeCheckResult.TypeCheckFailure(
+s"The given input of function $prettyName should all be of type map, " +
+  "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+} else if (children.map(_.dataType).distinct.length > 1) {
+  TypeCheckResult.TypeCheckFailure(
+s"The given input maps of function $prettyName should all be the same type, " +
+  "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+} else {
+  TypeCheckResult.TypeCheckSuccess
+}
+  }
+  override def dataType: MapType = {
+children.headOption.map(_.dataType.asInstanceOf[MapType])
+  .getOrElse(MapType(keyType = StringType, valueType = StringType))
+  }
+
+  override def nullable: Boolean = false
--- End diff --

What's the result of `map_concat(NULL, NULL)`?


---




[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat

2018-04-14 Thread bersprockets
GitHub user bersprockets opened a pull request:

https://github.com/apache/spark/pull/21073

[SPARK-23936][SQL][WIP] Implement map_concat

## What changes were proposed in this pull request?

Implement the map_concat higher-order function.

This is a work in progress.

There's no code generation yet.

My current implementation of MapConcat.checkInputDataTypes does not allow 
valueContainsNull to vary between the input maps. I'm not sure what the 
requirements are here.

I am using a java.util.Map implementation to merge all the maps together, 
since that is a very straightforward approach. I chose 
java.util.LinkedHashMap because:

- It allows for a predictable tuple order (essentially, the original 
insertion order) when building the result MapData. Tests, like pyspark's 
doctests, that rely on tuple order will work across Java versions.
- java.util.LinkedHashMap seems to be about as fast as java.util.HashMap, 
at least when used to concatenate big (500+ key/value) maps for 150k rows, and 
it's much faster than the Scala LinkedHashMap implementation.
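
The merge approach described above can be sketched as follows (a simplified illustration on plain Scala maps, not the actual Catalyst eval code; the `mapConcat` name is hypothetical):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.JavaConverters._

// Hypothetical sketch: merge several maps so that a later map's value wins
// for a duplicate key, while java.util.LinkedHashMap preserves each key's
// first-insertion position in the iteration order.
def mapConcat[K, V](maps: Seq[Map[K, V]]): Seq[(K, V)] = {
  val merged = new JLinkedHashMap[K, V]()
  for (m <- maps; (k, v) <- m) {
    merged.put(k, v) // duplicate key: value replaced, order position kept
  }
  merged.entrySet().iterator().asScala.map(e => (e.getKey, e.getValue)).toSeq
}
```

With the inputs from the ExpressionDescription example, `mapConcat(Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "c", 3 -> "d")))` yields the pairs `(1, "a"), (2, "c"), (3, "d")` in that order.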

## How was this patch tested?

- New tests
- Manual tests
- Ran all sbt SQL tests
- Ran all pyspark SQL tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bersprockets/spark SPARK-23936

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21073.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21073


commit 707330cd88b269cb1bbee83b9b6476d05c8d177c
Author: Bruce Robbins 
Date:   2018-04-14T23:52:37Z

Initial commit

commit 84d696313972a237691eb46cad6a478167dbabee
Author: Bruce Robbins 
Date:   2018-04-15T02:04:45Z

Remove unused variable in test

commit d04893bccbd2772eb937125895519228588e74b4
Author: Bruce Robbins 
Date:   2018-04-15T03:35:47Z

Cleanup




---
