Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21073#discussion_r182875327
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -116,6 +118,154 @@ case class MapValues(child: Expression)
       override def prettyName: String = "map_values"
     }
     
    +/**
    + * Returns the union of all the given maps.
    + */
    +@ExpressionDescription(
    +usage = "_FUNC_(map, ...) - Returns the union of all the given maps",
    +examples = """
    +    Examples:
    +      > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
    +       [[1 -> "a"], [2 -> "c"], [3 -> "d"]]
    +  """)
    +case class MapConcat(children: Seq[Expression]) extends Expression
    +  with CodegenFallback {
    +
    +  override def checkInputDataTypes(): TypeCheckResult = {
    +    // this check currently does not allow valueContainsNull to vary,
    +    // and unfortunately none of the MapType toString methods include
    +    // valueContainsNull for the error message
    +    if (children.size < 2) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"$prettyName expects at least two input maps.")
    +    } else if (children.exists(!_.dataType.isInstanceOf[MapType])) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input of function $prettyName should all be of type 
map, " +
    +          "but they are " + 
children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else if (children.map(_.dataType).distinct.length > 1) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input maps of function $prettyName should all be the 
same type, " +
    +          "but they are " + 
children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else {
    +      TypeCheckResult.TypeCheckSuccess
    +    }
    +  }
    +
    +  override def dataType: MapType = {
    +    children.headOption.map(_.dataType.asInstanceOf[MapType])
    +      .getOrElse(MapType(keyType = StringType, valueType = StringType))
    +  }
    +
    +  override def nullable: Boolean = true
    +
    +  override def eval(input: InternalRow): Any = {
    +    val union = new util.LinkedHashMap[Any, Any]()
    +    children.map(_.eval(input)).foreach { raw =>
    +      if (raw == null) {
    +        return null
    +      }
    +      val map = raw.asInstanceOf[MapData]
    +      map.foreach(dataType.keyType, dataType.valueType, (k, v) =>
    +        union.put(k, v)
    +      )
    +    }
    +    val (keyArray, valueArray) = union.entrySet().toArray().map { e =>
    +      val e2 = e.asInstanceOf[java.util.Map.Entry[Any, Any]]
    +      (e2.getKey, e2.getValue)
    +    }.unzip
    +    new ArrayBasedMapData(new GenericArrayData(keyArray), new GenericArrayData(valueArray))
    +  }
    +
    +  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    +    val mapCodes = children.map(c => c.genCode(ctx))
    --- End diff --
    
    Since this logic is big enough (and similar enough to the logic in eval), I wonder if the merge logic should be moved to a utility class and called from both eval and the generated code.
    
    The FromUTCTimestamp expression does something like that: both its eval method and its generated code call utility functions in the DateTimeUtils Scala object. Similarly, the Concat expression's eval method and generated code both call utility functions on UTF8String (although in that case, UTF8String is a Java class).
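    
    For illustration, here is a rough sketch of what such a shared helper might look like. The object name MapDataUtils and its exact signature are hypothetical, not from this PR; the body just lifts the merge logic already in eval above:
    
        import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData}
        import org.apache.spark.sql.types.DataType
    
        // Hypothetical helper object -- a sketch of the suggested refactoring,
        // not code from this PR. Later keys overwrite earlier ones, matching
        // the LinkedHashMap behavior in eval above.
        object MapDataUtils {
          def concat(maps: Seq[MapData], keyType: DataType, valueType: DataType): MapData = {
            val union = new java.util.LinkedHashMap[Any, Any]()
            // Copy every entry from every map; a duplicate key keeps the last value seen.
            maps.foreach { map =>
              map.foreach(keyType, valueType, (k, v) => union.put(k, v))
            }
            // Split the merged entries back into parallel key and value arrays.
            val (keys, values) = union.entrySet().toArray().map { e =>
              val entry = e.asInstanceOf[java.util.Map.Entry[Any, Any]]
              (entry.getKey, entry.getValue)
            }.unzip
            new ArrayBasedMapData(new GenericArrayData(keys), new GenericArrayData(values))
          }
        }
    
    With something like that, eval would just evaluate its children, return null if any input map is null, and delegate to the helper, and doGenCode could emit a call to the same method, the way FromUTCTimestamp's generated code calls into DateTimeUtils. That keeps the interpreted and generated paths from drifting apart.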

