[GitHub] spark pull request: [SPARK-4366] [SQL] [WIP] Aggregation Improveme...

chenghao-intel Fri, 17 Jul 2015 04:06:08 -0700

Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7458#discussion_r34880526
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate2/aggregates.scala
 ---
    @@ -0,0 +1,279 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.expressions.aggregate2
    +
    +import org.apache.spark.sql.catalyst.dsl.expressions._
    +import org.apache.spark.sql.catalyst.errors.TreeNodeException
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types._
    +
    +/** The mode of an [[AggregateFunction]]. */
    +private[sql] sealed trait AggregateMode
    +
    +/**
    + * An [[AggregateFunction]] with [[Partial]] mode is used for partial 
aggregation.
    + * This function updates the given aggregation buffer with the original 
input of this
    + * function. When it has processed all input rows, the aggregation buffer 
is returned.
    + */
    +private[sql] case object Partial extends AggregateMode
    +
    +/**
    + * An [[AggregateFunction]] with [[PartialMerge]] mode is used to merge 
aggregation buffers
    + * containing intermediate results for this function.
    + * This function updates the given aggregation buffer by merging multiple 
aggregation buffers.
    + * When it has processed all input rows, the aggregation buffer is 
returned.
    + */
    +private[sql] case object PartialMerge extends AggregateMode
    +
    +/**
    + * An [[AggregateFunction]] with [[PartialMerge]] mode is used to merge 
aggregation buffers
    + * containing intermediate results for this function and the generate 
final result.
    + * This function updates the given aggregation buffer by merging multiple 
aggregation buffers.
    + * When it has processed all input rows, the final result of this function 
is returned.
    + */
    +private[sql] case object Final extends AggregateMode
    +
    +/**
    + * An [[AggregateFunction2]] with [[Partial]] mode is used to evaluate 
this function directly
    + * from original input rows without any partial aggregation.
    + * This function updates the given aggregation buffer with the original 
input of this
    + * function. When it has processed all input rows, the final result of 
this function is returned.
    + */
    +private[sql] case object Complete extends AggregateMode
    +
    +private[sql] case object NoOp extends Expression {
    +  override def nullable: Boolean = true
    +  override def eval(input: InternalRow): Any = {
    +    throw new TreeNodeException(
    +      this, s"No function to evaluate expression. type: ${this.nodeName}")
    +  }
    +  override def dataType: DataType = NullType
    +  override def children: Seq[Expression] = Nil
    +}
    +
    +/**
    + * A container for an [[AggregateFunction2]] with its [[AggregateMode]] 
and a field
    + * (`isDistinct`) indicating if DISTINCT keyword is specified for this 
function.
    + * @param aggregateFunction
    + * @param mode
    + * @param isDistinct
    + */
    +private[sql] case class AggregateExpression2(
    +    aggregateFunction: AggregateFunction2,
    +    mode: AggregateMode,
    +    isDistinct: Boolean) extends Expression {
    +
    +  override def children: Seq[Expression] = aggregateFunction :: Nil
    +  override def dataType: DataType = aggregateFunction.dataType
    +  override def foldable: Boolean = false
    +  override def nullable: Boolean = aggregateFunction.nullable
    +
    +  override def toString: String = 
s"(${aggregateFunction}2,mode=$mode,isDistinct=$isDistinct)"
    +
    +  override def eval(input: InternalRow = null): Any = {
    +    throw new TreeNodeException(
    +      this, s"No function to evaluate expression. type: ${this.nodeName}")
    +  }
    +}
    +
    +abstract class AggregateFunction2
    +  extends Expression {
    +
    +  self: Product =>
    +
    +  /** An aggregate function is not foldable. */
    +  override def foldable: Boolean = false
    +
    +  /**
    +   * The offset of this function's buffer in the underlying buffer shared 
with other functions.
    +   */
    +  var bufferOffset: Int = 0
    --- End diff --
    
    Instead of explicit expose the `bufferOffset`, probably people prefer using 
the `BoundReference`, it's created by `BindReferences.bindReference`, I think 
we can pass in the associated `BoundReference` (bind with bufferSchema) while 
`initialize` the `AggregateFunction2` in each executor.
    
    We can add the implicit function in this class, to convert `BoundReference` 
=> Int. So people can write code like:
    `mutableRow.isNullAt(BoundReference)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-4366] [SQL] [WIP] Aggregation Improveme...

Reply via email to