[ 
https://issues.apache.org/jira/browse/HIVE-26243?focusedWorklogId=776193&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-776193
 ]

ASF GitHub Bot logged work on HIVE-26243:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 31/May/22 09:55
            Start Date: 31/May/22 09:55
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on code in PR #3317:
URL: https://github.com/apache/hive/pull/3317#discussion_r885430661


##########
standalone-metastore/pom.xml:
##########
@@ -181,6 +182,17 @@
         <artifactId>commons-lang3</artifactId>
         <version>${commons-lang3.version}</version>
       </dependency>
+      <dependency>
+        <groupId>org.apache.datasketches</groupId>

Review Comment:
   I think this has nothing to do with this patch



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/kll/KllUtils.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram.kll;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.datasketches.memory.Memory;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * KLL serialization utilities.
+ */
+public class KllUtils {

Review Comment:
   unrelated to changes?



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/KllHistogramEstimator.java:
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.common.histogram;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.hadoop.hive.common.histogram.kll.KllUtils;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+
+public class KllHistogramEstimator {

Review Comment:
   remove these classes from the metastore...



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/kll/KllUtils.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram.kll;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.datasketches.memory.Memory;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * KLL serialization utilities.
+ */
+public class KllUtils {
+
+  private KllUtils() {
+    throw new AssertionError("Suppress default constructor for non 
instantiation");
+  }
+
+  /**
+   * KLL is serialized according to what provided by data-sketches library
+   * @param out output stream to write to
+   * @param kll KLL sketch that needs to be serialized
+   * @throws IOException if an error occurs during serialization
+   */
+  public static void serializeKll(OutputStream out, KllFloatsSketch kll) 
throws IOException {
+    out.write(kll.toByteArray());
+  }

Review Comment:
   do we really need these methods/classes/etc? 
   
   this really just binds the class name; the method name with no good 
reason...no interface; what's the architectural value here I don't see?



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/KllHistogramEstimatorFactory.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram;
+
+import org.apache.hadoop.hive.common.histogram.kll.KllUtils;
+
+public class KllHistogramEstimatorFactory {
+
+  private KllHistogramEstimatorFactory() {
+    throw new AssertionError("Suppress default constructor for non 
instantiation");
+  }
+
+  /**
+   * This function deserializes the serialized KLL histogram estimator from a 
byte array.
+   * @param buf to deserialize
+   * @return KLL histogram estimator
+   */
+  public static KllHistogramEstimator getKllHistogramEstimator(byte[] buf) {
+    return new KllHistogramEstimator(KllUtils.deserializeKll(buf, 0, 
buf.length));
+  }
+
+  /**
+   * This function deserializes the serialized KLL histogram estimator from a 
byte array.
+   * @param buf to deserialize
+   * @param start start index for deserialization
+   * @param len start+len is deserialized
+   * @return KLL histogram estimator
+   */
+  public static KllHistogramEstimator getKllHistogramEstimator(byte[] buf, int 
start, int len) {
+    return new KllHistogramEstimator(KllUtils.deserializeKll(buf, start, len));
+  }
+
+  /**
+   * This method creates an empty histogram estimator with a KLL sketch with 
k=200.
+   * @return an empty histogram estimator with a KLL sketch with k=200
+   */
+  public static KllHistogramEstimator getEmptyHistogramEstimator() {

Review Comment:
   there are 3 `getEmptyHistogramEstimator` -s from which only 2 is being used; 
and I think 1 will be too much already...do we need these at all? what's the 
point of introducing these???



##########
ql/src/gen/vectorization/UDAFTemplates/VectorUDAFComputeKLL.txt:
##########
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.expressions.aggregates;
+
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimator;
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimatorFactory;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+#IF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.<InputColumnVectorType>;
+#ENDIF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationBufferRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationDesc;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+/**
+ * Generated from template VectorUDAFComputeKLL.txt.

Review Comment:
   this seems to be the vectorized pair of `ds_kll_sketch` or not? 
   
   can we keep the naming conventions? it will be hard to follow 
what-is-what....without doing so



##########
ql/src/gen/vectorization/UDAFTemplates/VectorUDAFComputeKLL.txt:
##########
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.expressions.aggregates;
+
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimator;
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimatorFactory;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+#IF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.<InputColumnVectorType>;
+#ENDIF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationBufferRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationDesc;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+/**
+ * Generated from template VectorUDAFComputeKLL.txt.
+ */
+@Description(name = "ds_kll_sketch", value = "_FUNC_(x) "
+    + "Returns a KllFloatsSketch in a serialized form as a binary blob."
+    + " Values must be of type float.")
+public class <ClassName> extends VectorAggregateExpression {
+
+  public <ClassName>() {
+    super();
+  }
+
+  public <ClassName>(VectorAggregationDesc vecAggrDesc) {
+    super(vecAggrDesc);
+  }
+
+  @Override
+  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
+    return new Aggregation();
+  }
+
+  @Override
+  public void aggregateInput(AggregationBuffer agg, VectorizedRowBatch batch) 
throws HiveException {
+    inputExpression.evaluate(batch);
+
+#IF COMPLETE
+    <InputColumnVectorType> inputColumn = (<InputColumnVectorType>) 
batch.cols[this.inputExpression.getOutputColumnNum()];
+#ENDIF COMPLETE
+#IF MERGING
+    BytesColumnVector inputColumn = (BytesColumnVector) 
batch.cols[this.inputExpression.getOutputColumnNum()];
+#ENDIF MERGING
+
+    int batchSize = batch.size;
+
+    if (batchSize == 0) {
+      return;
+    }
+
+    Aggregation myagg = (Aggregation) agg;
+
+#IF COMPLETE
+    myagg.prepare();
+    if (inputColumn.noNulls) {
+      if (inputColumn.isRepeating) {
+        for (int i = 0; i < batchSize; i++) {
+          myagg.estimator.addToEstimator(inputColumn.vector[0]);
+        }
+      } else {
+        if (batch.selectedInUse) {
+          for (int s = 0; s < batchSize; s++) {
+            int i = batch.selected[s];
+            myagg.estimator.addToEstimator(inputColumn.vector[i]);
+          }
+        } else {
+          for (int i = 0; i < batchSize; i++) {
+            myagg.estimator.addToEstimator(inputColumn.vector[i]);
+          }
+        }
+      }
+    } else {
+      if (inputColumn.isRepeating) {
+        if (!inputColumn.isNull[0]) {
+          for (int i = 0; i < batchSize; i++) {
+            myagg.estimator.addToEstimator(inputColumn.vector[0]);
+          }
+        }
+      } else {
+        if (batch.selectedInUse) {
+          for (int j = 0; j < batchSize; ++j) {
+            int i = batch.selected[j];
+            if (!inputColumn.isNull[i]) {
+              myagg.estimator.addToEstimator(inputColumn.vector[i]);
+            }
+          }
+        } else {
+          for (int i = 0; i < batchSize; i++) {
+            if (!inputColumn.isNull[i]) {
+              myagg.estimator.addToEstimator(inputColumn.vector[i]);
+            }
+          }
+        }
+      }
+    }
+#ENDIF COMPLETE
+#IF MERGING
+    if (inputColumn.isRepeating) {
+      if (!inputColumn.isNull[0] && inputColumn.length[0] > 0) {
+        myagg.prepare();
+        KllHistogramEstimator mergingKLL = 
KllHistogramEstimatorFactory.getKllHistogramEstimator(

Review Comment:
   do we really have to create and bind 1-by-1 vectorized implementations for 
the datasketches functions ?
   function classes kinda do the same stuff... like `DATA_TO_SKETCH` ; 





Issue Time Tracking
-------------------

            Worklog Id:     (was: 776193)
    Remaining Estimate: 0h
            Time Spent: 10m

> Add vectorized implementation of the 'ds_kll_sketch' UDAF
> ---------------------------------------------------------
>
>                 Key: HIVE-26243
>                 URL: https://issues.apache.org/jira/browse/HIVE-26243
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF, Vectorization
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> _ds_kll_sketch_ UDAF does not have a vectorized implementation at the moment, 
> the present ticket aims at bridging this gap.
> This is particularly important because vectorization has an "all or nothing" 
> approach, so if this function is used at the side of vectorized functions, 
> they won't be able to benefit from vectorized execution.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to