[
https://issues.apache.org/jira/browse/HIVE-26243?focusedWorklogId=776193&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-776193
]
ASF GitHub Bot logged work on HIVE-26243:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 31/May/22 09:55
Start Date: 31/May/22 09:55
Worklog Time Spent: 10m
Work Description: kgyrtkirk commented on code in PR #3317:
URL: https://github.com/apache/hive/pull/3317#discussion_r885430661
##########
standalone-metastore/pom.xml:
##########
@@ -181,6 +182,17 @@
<artifactId>commons-lang3</artifactId>
<version>${commons-lang3.version}</version>
</dependency>
+ <dependency>
+ <groupId>org.apache.datasketches</groupId>
Review Comment:
I think this has nothing to do with this patch
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/kll/KllUtils.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram.kll;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.datasketches.memory.Memory;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * KLL serialization utilities.
+ */
+public class KllUtils {
Review Comment:
unrelated to changes?
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/KllHistogramEstimator.java:
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.common.histogram;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.hadoop.hive.common.histogram.kll.KllUtils;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+
+public class KllHistogramEstimator {
Review Comment:
remove these classes from the metastore...
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/kll/KllUtils.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram.kll;
+
+import org.apache.datasketches.kll.KllFloatsSketch;
+import org.apache.datasketches.memory.Memory;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * KLL serialization utilities.
+ */
+public class KllUtils {
+
+ private KllUtils() {
+ throw new AssertionError("Suppress default constructor for non
instantiation");
+ }
+
+ /**
+ * KLL is serialized according to what provided by data-sketches library
+ * @param out output stream to write to
+ * @param kll KLL sketch that needs to be serialized
+ * @throws IOException if an error occurs during serialization
+ */
+ public static void serializeKll(OutputStream out, KllFloatsSketch kll)
throws IOException {
+ out.write(kll.toByteArray());
+ }
Review Comment:
do we really need these methods/classes/etc?
this really just binds the class name; the method name with no good
reason...no interface; what's the architectural value here I don't see?
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/common/histogram/KllHistogramEstimatorFactory.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hadoop.hive.common.histogram;
+
+import org.apache.hadoop.hive.common.histogram.kll.KllUtils;
+
+public class KllHistogramEstimatorFactory {
+
+ private KllHistogramEstimatorFactory() {
+ throw new AssertionError("Suppress default constructor for non
instantiation");
+ }
+
+ /**
+ * This function deserializes the serialized KLL histogram estimator from a
byte array.
+ * @param buf to deserialize
+ * @return KLL histogram estimator
+ */
+ public static KllHistogramEstimator getKllHistogramEstimator(byte[] buf) {
+ return new KllHistogramEstimator(KllUtils.deserializeKll(buf, 0,
buf.length));
+ }
+
+ /**
+ * This function deserializes the serialized KLL histogram estimator from a
byte array.
+ * @param buf to deserialize
+ * @param start start index for deserialization
+ * @param len start+len is deserialized
+ * @return KLL histogram estimator
+ */
+ public static KllHistogramEstimator getKllHistogramEstimator(byte[] buf, int
start, int len) {
+ return new KllHistogramEstimator(KllUtils.deserializeKll(buf, start, len));
+ }
+
+ /**
+ * This method creates an empty histogram estimator with a KLL sketch with
k=200.
+ * @return an empty histogram estimator with a KLL sketch with k=200
+ */
+ public static KllHistogramEstimator getEmptyHistogramEstimator() {
Review Comment:
there are 3 `getEmptyHistogramEstimator` -s from which only 2 is being used;
and I think 1 will be too much already...do we need these at all? what's the
point of introducing these???
##########
ql/src/gen/vectorization/UDAFTemplates/VectorUDAFComputeKLL.txt:
##########
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.expressions.aggregates;
+
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimator;
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimatorFactory;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+#IF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.<InputColumnVectorType>;
+#ENDIF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationBufferRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationDesc;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+/**
+ * Generated from template VectorUDAFComputeKLL.txt.
Review Comment:
this seems to be the vectorized pair of `ds_kll_sketch` or not?
can we keep the naming conventions? it will be hard to follow
what-is-what....without doing so
##########
ql/src/gen/vectorization/UDAFTemplates/VectorUDAFComputeKLL.txt:
##########
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.expressions.aggregates;
+
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimator;
+import org.apache.hadoop.hive.common.histogram.KllHistogramEstimatorFactory;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+#IF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.<InputColumnVectorType>;
+#ENDIF COMPLETE
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationBufferRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorAggregationDesc;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+
+/**
+ * Generated from template VectorUDAFComputeKLL.txt.
+ */
+@Description(name = "ds_kll_sketch", value = "_FUNC_(x) "
+ + "Returns a KllFloatsSketch in a serialized form as a binary blob."
+ + " Values must be of type float.")
+public class <ClassName> extends VectorAggregateExpression {
+
+ public <ClassName>() {
+ super();
+ }
+
+ public <ClassName>(VectorAggregationDesc vecAggrDesc) {
+ super(vecAggrDesc);
+ }
+
+ @Override
+ public AggregationBuffer getNewAggregationBuffer() throws HiveException {
+ return new Aggregation();
+ }
+
+ @Override
+ public void aggregateInput(AggregationBuffer agg, VectorizedRowBatch batch)
throws HiveException {
+ inputExpression.evaluate(batch);
+
+#IF COMPLETE
+ <InputColumnVectorType> inputColumn = (<InputColumnVectorType>)
batch.cols[this.inputExpression.getOutputColumnNum()];
+#ENDIF COMPLETE
+#IF MERGING
+ BytesColumnVector inputColumn = (BytesColumnVector)
batch.cols[this.inputExpression.getOutputColumnNum()];
+#ENDIF MERGING
+
+ int batchSize = batch.size;
+
+ if (batchSize == 0) {
+ return;
+ }
+
+ Aggregation myagg = (Aggregation) agg;
+
+#IF COMPLETE
+ myagg.prepare();
+ if (inputColumn.noNulls) {
+ if (inputColumn.isRepeating) {
+ for (int i = 0; i < batchSize; i++) {
+ myagg.estimator.addToEstimator(inputColumn.vector[0]);
+ }
+ } else {
+ if (batch.selectedInUse) {
+ for (int s = 0; s < batchSize; s++) {
+ int i = batch.selected[s];
+ myagg.estimator.addToEstimator(inputColumn.vector[i]);
+ }
+ } else {
+ for (int i = 0; i < batchSize; i++) {
+ myagg.estimator.addToEstimator(inputColumn.vector[i]);
+ }
+ }
+ }
+ } else {
+ if (inputColumn.isRepeating) {
+ if (!inputColumn.isNull[0]) {
+ for (int i = 0; i < batchSize; i++) {
+ myagg.estimator.addToEstimator(inputColumn.vector[0]);
+ }
+ }
+ } else {
+ if (batch.selectedInUse) {
+ for (int j = 0; j < batchSize; ++j) {
+ int i = batch.selected[j];
+ if (!inputColumn.isNull[i]) {
+ myagg.estimator.addToEstimator(inputColumn.vector[i]);
+ }
+ }
+ } else {
+ for (int i = 0; i < batchSize; i++) {
+ if (!inputColumn.isNull[i]) {
+ myagg.estimator.addToEstimator(inputColumn.vector[i]);
+ }
+ }
+ }
+ }
+ }
+#ENDIF COMPLETE
+#IF MERGING
+ if (inputColumn.isRepeating) {
+ if (!inputColumn.isNull[0] && inputColumn.length[0] > 0) {
+ myagg.prepare();
+ KllHistogramEstimator mergingKLL =
KllHistogramEstimatorFactory.getKllHistogramEstimator(
Review Comment:
do we really have to create and bind 1-by-1 vectorized implementations for
the datasketches functions ?
function classes kinda do the same stuff... like `DATA_TO_SKETCH` ;
Issue Time Tracking
-------------------
Worklog Id: (was: 776193)
Remaining Estimate: 0h
Time Spent: 10m
> Add vectorized implementation of the 'ds_kll_sketch' UDAF
> ---------------------------------------------------------
>
> Key: HIVE-26243
> URL: https://issues.apache.org/jira/browse/HIVE-26243
> Project: Hive
> Issue Type: Improvement
> Components: UDF, Vectorization
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Assignee: Alessandro Solimando
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> _ds_kll_sketch_ UDAF does not have a vectorized implementation at the moment,
> the present ticket aims at bridging this gap.
> This is particularly important because vectorization has an "all or nothing"
> approach, so if this function is used at the side of vectorized functions,
> they won't be able to benefit from vectorized execution.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)