Github user rxin commented on a diff in the pull request:
https://github.com/apache/spark/pull/10851#discussion_r50351683
--- Diff:
common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * An implementation of Count-Min sketch data structure for the following
data types:
+ * <ul>
+ * <li>{@link Byte}</li>
+ * <li>{@link Short}</li>
+ * <li>{@link Integer}</li>
+ * <li>{@link Long}</li>
+ * <li>{@link String}</li>
+ * </ul>
+ * A Count-Min sketch is a probabilistic data structure used for
summarizing streams of data in
+ * sub-linear space. Each {@link CountMinSketch} is initialized with a
random seed, and a pair
+ * of parameters:
+ * <ol>
+ * <li>relative error (or {@code eps}), and
+ * <li>confidence (or {@code delta})
+ * </ol>
+ * Suppose you want to estimate the number of times an element {@code x}
has appeared in a data
+ * stream so far. With probability {@code delta}, the estimate of this
frequency is within the
+ * range {@code true frequency <= estimate <= true frequency + eps * N},
where {@code N} is the
+ * total count of items have appeared the the data stream so far.
+ *
+ * Under the cover, a {@link CountMinSketch} is essentially a
two-dimensional {@code long} array
+ * with depth {@code d} and width {@code w}, where
+ * <ul>
+ * <li>{@code d = ceil(2 / eps)}</li>
+ * <li>{@code w = ceil(-log(1 - confidence) / log(2))}</li>
+ * </ul>
+ *
+ * See http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf for
technical details,
+ * including proofs of the estimates and error bounds used in this
implementation.
+ *
+ * This implementation is largely based on the {@code CountMinSketch}
class from stream-lib.
+ */
+abstract public class CountMinSketch {
+ /**
+ * Returns the relative error (or {@code eps}) of this {@link
CountMinSketch}.
+ */
+ public abstract double relativeError();
+
+ /**
+ * Returns the confidence (or {@code delta}) of this {@link
CountMinSketch}.
+ */
+ public abstract double confidence();
+
+ /**
+ * Depth of this {@link CountMinSketch}.
+ */
+ public abstract int depth();
+
+ /**
+ * Width of this {@link CountMinSketch}.
+ */
+ public abstract int width();
+
+ /**
+ * Total count of items added to this {@link CountMinSketch} so far.
+ */
+ public abstract long totalCount();
+
+ /**
+ * Adds 1 to {@code item}.
+ */
+ public abstract void add(Object item);
+
+ /**
+ * Adds {@code count} to {@code item}.
+ */
+ public abstract void add(Object item, long count);
+
+ /**
+ * Returns the estimated frequency of {@code item}.
+ */
+ public abstract long estimateCount(Object item);
+
+ /**
+ * Merges another {@link CountMinSketch} with this one in place.
+ *
+ * Note that only Count-Min sketches with the same {@code depth}, {@code
width}, and random seed
+ * can be merged.
+ */
+ public abstract CountMinSketch mergeInPlace(CountMinSketch other);
--- End diff --
declare that this could throw some exception?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]