Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19479#discussion_r150391475
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
---
@@ -275,6 +313,127 @@ object ColumnStat extends Logging {
avgLen = row.getLong(4),
maxLen = row.getLong(5)
)
+ if (row.isNullAt(6)) {
+ cs
+ } else {
+ val ndvs = row.getArray(6).toLongArray()
+ assert(percentiles.get.numElements() == ndvs.length + 1)
+ val endpoints =
percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
+ // Construct equi-height histogram
+ val bins = ndvs.zipWithIndex.map { case (ndv, i) =>
+ HistogramBin(endpoints(i), endpoints(i + 1), ndv)
+ }
+ val nonNullRows = rowCount - cs.nullCount
+ val histogram = Histogram(nonNullRows.toDouble / ndvs.length, bins)
+ cs.copy(histogram = Some(histogram))
+ }
+ }
+
+}
+
+/**
+ * This class is an implementation of equi-height histogram.
+ * Equi-height histogram represents the distribution of a column's values
by a sequence of bins.
+ * Each bin has a value range and contains approximately the same number
of rows.
+ * @param height number of rows in each bin
+ * @param bins equi-height histogram bins
+ */
+case class Histogram(height: Double, bins: Array[HistogramBin]) {
+
+ // Only for histogram equality test.
+ override def equals(other: Any): Boolean = other match {
+ case otherHgm: Histogram =>
+ height == otherHgm.height && bins.sameElements(otherHgm.bins)
+ case _ => false
+ }
+
+ override def hashCode(): Int = {
+ val temp = java.lang.Double.doubleToLongBits(height)
+ var result = (temp ^ (temp >>> 32)).toInt
+ result = 31 * result +
java.util.Arrays.hashCode(bins.asInstanceOf[Array[AnyRef]])
+ result
+ }
+}
+
+/**
+ * A bin in an equi-height histogram. We use double type for lower/higher
bound for simplicity.
+ * @param lo lower bound of the value range in this bin
+ * @param hi higher bound of the value range in this bin
+ * @param ndv approximate number of distinct values in this bin
+ */
+case class HistogramBin(lo: Double, hi: Double, ndv: Long)
+
+object HistogramSerializer {
+ /**
+ * Serializes a given histogram to a string. For advanced statistics
like histograms, sketches,
+ * etc, we don't provide readability for their serialized formats in
metastore
+ * (string-to-string table properties). This is because it's hard or
unnatural for these
+ * statistics to be human readable. For example, a histogram usually
cannot fit in a single,
+ * self-described property. And for count-min-sketch, it's essentially
unnatural to make it
+ * a readable string.
+ */
+ final def serialize(histogram: Histogram): String = {
+ val bos = new ByteArrayOutputStream()
+ val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
+ out.writeDouble(histogram.height)
+ out.writeInt(histogram.bins.length)
+ // Write data with same type together for compression.
+ var i = 0
+ while (i < histogram.bins.length) {
+ out.writeDouble(histogram.bins(i).lo)
+ i += 1
+ }
+ i = 0
+ while (i < histogram.bins.length) {
+ out.writeDouble(histogram.bins(i).hi)
+ i += 1
+ }
+ i = 0
+ while (i < histogram.bins.length) {
+ out.writeLong(histogram.bins(i).ndv)
+ i += 1
+ }
+ out.writeInt(-1)
+ out.flush()
+ out.close()
+
+
org.apache.commons.codec.binary.Base64.encodeBase64String(bos.toByteArray)
--- End diff --
cc @rxin , the histogram is not very human-readable anyway, and it wastes a
lot of spaces if we use plain string to represent it, which may easily hit hive
metastore limitation(4k value length of table property). Here we pick a stable
encoding(bin1_low, bin2_low, ... bin1_high, bin2_high, ... bin1_ndv, bin2_ndv,
...) and turn the binary to base64string, to save space.
As long as we don't change the histogram implementation, this approach
won't have backward compatibility issuse. If we do wanna change the
implementation, we can treat it as a new statistics.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]