[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

cloud-fan Sat, 11 Nov 2017 11:29:35 -0800

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150391475
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
    @@ -275,6 +313,127 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = 
percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val bins = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBin(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, bins)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values 
by a sequence of bins.
    + * Each bin has a value range and contains approximately the same number 
of rows.
    + * @param height number of rows in each bin
    + * @param bins equi-height histogram bins
    + */
    +case class Histogram(height: Double, bins: Array[HistogramBin]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && bins.sameElements(otherHgm.bins)
    +    case _ => false
    +  }
    +
    +  override def hashCode(): Int = {
    +    val temp = java.lang.Double.doubleToLongBits(height)
    +    var result = (temp ^ (temp >>> 32)).toInt
    +    result = 31 * result + 
java.util.Arrays.hashCode(bins.asInstanceOf[Array[AnyRef]])
    +    result
    +  }
    +}
    +
    +/**
    + * A bin in an equi-height histogram. We use double type for lower/higher 
bound for simplicity.
    + * @param lo lower bound of the value range in this bin
    + * @param hi higher bound of the value range in this bin
    + * @param ndv approximate number of distinct values in this bin
    + */
    +case class HistogramBin(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics 
like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in 
metastore
    +   * (string-to-string table properties). This is because it's hard or 
unnatural for these
    +   * statistics to be human readable. For example, a histogram usually 
cannot fit in a single,
    +   * self-described property. And for count-min-sketch, it's essentially 
unnatural to make it
    +   * a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.bins.length)
    +    // Write data with same type together for compression.
    +    var i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeDouble(histogram.bins(i).lo)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeDouble(histogram.bins(i).hi)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeLong(histogram.bins(i).ndv)
    +      i += 1
    +    }
    +    out.writeInt(-1)
    +    out.flush()
    +    out.close()
    +
    +    
org.apache.commons.codec.binary.Base64.encodeBase64String(bos.toByteArray)
    --- End diff --
    
    cc @rxin , the histogram is not very human-readable anyway, and it wastes a 
lot of spaces if we use plain string to represent it, which may easily hit hive 
metastore limitation(4k value length of table property). Here we pick a stable 
encoding(bin1_low, bin2_low, ... bin1_high, bin2_high, ... bin1_ndv, bin2_ndv, 
...) and turn the binary to base64string, to save space.
    
    As long as we don't change the histogram implementation, this approach 
won't have backward compatibility issuse. If we do wanna change the 
implementation, we can treat it as a new statistics.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Reply via email to