Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7176#discussion_r33807115
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala ---
    @@ -0,0 +1,154 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import java.lang.Double.longBitsToDouble
    +import java.lang.Float.intBitsToFloat
    +
    +import scala.util.Random
    +
    +import org.apache.spark.sql.types._
    +
    +/**
    + * Random data generators for Spark SQL DataTypes. These generators do not generate uniformly random
    + * values; instead, they're biased to return "interesting" values (such as maximum / minimum values)
    + * with higher probability.
    + */
    +object RandomDataGenerator {
    +
    +  /**
    +   * The conditional probability of a non-null value being drawn from a set of "interesting" values
    +   * instead of being chosen uniformly at random.
    +   */
    +  private val PROBABILITY_OF_INTERESTING_VALUE: Float = 0.5f
    +
    +  /**
    +   * The probability of the generated value being null
    +   */
    +  private val PROBABILITY_OF_NULL: Float = 0.1f
    +
    +  private val MAX_STR_LEN: Int = 1024
    +  private val MAX_ARR_SIZE: Int = 128
    +  private val MAX_MAP_SIZE: Int = 128
    +
    +  /**
    +   * Helper function for constructing a biased random number generator which returns "interesting"
    +   * values with a higher probability.
    +   */
    +  private def randomNumeric[T](
    +      rand: Random,
    +      uniformRand: Random => T,
    +      interestingValues: Seq[T]): Some[() => T] = {
    +    val f = () => {
    +      if (rand.nextFloat() <= PROBABILITY_OF_INTERESTING_VALUE) {
    +        interestingValues(rand.nextInt(interestingValues.length))
    +      } else {
    +        uniformRand(rand)
    +      }
    +    }
    +    Some(f)
    +  }
    +
    +  /**
    +   * Returns a function which generates random values for the given [[DataType]], or `None` if no
    +   * random data generator is defined for that data type. The generated values will use an external
    +   * representation of the data type; for example, the random generator for [[DateType]] will return
    +   * instances of [[java.sql.Date]] and the generator for [[StructType]] will return a
    +   * [[org.apache.spark.Row]].
    +   *
    +   * @param dataType the type to generate values for
    +   * @param nullable whether null values should be generated
    +   * @param seed an optional seed for the random number generator
    +   * @return a function which can be called to generate random values.
    +   */
    +  def forType(
    +      dataType: DataType,
    +      nullable: Boolean = true,
    +      seed: Option[Long] = None): Option[() => Any] = {
    +    val rand = new Random()
    +    seed.foreach(rand.setSeed)
    +
    +    val valueGenerator: Option[() => Any] = dataType match {
    +      case StringType => Some(() => rand.nextString(rand.nextInt(MAX_STR_LEN)))
    +      case BinaryType => Some(() => {
    +        val arr = new Array[Byte](rand.nextInt(MAX_STR_LEN))
    +        rand.nextBytes(arr)
    +        arr
    +      })
    +      case BooleanType => Some(() => rand.nextBoolean())
    +      case DateType => Some(() => new java.sql.Date(rand.nextInt(Int.MaxValue)))
    +      case DoubleType => randomNumeric[Double](
    +        rand, r => longBitsToDouble(r.nextLong()), Seq(Double.MinValue, Double.MinPositiveValue,
    --- End diff --
    
    Are we using `longBitsToDouble` for better performance? Quoted from the `longBitsToDouble` Javadoc:
    
    ```
         * <p>If the argument is any value in the range
         * {@code 0x7ff0000000000001L} through
         * {@code 0x7fffffffffffffffL} or in the range
         * {@code 0xfff0000000000001L} through
         * {@code 0xffffffffffffffffL}, the result is a NaN.  No IEEE
         * 754 floating-point operation provided by Java can distinguish
         * between two NaN values of the same type with different bit
         * patterns.  Distinct values of NaN are only distinguishable by
         * use of the {@code Double.doubleToRawLongBits} method.
    ```
    
    This implies a higher-than-expected chance of generating NaNs, but it's probably OK?
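    
    For a rough sense of scale (just a sketch, not part of this PR; the object name, seed, and trial count below are arbitrary): the Javadoc ranges above cover 2 * (2^52 - 1) of the 2^64 possible long bit patterns, so a uniformly random long decodes to NaN with probability about 2^-11, i.e. roughly 1 in 2048, whereas `Random.nextDouble()` never returns NaN. Something along these lines can confirm it empirically:
    
    ```
    import java.lang.Double.longBitsToDouble
    
    import scala.util.Random
    
    // Hypothetical standalone check, not part of the PR under review.
    object NaNFrequencyCheck {
      def main(args: Array[String]): Unit = {
        val rand = new Random(42L)  // arbitrary fixed seed
        val trials = 10000000       // arbitrary sample size
        // Count how often a uniformly random long bit pattern decodes to NaN.
        val nanCount = (1 to trials).count(_ => longBitsToDouble(rand.nextLong()).isNaN)
        // NaN bit patterns: exponent bits all ones with a non-zero mantissa,
        // i.e. 2 * (2^52 - 1) of the 2^64 patterns, roughly 2^-11.
        val expected = 2.0 * (math.pow(2, 52) - 1) / math.pow(2, 64)
        println(f"observed NaN fraction: ${nanCount.toDouble / trials}%.6f")
        println(f"expected NaN fraction: $expected%.6f")
      }
    }
    ```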

