[ 
https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-4902:
-------------------------------
    Description: 
{{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that 
wraps an array or, when there is not enough memory, an iterator. 
The GapSamplingIterator implementation is as follows:
{code}
private val iterDrop: Int => Unit = {
    val arrayClass = Array.empty[T].iterator.getClass
    val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
    data.getClass match {
      case `arrayClass` => ((n: Int) => { data = data.drop(n) })
      case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
      case _ => ((n: Int) => {
          var j = 0
          while (j < n && data.hasNext) {
            data.next()
            j += 1
          }
        })
    }
  }
{code}

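The code does not handle InterruptibleIterator: {{data.getClass}} is the wrapper's class, so neither array case matches.
As a result, the following code cannot use the {{Iterator.drop}} method and falls back to the element-by-element loop: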
{code}
rdd.cache()
rdd.sample(false,0.1)
{code}
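
To illustrate the mismatch, here is a minimal, self-contained sketch that does not depend on Spark. {{WrappedIterator}} is a hypothetical stand-in for InterruptibleIterator (it is assumed to expose the wrapped iterator as {{delegate}}), and {{unwrap}} is an illustrative helper, not a proposed patch. It shows that class-based dispatch sees only the wrapper's class until the wrapper is peeled off.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for InterruptibleIterator: forwards hasNext/next
// to a wrapped iterator and thereby hides the wrapped iterator's class.
class WrappedIterator[T](val delegate: Iterator[T]) extends Iterator[T] {
  def hasNext: Boolean = delegate.hasNext
  def next(): T = delegate.next()
}

object UnwrapSketch {
  // Peel off wrapper layers so that a getClass-based match can see
  // the underlying iterator. Illustrative helper only.
  @tailrec
  def unwrap[T](it: Iterator[T]): Iterator[T] = it match {
    case w: WrappedIterator[T @unchecked] => unwrap(w.delegate)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val arrayIterClass = Array.empty[Int].iterator.getClass
    val wrapped = new WrappedIterator(Array(1, 2, 3, 4, 5).iterator)

    // The wrapper's class never equals the array iterator's class,
    // so the array case in a getClass match is skipped.
    println(wrapped.getClass == arrayIterClass)

    // After unwrapping, the same comparison succeeds and drop can be used.
    val inner = unwrap(wrapped)
    println(inner.getClass == arrayIterClass)
    println(inner.drop(3).toList)
  }
}
```

One trade-off to keep in mind: operating on the unwrapped iterator also skips whatever the wrapper does on each {{hasNext}} or {{next}} call, so an actual fix would need to account for that.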


  was:
{{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that 
wraps an array or, when there is not enough memory, an iterator. 
The GapSamplingIterator implementation is as follows:
{code}
private val iterDrop: Int => Unit = {
    val arrayClass = Array.empty[T].iterator.getClass
    val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
    data.getClass match {
      case `arrayClass` => ((n: Int) => { data = data.drop(n) })
      case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
      case _ => ((n: Int) => {
          var j = 0
          while (j < n && data.hasNext) {
            data.next()
            j += 1
          }
        })
    }
  }
{code}

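The code does not handle InterruptibleIterator: {{data.getClass}} is the wrapper's class, so neither array case matches.
As a result, the following code cannot use the {{Iterator.drop}} method and falls back to the element-by-element loop: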
{code}
rdd.cache()
data.sample(false,0.1)
{code}



> gap-sampling performance optimization
> -------------------------------------
>
>                 Key: SPARK-4902
>                 URL: https://issues.apache.org/jira/browse/SPARK-4902
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Guoqiang Li


