Hi Shivani,

You have misunderstood the first parameter of SparseVector.

class SparseVector(
    override val size: Int,
    val indices: Array[Int],
    val values: Array[Double]) extends Vector {
}

The first parameter is the total length of the vector, not the number of
non-zero elements, so it must be greater than the largest non-zero index,
which is 21 in your case.
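A quick illustration (a minimal sketch using the Vectors.sparse factory in
org.apache.spark.mllib.linalg, which builds the same SparseVector as the
constructor):

    import org.apache.spark.mllib.linalg.Vectors
    // size = 22 is valid here because it is greater than the largest index, 21
    val v = Vectors.sparse(22, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0))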
With that fix, the following code should work:

// vector length is 22 (> the largest index, 21), not the number of non-zero entries
val doc1s = new IndexedRow(1L, new SSV(22, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
val doc2s = new IndexedRow(2L, new SSV(22, Array(1, 2, 4, 13), Array(0.0, 1.0, 2.0, 0.0)))
val doc3s = new IndexedRow(3L, new SSV(22, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
val doc4s = new IndexedRow(4L, new SSV(22, Array(3, 7, 13, 20), Array(2.0, 0.0, 2.0, 1.0)))

2014-11-26 10:09 GMT+08:00 Shivani Rao <raoshiv...@gmail.com>:

> Hello Spark fans,
>
> I am trying to use the IDF model available in Spark MLlib to create a
> tf-idf representation of an RDD[Vector]. Below I have attached my MWE.
>
> I get the following error
>
> "java.lang.IndexOutOfBoundsException: 7 not in [-4,4)
> at breeze.linalg.DenseVector.apply$mcI$sp(DenseVector.scala:70)
> at breeze.linalg.DenseVector.apply(DenseVector.scala:69)
> at
> org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.add(IDF.scala:81)
> "
>
> Any ideas?
>
> Regards,
> Shivani
>
> import org.apache.spark.mllib.feature.VectorTransformer
>
> import com.box.analytics.ml.dms.vector.{SparkSparseVector,SparkDenseVector}
>
> import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector =>
> SSV}
>
> import org.apache.spark.mllib.linalg.{Vector => SparkVector}
>
> import org.apache.spark.mllib.linalg.distributed.{IndexedRow,
> IndexedRowMatrix}
>
> import org.apache.spark.mllib.feature._
>
>
>     val doc1s = new IndexedRow(1L, new SSV(4, Array(1, 3, 5, 7),Array(1.0,
> 1.0, 0.0, 5.0)))
>
>     val doc2s = new IndexedRow(2L, new SSV(4, Array(1, 2, 4, 13),
> Array(0.0, 1.0, 2.0, 0.0)))
>
>     val doc3s = new IndexedRow(3L, new SSV(4, Array(10, 14, 20,
> 21),Array(2.0, 0.0, 2.0, 1.0)))
>
>     val doc4s = new IndexedRow(4L, new SSV(4, Array(3, 7, 13,
> 20),Array(2.0, 0.0, 2.0, 1.0)))
>
>  val indata =
> sc.parallelize(List(doc1s,doc2s,doc3s,doc4s)).map(e=>e.vector)
>
> (new IDF()).fit(indata).idf
>
> --
> Software Engineer
> Analytics Engineering Team@ Box
> Mountain View, CA
>
