Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
The implicit rankings are the output of TF-IDF, i.e.:
each_ranking = frequency of an item * log(total number of customers /
number of customers buying the item)
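
For example, plugging in hypothetical counts just to make the formula
concrete (these numbers are made up):

val totalCustomers = 1000.0        // hypothetical
val customersBuyingItem = 50.0     // hypothetical
val itemFrequency = 3.0            // times this customer bought the item
val eachRanking = itemFrequency * math.log(totalCustomers / customersBuyingItem)
// eachRanking ≈ 8.99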

On Sep 14, 2016 17:14, "Sean Owen" <so...@cloudera.com> wrote:

> What are implicit rankings here?
> RMSE would not be an appropriate measure for comparing rankings. There are
> ranking metrics like mean average precision that would be appropriate
> instead.
>
> On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani <
> pasquinell.urb...@exalitica.com> wrote:
>
>> It was a typo; both are RMSE.
>>
>> The frequency distribution of rankings is the following:
>>
>> [image: inline image 2]
>>
>> As you can see, there is a heavy tail, but the majority of the observations
>> lie near ranking 5.
>>
>> I'm working with implicit rankings (generated by TF-IDF); can this affect
>> the error? (I'm currently using trainImplicit in ALS, Spark 1.6.2.)
>>
>> Thank you.
>>
>>
>>
>> 2016-09-14 16:49 GMT-03:00 Sean Owen <so...@cloudera.com>:
>>
>>> There is no way to answer this without knowing what your inputs are
>>> like. If they're on the scale of thousands, that's small (good). If
>>> they're on the scale of 1-5, that's extremely poor.
>>>
>>> What's RMS vs RMSE?
>>>
>>> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>>> <pasquinell.urb...@exalitica.com> wrote:
>>> > Hi Community
>>> >
>>> > I'm performing ALS for retail product recommendation. Right now I'm
>>> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
>>> > experience? Is the transformation of the ranking values important for
>>> > having good errors?
>>> >
>>> > Thank you all.
>>> >
>>> > Pasquinell Urbani
>>>
>>
>>
>


Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
It was a typo; both are RMSE.

The frequency distribution of rankings is the following:

[image: inline image 2]

As you can see, there is a heavy tail, but the majority of the observations
lie near ranking 5.

I'm working with implicit rankings (generated by TF-IDF); can this affect
the error? (I'm currently using trainImplicit in ALS, Spark 1.6.2.)
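
For reference, my setup looks roughly like the following (a minimal sketch,
not my exact code; the hyperparameters and column handling are assumptions),
including a ranking metric as an alternative to RMSE:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.mllib.evaluation.RankingMetrics

// ratings: RDD[Rating], where the rating field holds the TF-IDF score
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)

// Compare the top-k recommendations against the items each user actually
// interacted with, and report mean average precision (MAP).
val k = 10
val predicted = model.recommendProductsForUsers(k).mapValues(_.map(_.product))
val actual = ratings.map(r => (r.user, r.product)).groupByKey().mapValues(_.toArray)
val metrics = new RankingMetrics(predicted.join(actual).values)
println(s"MAP = ${metrics.meanAveragePrecision}")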

Thank you.



2016-09-14 16:49 GMT-03:00 Sean Owen <so...@cloudera.com>:

> There is no way to answer this without knowing what your inputs are
> like. If they're on the scale of thousands, that's small (good). If
> they're on the scale of 1-5, that's extremely poor.
>
> What's RMS vs RMSE?
>
> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
> <pasquinell.urb...@exalitica.com> wrote:
> > Hi Community
> >
> > I'm performing ALS for retail product recommendation. Right now I'm
> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
> > experience? Is the transformation of the ranking values important for
> > having good errors?
> >
> > Thank you all.
> >
> > Pasquinell Urbani
>


RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
Hi Community

I'm performing ALS for retail product recommendation. Right now I'm
reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
experience? Is the transformation of the ranking values important for
having good errors?
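
For reference, the test error is computed roughly like this (a sketch, not
my exact code; model is the trained MatrixFactorizationModel and test holds
the held-out ratings):

// Predict a rating for every (user, product) pair in the test set,
// then take the root mean squared difference from the actual values.
val predictions = model.predict(test.map(r => (r.user, r.product)))
  .map(p => ((p.user, p.product), p.rating))
val actuals = test.map(r => ((r.user, r.product), r.rating))
val rmseTest = math.sqrt(
  predictions.join(actuals).values
    .map { case (pred, actual) => (pred - actual) * (pred - actual) }
    .mean())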

Thank you all.

Pasquinell Urbani


Perform an ALS with TF-IDF output (spark 2.0)

2016-08-25 Thread Pasquinell Urbani
Hi there

I am building a product recommendation system for retail. I have been
able to compute the TF-IDF of a user-item data frame in Spark 2.0.

Now I need to transform the TF-IDF output into a data frame with columns
(user_id, item_id, TF_IDF_ratings) in order to run ALS, but I have
no clue how to do it.
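
What I have in mind is roughly the following, though I am not sure it is
right (a sketch: the column names and the index-to-item lookup itemForIndex,
a Map[Int, String] built beforehand, are placeholders, not code I have):

import org.apache.spark.ml.linalg.SparseVector

// tfidfDF is assumed to have columns (user_id, tfidf_features), where
// tfidf_features is the Vector column produced by the IDF stage.
val triples = tfidfDF.select("user_id", "tfidf_features").rdd.flatMap { row =>
  val user = row.getAs[Int]("user_id")
  val vec = row.getAs[SparseVector]("tfidf_features")
  // Each non-zero entry becomes one (user, item, rating) row;
  // itemForIndex maps the hashed index back to an item id.
  vec.indices.zip(vec.values).map { case (idx, score) =>
    (user, itemForIndex(idx), score)
  }
}
val ratingsDF = spark.createDataFrame(triples)
  .toDF("user_id", "item_id", "TF_IDF_ratings")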

Can anybody give me some help?

Thank you all.


Re: QuantileDiscretizer not working properly with big dataframes

2016-07-12 Thread Pasquinell Urbani
In the JIRA issue mentioned below, the following solution is suggested.

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
fixed by changing line 113 like so:
before: val requiredSamples = math.max(numBins * numBins, 1)
after: val requiredSamples = math.max(numBins * numBins, 1.0)
(presumably so that requiredSamples is a Double and the derived sampling
fraction is no longer truncated to zero for large dataframes)

Is there another way?
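
In the meantime, a possible workaround I am considering (a sketch: it
computes the split points from a sample by hand and feeds them to
Bucketizer, and it assumes the sampled quantile boundaries are distinct):

import org.apache.spark.ml.feature.Bucketizer

val numBuckets = 5
// Sort a sample of the column and read approximate quantile boundaries off it.
val sample = df3.select("C4").rdd.map(_.getDouble(0))
  .sample(false, 0.1).collect().sorted
val quantiles = (1 until numBuckets)
  .map(i => sample((sample.length - 1) * i / numBuckets))
val splits = (Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity).toArray

val bucketized = new Bucketizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setSplits(splits)
  .transform(df3)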


2016-07-11 18:28 GMT-04:00 Pasquinell Urbani <
pasquinell.urb...@exalitica.com>:

> Hi all,
>
> We have a dataframe with 2.5 million records and 13 features. We want
> to perform a logistic regression with this data, but first we need to
> discretize each column using QuantileDiscretizer. This will
> improve the performance of the model by reducing the impact of outliers.
>
> For small dataframes QuantileDiscretizer works perfectly (see the ml
> example:
> https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
> but for large dataframes it tends to split the column into only the values
> 0 and 1 (even though the number of buckets is set to 5). Here is my
> code:
>
> val discretizer = new QuantileDiscretizer()
>   .setInputCol("C4")
>   .setOutputCol("C4_Q")
>   .setNumBuckets(5)
>
> val result = discretizer.fit(df3).transform(df3)
> result.show()
>
> I found the same problem presented here:
> https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
> solution yet.
>
> Am I configuring the function incorrectly? Should I pre-process the
> data (e.g., with z-scores)? Can somebody help me deal with this?
>
> Regards
>


QuantileDiscretizer not working properly with big dataframes

2016-07-11 Thread Pasquinell Urbani
Hi all,

We have a dataframe with 2.5 million records and 13 features. We want
to perform a logistic regression with this data, but first we need to
discretize each column using QuantileDiscretizer. This will
improve the performance of the model by reducing the impact of outliers.

For small dataframes QuantileDiscretizer works perfectly (see the ml example:
https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
but for large dataframes it tends to split the column into only the values
0 and 1 (even though the number of buckets is set to 5). Here is my
code:

val discretizer = new QuantileDiscretizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setNumBuckets(5)

val result = discretizer.fit(df3).transform(df3)
result.show()

I found the same problem presented here:
https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
solution yet.

Am I configuring the function incorrectly? Should I pre-process the
data (e.g., with z-scores)? Can somebody help me deal with this?

Regards


Iterate over columns in sql.dataframe

2016-07-08 Thread Pasquinell Urbani
Hi all

I need to apply QuantileDiscretizer() over a 16-column sql.DataFrame.
What is the most efficient way to apply a function to each column? Do I
need to iterate over the columns, and if so, what is the best way to do it?
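
So far the closest I have is chaining one discretizer per column in a
single Pipeline (a sketch; df stands for my actual dataframe):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.QuantileDiscretizer

val inputCols = df.columns  // or an explicit Array of the 16 column names
val stages: Array[PipelineStage] = inputCols.map { c =>
  new QuantileDiscretizer()
    .setInputCol(c)
    .setOutputCol(s"${c}_Q")
    .setNumBuckets(5)
}
val discretized = new Pipeline().setStages(stages).fit(df).transform(df)

Is that reasonable, or is there a more efficient pattern?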

Thank you all.


Change from distributed.MatrixEntry to Vector

2016-06-23 Thread Pasquinell Urbani
Hello all,

I have to build an item-based recommendation system. First I obtained the
similarity matrix with Twitter's DIMSUM cosine similarity solution (
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum). The
similarity matrix is in the following format:
org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.distributed.MatrixEntry].
The matrix, named simsEstimate, is obtained from the following code:


import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// M = number of rows, U = number of columns, NNZ = non-zeros per row,
// NUMCHUNKS = number of partitions (all defined elsewhere in my code).
val R = sc.parallelize(0 until M, NUMCHUNKS).flatMap { i =>
  // Pick NNZ distinct random column indices for row i.
  val inds = new scala.collection.mutable.TreeSet[Int]()
  while (inds.size < NNZ) {
    inds += scala.util.Random.nextInt(U)
  }
  inds.toArray.map(j => MatrixEntry(i, j, scala.math.random))
}
val mat = new CoordinateMatrix(R, M, U).toRowMatrix()

val simsEstimate = mat.columnSimilarities(0.8)


After this, I need to perform an ElementwiseProduct involving the columns
of the similarity matrix, but that operation needs its input to be in
org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] format.


Can anybody tell me how to manipulate the MatrixEntry representation in
order to obtain its component column vectors as an
org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]?
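
So far the closest I have is grouping the entries of each column into a
sparse vector (a sketch; I am not sure it is the idiomatic way, and it
assumes simsEstimate is the CoordinateMatrix returned by
columnSimilarities, so simsEstimate.entries is the RDD[MatrixEntry]):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Build one sparse vector per column, of length M (the number of rows,
// defined above).
val colVectors: RDD[Vector] = simsEstimate.entries
  .map(e => (e.j, (e.i.toInt, e.value)))
  .groupByKey()
  .map { case (_, iv) =>
    val (indices, values) = iv.toArray.sortBy(_._1).unzip
    Vectors.sparse(M, indices, values)
  }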


TFIDF question

2016-05-23 Thread Pasquinell Urbani
Hi all,

I'm following a TF-IDF example, but I'm having some issues that I'm not
sure how to fix.

The input is the following

val test = sc.textFile("s3n://.../test_tfidf_products.txt")
test.collect.mkString("\n")

which prints

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at textFile
at <console>:121 res241: String = a a b c d e b c d d

After that, I compute the ratings by doing

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val test2 = test.map(_.split(" ").toSeq)
val hashingTF2 = new HashingTF()
val tf2: RDD[Vector] = hashingTF2.transform(test2)  // term frequencies
tf2.cache()
val idf2 = new IDF().fit(tf2)
val tfidf2: RDD[Vector] = idf2.transform(tf2)       // TF-IDF scores
tfidf2.collect.mkString("\n")

which prints

(1048576,[97,98,99,100,101],[0.8109302162163288,0.0,0.0,0.0,0.4054651081081644])
(1048576,[98,99,100],[0.0,0.0,0.0])

The numbers [97,98,99,100,101] are the indices into the vectors of tfidf2.

I need to access the rating of, for example, item "a", but the only way I
have been able to do this is by using the indexOf() method of the
hashingTF2 object.

hashingTF2.indexOf("a")

res236: Int = 97
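
To avoid calling indexOf() term by term, the best I have come up with is
building the lookup once up front (a sketch; the vocabulary list is a
placeholder for the distinct items in my data):

// Map each known term to its hashed index once.
val vocabulary = Seq("a", "b", "c", "d", "e")
val indexFor: Map[String, Int] =
  vocabulary.map(term => term -> hashingTF2.indexOf(term)).toMap

// Rating of item "a" in every document vector.
val aRatings = tfidf2.map(v => v(indexFor("a")))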


Is there a better way to do this?


Thank you all.


Problems finding the original objects after HashingTF()

2016-05-20 Thread Pasquinell Urbani
Hi all,

I'm following a TF-IDF example, but I'm having some issues that I'm not
sure how to fix.

The input is the following

val test = sc.textFile("s3n://.../test_tfidf_products.txt")
test.collect.mkString("\n")

which prints

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at textFile
at <console>:121 res241: String = a a b c d e b c d d

After that, I compute the ratings by doing

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val test2 = test.map(_.split(" ").toSeq)
val hashingTF2 = new HashingTF()
val tf2: RDD[Vector] = hashingTF2.transform(test2)  // term frequencies
tf2.cache()
val idf2 = new IDF().fit(tf2)
val tfidf2: RDD[Vector] = idf2.transform(tf2)       // TF-IDF scores
tfidf2.collect.mkString("\n")

which prints

(1048576,[97,98,99,100,101],[0.8109302162163288,0.0,0.0,0.0,0.4054651081081644])
(1048576,[98,99,100],[0.0,0.0,0.0])

The numbers [97,98,99,100,101] are the indices into the vectors of tfidf2.

I need to access the rating of, for example, item "a", but the only way I
have been able to do this is by using the indexOf() method of the
hashingTF2 object.

hashingTF2.indexOf("a")

res236: Int = 97


Is there a better way to do this?


Thank you all.