[jira] [Commented] (MAHOUT-1489) Interactive Scala & Spark Bindings Shell & Script processor

2014-03-27 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950290#comment-13950290
 ] 

Dmitriy Lyubimov commented on MAHOUT-1489:
--

yeah.
Realistically, functionality-wise there is not that much we need to add
here. It is the basic Spark shell, plus:

(1) with the Mahout classpath of mahout-spark and its transitive dependencies
added in addition to the Spark stuff;
(2) importing our standard things automatically (i.e.
o.a.m.sparkbindings._, o.a.m.sparkbindings.drm._, RLikeDrmOps._ etc. per the
manual -- make that a default package-import list that is easy to add to as we
add e.g. the data frames DSL); a sketch of such an import block is below.

This is not that much; no fundamental hacks are required. In fact, I
have done (2)-like things a lot with the standard Scala interpreter. In
our case we of course cannot use the standard Scala interpreter, because
we need Spark to sync whatever new closures we put into the script with
the backend for us. But we can probably just inherit from the Spark
interpreter and then modify its automatic imports. The classpath
issues should be handled by the mahout.sh script.
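A sketch of what that default import block could look like (the exact package
list is per the manual; treat it as an assumption until it is actually wired
into the shell):

{code}
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.sparkbindings._
import org.apache.mahout.sparkbindings.drm._
import org.apache.mahout.sparkbindings.drm.RLikeDrmOps._
{code}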




 Interactive Scala & Spark Bindings Shell & Script processor
 ---

 Key: MAHOUT-1489
 URL: https://issues.apache.org/jira/browse/MAHOUT-1489
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 1.0
Reporter: Saikat Kanjilal
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Build an interactive shell / script processor (just like the Spark shell); something very 
 similar to R's interactive/script-runner mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1493) Port Naive Bayes to the Spark DSL

2014-03-27 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950283#comment-13950283
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1493 at 3/28/14 2:30 AM:
---

I don't think you meant run() to return Unit. 

Also I am not sure using a class is justified.

In most cases, I would favor dropping classes in favor of functions, albeit 
with a fairly long parameter list populated with default values.

The pattern I am following is to create a pithy and expressive name (such as 
ssvd()) for a function (in this case it could be something like trainNB) inside 
a Scala object (singleton) and then re-export that as a top-level package 
function, so one can say something like 

{code}
import decompositions._
val nbmodel = trainNB(...)
...
{code}
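A hypothetical sketch of that pattern (trainNB, NBModel and the parameter names
here are illustrative stand-ins, not the actual patch API; DrmLike[K] is
assumed to be the DRM type from the Spark bindings):

{code}
import org.apache.mahout.sparkbindings.drm.DrmLike

trait NBModel  // stand-in for the trained-model state

object NaiveBayes {
  // pithy name; a longish, defaulted parameter list instead of a class
  def trainNB(observations: DrmLike[Int],
              alphaI: Double = 1.0,
              complementary: Boolean = false): NBModel =
    new NBModel {}  // the actual training math is elided in this sketch
}
{code}

The enclosing package object would then alias NaiveBayes.trainNB as a plain
trainNB, so call sites read like the snippet above.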





was (Author: dlyubimov):
I don't think you meant run() to return Unit. 

Also I am not sure using a class is justified.

In most cases, I would favor dropping classes in favor of functions, albeit 
with a fairly long parameter list populated with default values.

The pattern I am following is (1) to create a pithy and expressive name (such 
as ssvd()) for a function (in this case it could be something like trainNB) 
inside a Scala object (singleton) and then re-export that as a top-level 
package function, so one can say something like 

{code}
import decompositions._
val nbmodel = trainNB(...)
...
{code}


 Port Naive Bayes to the Spark DSL
 -

 Key: MAHOUT-1493
 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 1.0

 Attachments: MAHOUT-1493.patch


 Port our Naive Bayes implementation to the new spark dsl. Shouldn't require 
 more than a few lines of code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL

2014-03-27 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950293#comment-13950293
 ] 

Dmitriy Lyubimov commented on MAHOUT-1493:
--

PS. It's the R naming style. R almost never exposes an API as classes (and, 
frankly, R classes -- even the latest generation -- are an embarrassment compared 
to everything else in existence).

Classes are usually needed if there's state, and we already have that state 
as the Bayes model object, don't we?

 Port Naive Bayes to the Spark DSL
 -

 Key: MAHOUT-1493
 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 1.0

 Attachments: MAHOUT-1493.patch


 Port our Naive Bayes implementation to the new spark dsl. Shouldn't require 
 more than a few lines of code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1489) Interactive linear algebra shell & script processor

2014-03-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov reassigned MAHOUT-1489:


Assignee: Dmitriy Lyubimov

 Interactive linear algebra shell & script processor
 ---

 Key: MAHOUT-1489
 URL: https://issues.apache.org/jira/browse/MAHOUT-1489
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 1.0
Reporter: Saikat Kanjilal
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Build an interactive shell / script processor (just like the Spark shell); something very 
 similar to R's interactive/script-runner mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1489) Interactive linear algebra shell & script processor

2014-03-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948276#comment-13948276
 ] 

Dmitriy Lyubimov commented on MAHOUT-1489:
--

I cannot assign to a non-committer, so I will be watching it with the assumption 
that the patch is coming from Saikat (that was the condition of creating a new JIRA).

 Interactive linear algebra shell & script processor
 ---

 Key: MAHOUT-1489
 URL: https://issues.apache.org/jira/browse/MAHOUT-1489
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 1.0
Reporter: Saikat Kanjilal
 Fix For: 1.0


 Build an interactive shell / script processor (just like the Spark shell); something very 
 similar to R's interactive/script-runner mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-03-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948280#comment-13948280
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

Very good. I guess we need a DSL proposal from someone intimately familiar with 
R data frames. (I guess I qualify for that, but I am probably not going to have 
enough time.)

 Data frame R-like bindings
 --

 Key: MAHOUT-1490
 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
 Project: Mahout
  Issue Type: New Feature
Reporter: Saikat Kanjilal
   Original Estimate: 20h
  Remaining Estimate: 20h

 Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1490) Data frame R-like bindings

2014-03-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov reassigned MAHOUT-1490:


Assignee: Dmitriy Lyubimov

 Data frame R-like bindings
 --

 Key: MAHOUT-1490
 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
 Project: Mahout
  Issue Type: New Feature
Reporter: Saikat Kanjilal
Assignee: Dmitriy Lyubimov
   Original Estimate: 20h
  Remaining Estimate: 20h

 Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1489) Interactive Scala & Spark Bindings Shell & Script processor

2014-03-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1489:
-

Summary: Interactive Scala & Spark Bindings Shell & Script processor  (was: 
Interactive linear algebra shell & script processor)

 Interactive Scala & Spark Bindings Shell & Script processor
 ---

 Key: MAHOUT-1489
 URL: https://issues.apache.org/jira/browse/MAHOUT-1489
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 1.0
Reporter: Saikat Kanjilal
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Build an interactive shell / script processor (just like the Spark shell); something very 
 similar to R's interactive/script-runner mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-23 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944755#comment-13944755
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' 
dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 
values but that seems smallish. Can't say when I'll get to it but it's on my 
list. If someone can jump in quicker--have at it.

@Sebastian, actually matrix squaring is incredibly expensive -- size^1.5 for 
the flops alone (squaring an m x n input takes on the order of m*n^2 flops, 
which is (mn)^1.5 when m ~ n). Did your original version also use matrix 
squaring? How did it fare?

Also, since the flops grow power-law w.r.t. input size (it is a problem for 
ssvd, too), we may need to contemplate a technique that creates finer splits for 
such computations based on input size. It very well may be the case that the 
original hdfs splits turn out to be too large for adequate load 
redistribution.

Technically, it is extremely simple -- we'd just have to insert a physical 
operator tweaking RDD splits via a shuffle-less coalesce(), which also costs 
nothing in Spark. However, I am not sure what would be a sensible API for this -- 
automatic, semi-automatic, cost-based...  

I guess one brainless thing to do is to parameterize drmContext with the desired 
parallelism (~cluster task capacity) and have the optimizer insert physical 
operators that vary the # of partitions and do an automatic shuffle-less coalesce 
if the number is too low; a sketch of such a step follows below.

any thoughts?
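A minimal sketch of such a step, assuming the plain Spark RDD API 
(targetParallelism is a hypothetical knob, not an existing context parameter). 
One caveat: in stock Spark, coalesce(n, shuffle = false) can only merge 
partitions down, so growing the partition count needs shuffle = true:

{code}
import org.apache.spark.rdd.RDD

def adjustParallelism[T](rdd: RDD[T], targetParallelism: Int): RDD[T] = {
  val n = rdd.partitions.length
  if (n > targetParallelism)
    rdd.coalesce(targetParallelism, shuffle = false)  // shuffle-less merge
  else if (n < targetParallelism)
    rdd.coalesce(targetParallelism, shuffle = true)   // finer splits need a shuffle
  else rdd
}
{code}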


 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-23 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944755#comment-13944755
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/24/14 5:10 AM:
---

bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' 
dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 
values but that seems smallish. Can't say when I'll get to it but it's on my 
list. If someone can jump in quicker--have at it.

@Sebastian, actually matrix squaring is incredibly expensive -- size^1.5 for 
the flops alone (squaring an m x n input takes on the order of m*n^2 flops, 
which is (mn)^1.5 when m ~ n). Did your original version also use matrix 
squaring? How did it fare?

Also, since the flops grow power-law w.r.t. input size (it is a problem for 
ssvd, too), we may need to contemplate a technique that creates finer splits for 
such computations based on input size. It very well may be the case that the 
original hdfs splits turn out to be too large for adequate load 
redistribution.

Technically, it is extremely simple -- we'd just have to insert a physical 
operator tweaking RDD splits via a shuffle-less coalesce(), which also costs 
nothing in Spark. However, I am not sure what would be a sensible API for this -- 
automatic, semi-automatic, cost-based...  

I guess one brainless thing to do is to parameterize the optimizer context with 
the desired parallelism (~cluster task capacity) and have the optimizer insert 
physical operators that vary the # of partitions and do an automatic shuffle-less 
coalesce if the number is too low.

any thoughts?



was (Author: dlyubimov):
bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' 
dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 
values but that seems smallish. Can't say when I'll get to it but it's on my 
list. If someone can jump in quicker--have at it.

@Sebastian, actually matrix squaring is incredibly expensive -- size^1.5 for 
the flops alone (squaring an m x n input takes on the order of m*n^2 flops, 
which is (mn)^1.5 when m ~ n). Did your original version also use matrix 
squaring? How did it fare?

Also, since the flops grow power-law w.r.t. input size (it is a problem for 
ssvd, too), we may need to contemplate a technique that creates finer splits for 
such computations based on input size. It very well may be the case that the 
original hdfs splits turn out to be too large for adequate load 
redistribution.

Technically, it is extremely simple -- we'd just have to insert a physical 
operator tweaking RDD splits via a shuffle-less coalesce(), which also costs 
nothing in Spark. However, I am not sure what would be a sensible API for this -- 
automatic, semi-automatic, cost-based...  

I guess one brainless thing to do is to parameterize drmContext with the desired 
parallelism (~cluster task capacity) and have the optimizer insert physical 
operators that vary the # of partitions and do an automatic shuffle-less coalesce 
if the number is too low.

any thoughts?


 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941511#comment-13941511
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

yeah... those views... I think they create at least 2 interim objects... not so 
cool for mass iterations. Oh well.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941517#comment-13941517
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

I have the non-slim A'A. Of course, the slim operator implementation is 
upper-triangular, which cuts the outer-product computation cost in half by 
comparison... A significantly wide A'A, on the other hand, cannot really apply 
the same cut, since it needs to form rows in a distributed way. (A toy sketch 
of the upper-triangular trick follows below.)

Not surprisingly, the slim test takes 17 seconds and the fat one takes 21 
seconds on my fairly ancient computer for squaring a 400x550 matrix (single 
thread). Actually, I expected a somewhat more significant gap.
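
For reference, a minimal single-threaded sketch of the upper-triangular trick 
(purely illustrative; the actual operator uses a vector-backed accumulator, 
here replaced by a packed array):

{code}
// Accumulate the upper triangle of A'A = sum over rows a of (a a'), j >= i only;
// the lower triangle is implied by symmetry, which halves the flops.
def slimAtA(rows: Iterator[Array[Double]], n: Int): Array[Double] = {
  // entry (i, j), j >= i, lives at i*n - i*(i-1)/2 + (j - i) in the packed array
  val acc = new Array[Double](n * (n + 1) / 2)
  for (a <- rows) {
    var i = 0
    while (i < n) {
      if (a(i) != 0.0) {
        val base = i * n - i * (i - 1) / 2 - i
        var j = i
        while (j < n) { acc(base + j) += a(i) * a(j); j += 1 }
      }
      i += 1
    }
  }
  acc
}
{code}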

 I wonder if there's a more interesting way to do this other than forming 
outer-product vertical blocks.

Maybe I need to use square blocks. In that case I can reuse roughly half of 
them -- but then there will be significantly more objects (albeit smaller in 
size), and I will still have to have an extra shuffle operation to form the 
lower-triangular part of the matrix. 

Anyway, I think I will commit what I have.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941520#comment-13941520
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

For that very reason, I almost always use SRM and almost never SM.

What I would really love, probably, is a sparse row-and-column block (a hash
hanging off a hash); this seems like a recurring issue in blocking
calculations such as ALS. SRM almost does that, except it uses a full-size
vector to hang the sparse row vectors off. A toy sketch is below.
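
A toy sketch of that row-and-column-sparse block (plain Scala collections,
purely illustrative):

{code}
import scala.collection.mutable

// row index -> (column index -> value): sparse along both dimensions, unlike
// SparseRowMatrix, which hangs sparse row vectors off a full-size array.
class SparseRowColBlock {
  private val rows = mutable.HashMap.empty[Int, mutable.HashMap[Int, Double]]

  def update(i: Int, j: Int, v: Double): Unit =
    rows.getOrElseUpdate(i, mutable.HashMap.empty[Int, Double])(j) = v

  def apply(i: Int, j: Int): Double =
    rows.get(i).flatMap(_.get(j)).getOrElse(0.0)
}
{code}

With apply/update defined, block(i, j) = v and block(i, j) read and write
without ever materializing empty rows.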




 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941531#comment-13941531
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

No, I think blockify is fine. It probably can run a bit faster than it does, 
but oh well. 

And mapBlock doesn't trigger it (or, rather, it is evaluated lazily; and if the 
previous operator already produced blocks, then blockify is not used). What I 
was saying is along the lines of the A'A computation. There's a structure that is 
used to fuse operators, which is sort of an either-or of the DrmRdd and 
BlockifiedDrmRdd types (a sketch is below). I came to the conclusion that there 
are operators that are an absolute pain to implement on blocks, and there are 
some that would be a pain to implement on row-vector bags. But blocks can be 
presented as row bags via viewing, so conversion to blocks happens only if the 
subsequent operator requires it. What's more, usually a block operator outputs 
blocks as well, and vice versa, so realistically blockify does not happen often at all.
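
A hypothetical sketch of that either-or carrier and the cheap block-to-rows
view (the names here are illustrative; the real Spark-bindings types may differ):

{code}
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{Matrix, Vector}

sealed trait DrmRddInput[K]
case class RowWise[K](rdd: RDD[(K, Vector)]) extends DrmRddInput[K]            // bag of row vectors
case class Blockified[K](rdd: RDD[(Array[K], Matrix)]) extends DrmRddInput[K]  // vertical blocks

// Blocks can be viewed as row vectors cheaply, so the expensive direction
// (blockify) runs only when the next operator truly needs blocks.
def asRowWise[K](in: DrmRddInput[K]): RDD[(K, Vector)] = in match {
  case RowWise(rdd)    => rdd
  case Blockified(rdd) => rdd.flatMap { case (keys, block) =>
    keys.indices.map(i => keys(i) -> block.viewRow(i))
  }
}
{code}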

Another caveat is that one has to be careful with map blocks that have side 
effects on the RDD of origin. Even though Spark says all RDDs are immutable, side 
effects will stay visible to parent RDDs if they are cached as MEMORY_ONLY or 
MEMORY_AND_DISK (i.e. without the mandatory clone-via-serialization in the block 
manager) and are then subsequently used as a source again; an illustration follows below.
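
An illustration of that caveat, assuming a drmRdd: RDD[(Int, Vector)] as in the
sketch above (Vector.assign mutates the Mahout vector in place):

{code}
import org.apache.spark.storage.StorageLevel

val cached = drmRdd.persist(StorageLevel.MEMORY_ONLY)  // blocks kept as live objects

// the side effect below mutates the very objects the cache is holding
val zeroed = cached.map { case (key, vec) => vec.assign(0.0); key -> vec }
zeroed.count()

// a later pass over 'cached' now sees the zeroed vectors: the mutation leaked
// into the parent because no clone-via-serialization ever happened
{code}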

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941541#comment-13941541
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Oh, you mean in the case of sparse row vectors.
 You are probably right. Indeed, there's currently a SparseMatrix there in this 
case. I think it should be a SparseRowMatrix, of course; most of the cases should 
benefit from it. Problem is, like I said, mapBlock doesn't really form it; nor 
does any other physical operator have any knowledge of what formed it. 

It is possible to optimize the entire operator-fusion chain based on the subsequent 
operator's preferred type; that's actually a very neat idea for in-core speed 
optimization, but I have no capacity to pursue this technique at the moment. It 
needs some digestion anyway (at least on my end). It requires experiments with 
in-core operations. At first glance, most non-multiplicative operators 
would be OK with a row-wise matrix, as well as with deblockifying views.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941545#comment-13941545
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

If anything, at least I see a non-negligible speedup in the blockification 
itself, it seems, once I use a row matrix. I think I will commit that. 

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941354#comment-13941354
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

Actually, the non-slim A'A operator is practically A'B without the need for a zip... 
So we are almost done; the biggest work here is the test, I suppose.

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941355#comment-13941355
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Actually, the non-slim A'A operator is practically A'B without the need for a zip... 
So we are almost done; the biggest work here is the test, I suppose.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-18 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939451#comment-13939451
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

That's what I normally do, yes. The scalabindings issue points to a branch on 
GitHub. Then there's the commit-squash method (described in my blog) I do when 
pushing to svn. Hopefully we'll see direct git pushes for Mahout sooner rather 
than later.
However, seeing a combined (squashed) patch is pretty useful too, as opposed 
to tons of individual commits.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-18 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939468#comment-13939468
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/18/14 4:56 PM:
---

@[~ssc] Looking nice.

I guess we still want a non-skinny version of the A'A operator; I may be able to 
look into it.


was (Author: dlyubimov):
[~ssc] Looking nice.

I guess we still want a non-skinny version of the A'A operator; I may be able to 
look into it.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-18 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939468#comment-13939468
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

[~ssc] Looking nice.

I guess we still want a non-skinny version of the A'A operator; I may be able to 
look into it.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-18 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939513#comment-13939513
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--


http://weatheringthrutechdays.blogspot.com/2011/04/git-github-and-committing-to-asf-svn.html


 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937951#comment-13937951
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

I only ever ran Spark code with the HDFS cluster of CDH 4. The MapReduce API is 
irrelevant (which is where most of the 2.0-vs-1.0 differences happen); only HDFS 
matters, since Spark doesn't need an MR cluster. Spark can also run under YARN 
supervision, which would imply 2.0, but I would strongly recommend against it and 
would use Mesos plus ZooKeeper instead.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937954#comment-13937954
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

PS: the spark module has a cdh4 Maven profile.

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: (was: ScalaSparkBindings.pdf)

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: ScalaSparkBindings.pdf

Updating the docs to reflect the latest committed state. 
Brought in distributed and in-core stochastic PCA scripts, colmeans, colsums, 
drm-vector multiplication, more tests, etc. etc.; see the doc.

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938799#comment-13938799
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

What's the best way to share the PDF source? I can put it on the site so committers 
can re-generate it. Otherwise, its source is right now in my GitHub doc branch 
here, and a pull request is definitely a possible way to collaborate too: 
https://github.com/dlyubimov/mahout-commits/tree/ssvd-docs

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938805#comment-13938805
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

1.
{code}
val C = A.t %*% A
{code}

 I don't remember if I actually put in the physical operator for non-skinny A. 
There are two distinct algorithms to deal with it. The skinny one (n <= 5000 or 
something) uses an upper-triangular, vector-backed accumulator to combine stuff 
right in the map. Of course, if the accumulator does not realistically fit in 
memory, then another algorithm has to be plugged in for A-squared. See AtA.scala, 
def at_a_nongraph(). It currently throws UnsupportedOperation (but everything I 
have done so far only uses the slim A'A).

2. When using partial functions with mapBlock, you actually do not have to use 
({...}) but just { }:
{code}
  drmBt = drmBt.mapBlock() {
    case (keys, block) =>
      // ...
      keys -> block
  }
{code}

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark

2014-03-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938810#comment-13938810
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Also, just FYI: much as I love to use single-letter capitals for matrices, it 
turns out Scala does not really accept them in all situations. For example, 
{code}
val (U, V, s) = ssvd(...)
{code}

doesn't compile: in a pattern, a capitalized name is read as a stable identifier 
to match against rather than as a fresh binding (a minimal illustration is below). 

So I ended up using, perhaps verbosely, the drmA and inCoreA notations. 
Perhaps we can agree on what's reasonable.
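
A minimal self-contained illustration of the Scala rule in question:

{code}
val pair = (1, 2)
val (a, b) = pair     // compiles: lowercase names are fresh bindings
// val (A, B) = pair  // does not compile: capitalized names in a pattern are
//                    // read as stable identifiers (constants) to match against
{code}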

 RowSimilarityJob on Spark
 -

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
 Environment: hadoop, spark
Reporter: Pat Ferrel
  Labels: performance
 Fix For: 0.9

 Attachments: MAHOUT-1464.patch


 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
 here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
 with Mahout Spark DRM DSL so a DRM can be used as input. 
 Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
 RSJ on two inputs to calculate the similarity of rows of one DRM with those 
 of another. This cross-similarity has several applications including 
 cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-07 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Affects Version/s: 0.9  (was: 0.8)
            Status: Patch Available  (was: Open)

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-06 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: ScalaSparkBindings.pdf

OK, this is finally done. SSVD is working, notes updated. I will commit it later 
tonight after an additional review for misc stuff.

Please look at the final PDF API, and the source if needed. 

This will also contain a fix for the CholeskyDecomposition bug that always 
reports a degenerate matrix.

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-06 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923389#comment-13923389
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

Most of this code is not distributed-tested. The assumption is that we will have 
to continue working stuff out and gauge the bottlenecks of concrete implementations. 
It is possible that additional tuning parameters will be required, esp. for stuff 
that does blocking etc. 

So it should be marked as evolving. 

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-06 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923389#comment-13923389
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 3/7/14 5:08 AM:
--

Most of this code is not distributed-tested. Unit tests do due diligence and 
ensure matrices are produced with more than a trivial single partition, and I 
also verified some stuff on a live single-node Spark, but I haven't tried any 
significant datasets on a real-life cluster. 

The assumption is that we will have to continue working stuff out and gauge the 
bottlenecks of concrete implementations. It is possible that additional tuning 
parameters will be required, esp. for stuff that does blocking etc. 

So it should be marked as evolving. 


was (Author: dlyubimov):
Most of this code is not distributed-tested. The assumption is that we will have 
to continue working stuff out and gauge the bottlenecks of concrete implementations. 
It is possible that additional tuning parameters will be required, esp. for stuff 
that does blocking etc. 

So it should be marked as evolving. 

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-04 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: (was: ScalaSparkBindings.pdf)

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-04 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: ScalaSparkBindings.pdf

update

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-04 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: (was: ScalaSparkBindings.pdf)

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-03-04 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: ScalaSparkBindings.pdf

@Sebastian (et al.), could you please review, if not the code, then at least 
the API PDF (attached)? At this point I have all the functional components to do 
distributed SSVD in the DSL, so it is really on the verge of commit, but I wouldn't 
want to do that without any review at all (given how relatively big and conceptual 
this thing is).

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Attachment: ScalaSparkBindings.pdf

WIP manual and working notes

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: ScalaSparkBindings.pdf


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM by Spark RDD with support of some basic 
 functionality, perhaps some humble beginning of Cost-based optimizer 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906741#comment-13906741
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

Yeah. I am not sure what they are doing there. Last time I looked at it, MLlib 
did not have any form of weighted ALS. Now this example seems to include 
trainImplicit, which works on the rating matrix only. In the original formulation 
of the implicit feedback problem there were two values, a preference and a 
confidence in that preference. So I am not sure what they do there, since the 
input is obviously one sparse matrix. 

My generalization of the problem includes a formulation where any confidence 
level could be attached to either 0 or 1 as a preference, plus a baseline. I also 
assume that the model may have more than one parameter forming the confidence, 
which requires fitting as well (simply speaking: what is the level of consumption 
if a user clicks on an item vs. adds it to cart, if any, etc.). Similarly, there 
could be different levels of confidence in ignoring stuff, depending on the 
situation. So 0 preferences do not always have to have the baseline confidence 
either.
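
For concreteness, here is the generalized objective the above describes, in 
Hu-Koren-Volinsky notation with ALS-WR-style regularization; the event-confidence 
function f and its parameters theta are my shorthand for this sketch, not anything 
in the patch:

{code}
\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  + \lambda \Big( \sum_u n_u \|x_u\|^2 + \sum_i m_i \|y_i\|^2 \Big),
\qquad p_{ui} \in \{0,1\}, \quad c_{ui} = c_0 + f(\mathrm{events}_{ui};\, \theta)
{code}

Here n_u and m_i are the usual ALS-WR row/column nonzero counts, and fitting 
theta is the extra search this formulation adds.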

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906752#comment-13906752
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

That's a reasonable encoding, I suppose. Good idea.

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1408.
--

Resolution: Won't Fix

Don't see a reason to do anything.

 Distributed cache file matching bug while running SSVD in broadcast mode
 

 Key: MAHOUT-1408
 URL: https://issues.apache.org/jira/browse/MAHOUT-1408
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Angad Singh
Assignee: Dmitriy Lyubimov
Priority: Minor
 Attachments: BtJob.java.patch


 The error is:
 java.lang.IllegalArgumentException: Unexpected file name, unable to deduce 
 partition 
 #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1)
   at java.util.Arrays.mergeSort(Arrays.java:1270)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.sort(Arrays.java:1210)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
   at org.apache.hadoop.mapred.Child.main(Child.java:260)
 The bug is @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java,
  near line 220.
 and  @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
  near line 144.
 SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache 
 will have a particular pattern whereas we have jar files in our distributed 
 cache which causes the above exception.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Fix Version/s: (was: Backlog)
   1.0

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps a humble beginning of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906699#comment-13906699
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

This is now tracked here: 
https://github.com/dlyubimov/mahout-commits/tree/dev-1.0-spark 
(new module spark). 

I have been rewriting certain things anew. 

Concepts: 
(a) Logical operators (including DRM sources) are expressed as the DRMLike trait.
(b) Taking a note from the Spark book, DRM operators (such as %*% or t) form an 
operator lineage. The operator lineage does not get optimized into RDDs until an 
action is applied (Spark terminology used). 

(c) Unlike in Spark, an action doesn't really cause any execution, but rather (1) 
forms the optimized RDD sequence and (2) produces a checkpointed DRM. Consequently, 
a checkpointed DRM has an RDD lineage attached to it, which is also marked for 
caching. Subsequent additional lineages starting out of a checkpointed DRM 
will not be able to optimize beyond this checkpoint.

(d) There's a super-action on a checkpointed DRM -- such as collection or 
persistence to HDFS -- that triggers, if necessary, the optimization checkpoint 
and the Spark action. 

E.g. 

{code}
val A = drmParallelize(...)

// Doesn't do anything yet; gives the operator lineage an opportunity
// to grow further before being optimized.
val squaredA = A.t %*% A

// We may trigger the optimizer, RDD lineage generation and caching
// explicitly by:
squaredA.checkpoint()

// Or, we can call a super-action directly. This will trigger
// checkpoint() implicitly if not yet done.
val inCoreSquaredA = squaredA.collect()
{code}

Generally, I support very few things so far -- I actually dropped all previously 
implemented Bagel algorithms. So in fact I have less support now than in the 0.9 
branch. 

I have Kryo support for Mahout vectors and matrix blocks. 
I have HDFS read/write of Mahout's DRM into the DRMLike trait. 

I have some DSL defined, such as: 
A %*% B 
A %*% inCoreB
inCoreA %*%: B

A.t
inCoreA = A.collect

A.blockify (coalesces split records into an RDD of vertical blocks -- a paradigm 
similar to MLI's MatrixSubmatrix, except I implemented it before MLI was first 
announced :) so no MLI influence here, in fact).
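
To make the slim X'X case above concrete, a rough sketch under my reading of the 
API -- assuming blockify() yields an RDD of (row keys, vertical in-core block) 
pairs and that the R-like in-core operators are in scope; this is not the 
committed implementation:

{code}
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// Each vertical block B contributes its own small k x k Gramian B' %*% B;
// summing the partial Gramians across the RDD yields X'X in-core.
val xtx = A.blockify()
  .map { case (_, block) => block.t %*% block }
  .reduce(_ + _)
{code}

This is exactly why the Gramian is "slim": only k x k partial products ever 
travel through the shuffle.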

So now I need to reimplement what Bagel used to do, plus optimizer rules 
for choosing a distributed algorithm based on cost.

In fact I came to the conclusion there was zero benefit in using Bagel in the 
first place: it just maps all its primitives onto shuffle-and-hash group-by RDD 
operations, so there is no actual operational benefit to using it.

I will probably reconstitute the algorithms in the first iteration using regular 
Spark primitives (groupBy, and cartesian for multiplication blocks).

Once I plug in the missing pieces (e.g. slim matrix multiplication), I bet I will 
be able to fit a distributed SSVD version in 40 lines, just like the in-core one :)

Weighted ALS will still look less elegant because of some missing features in the 
linear algebra. For example, it seems to need sparse block support (i.e. a bunch 
of sparse row or column vectors hanging off a very small hash map, instead of a 
full-size array as in SparseRow(Column)Matrix today), but it will still be mostly 
R-like scripted as far as working with matrix blocks and decompositions goes.

So at this point I'd like to hear input on these ideas and direction, perhaps 
some suggestions. Thanks.


 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps a humble beginning of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906710#comment-13906710
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

A few obvious optimizer rules: 

A.t %*% A is obviously detected as a family of unary algorithms rather than a 
binary multiplication algorithm.

Geometry and non-zero element estimates play a role in the selection of the type 
of algorithm. 

The biggest multiplication via group-by will obviously have to deal with the 
cartesian operator, and will apply to (A * B').

Obvious rewrites: 
A' * B' = (B * A)' (transposition push-up, including elementwise operators too)
(A')' = A (transposition merge)
cost-based grouping: (A * B) * C versus A * (B * C)
special distributed algorithm versions for in-core operands and diagonal matrices
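
Illustratively, such rules are natural to express as pattern-match rewrites over 
the logical operator tree; the case-class names below are made up for this sketch 
and are not the actual operator classes in the branch:

{code}
// Hypothetical logical-operator AST; names are illustrative only.
sealed trait LogOp
case class Leaf(id: String) extends LogOp            // a DRM source
case class OpAt(a: LogOp) extends LogOp              // A'
case class OpAB(a: LogOp, b: LogOp) extends LogOp    // A %*% B
case class OpAtA(a: LogOp) extends LogOp             // the unary A'A family

def rewrite(op: LogOp): LogOp = op match {
  case OpAt(OpAt(a))              => rewrite(a)          // (A')' = A
  case OpAB(OpAt(a), b) if a == b => OpAtA(rewrite(a))   // A' %*% A -> unary A'A
  case OpAB(OpAt(a), OpAt(b))     =>
    OpAt(OpAB(rewrite(b), rewrite(a)))                   // A' %*% B' = (B %*% A)'
  case OpAB(a, b)                 => OpAB(rewrite(a), rewrite(b))
  case other                      => other
}
{code}

A cost model would then choose among physical algorithms (group-by, cartesian, 
in-core broadcast) for whatever operators survive the rewrites.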



 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps a humble beginning of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Fix Version/s: (was: Backlog)
   1.0

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906725#comment-13906725
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1365 at 2/20/14 7:54 AM:
---

Quite possibly it could be. The only thing that I do differently here is the 
merge of the approaches of the implicit feedback and weighted regularization 
papers, but that's minor. See the pdf.


was (Author: dlyubimov):
quite possibly could be. The only thing that i do differently here is the merge 
of approaches of implicit feedback and wieghed regularization paper, but that's 
minor. 

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906725#comment-13906725
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

Quite possibly it could be. The only thing that I do differently here is the 
merge of the approaches of the implicit feedback and weighted regularization 
papers, but that's minor. 

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906729#comment-13906729
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

Oh, and the implicit-feedback paper doesn't generalize the search for confidence 
parameters, of course. I ignore that formulation here completely, but eventually 
there should be an outer procedure to search for the optimum. My particular 
problem involved multiple events with generally unknown confidence weights, 
unlike the original implicit feedback work.
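
As a purely hypothetical sketch of what such an outer procedure could look 
like -- trainALS, buildCPrime and heldOutError are invented names for 
illustration, not the API in this patch:

{code}
// Grid-search the per-event confidence weights against held-out data.
val grid = for {
  wClick <- Seq(1.0, 2.0, 4.0)   // assumed confidence weight of a click
  wCart  <- Seq(4.0, 8.0, 16.0)  // assumed confidence weight of add-to-cart
} yield (wClick, wCart)

val (bestClick, bestCart) = grid.minBy { case (wc, wa) =>
  // buildCPrime would assemble the sparse C' from event counts and weights.
  trainALS(p = P, cPrime = buildCPrime(wc, wa), c0 = 1.0, lambda = 0.065)
    .heldOutError
}
{code}

Anything smarter than a grid (e.g. coordinate descent over the weights) would 
slot into the same shape.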

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1412) Build warning due to multiple Scala versions

2014-02-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13891606#comment-13891606
 ] 

Dmitriy Lyubimov commented on MAHOUT-1412:
--

scalatest has not built, and it seems will not build, artifacts for Scala 2.9.3. 

There are artifacts for 2.9.2 and 2.10. They seem to imply they build only one 
artifact fit for all of 2.9.x. 

Scala is mostly introduced to build a mixed environment of in-core operations 
and Spark, so it tracks Spark's versions of Scala and scalatest. 

Spark just released 0.9.0 and scalatest just released 2.0 -- we can bump to 
these eventually.

 Build warning due to multiple Scala versions
 

 Key: MAHOUT-1412
 URL: https://issues.apache.org/jira/browse/MAHOUT-1412
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Frank Scholten
Priority: Minor

 I see the following build warning:
 22:42:07 [WARNING]  Expected all dependencies to require Scala version: 2.9.3
 22:42:07 [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires 
 scala version: 2.9.3
 22:42:07 [WARNING]  org.scalatest:scalatest_2.9.2:1.9.1 requires scala 
 version: 2.9.2
 22:42:07 [WARNING] Multiple versions of scala libraries detected!
 Which version should we use?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889842#comment-13889842
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

Aha, Spark 0.9.0 with GraphX is finally released. Time to get my hands dirty 
with this a bit, methinks.

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps a humble beginning of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode

2014-01-23 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov reassigned MAHOUT-1408:


Assignee: Dmitriy Lyubimov

 Distributed cache file matching bug while running SSVD in broadcast mode
 

 Key: MAHOUT-1408
 URL: https://issues.apache.org/jira/browse/MAHOUT-1408
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Angad Singh
Assignee: Dmitriy Lyubimov
Priority: Minor
 Attachments: BtJob.java.patch


 The error is:
 java.lang.IllegalArgumentException: Unexpected file name, unable to deduce 
 partition 
 #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1)
   at java.util.Arrays.mergeSort(Arrays.java:1270)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.sort(Arrays.java:1210)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
   at org.apache.hadoop.mapred.Child.main(Child.java:260)
 The bug is @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java,
  near line 220.
 and  @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
  near line 144.
 SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache 
 will have a particular pattern whereas we have jar files in our distributed 
 cache which causes the above exception.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode

2014-01-23 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880200#comment-13880200
 ] 

Dmitriy Lyubimov commented on MAHOUT-1408:
--

I take it you are trying to use the SSVD solver in some sort of embedded mode, 
not via the pure Mahout CLI? 
Still, though, I am not sure why you want to wrestle control over map-reduce 
settings from the SSVD solver in individual MR steps. Additional jars will not 
get there (nor are they needed by the SSVD jobs). Mahout architecture in general, 
and this pipeline in particular, does not assume you get to manipulate individual 
job settings. This pipeline's step legitimately expects to find in the cache the 
files that the SSVD pipeline has put into it. 

I would like to place the burden on you to explain why you think the SSVD 
pipeline should expect someone messing with its MR settings.

Assuming, however, your reasons are valid, this (the BtJob MR) would not be the 
only MR case where the cache is used in the SSVD pipeline, and this patch would 
not be sufficient to do this throughout. 


 Distributed cache file matching bug while running SSVD in broadcast mode
 

 Key: MAHOUT-1408
 URL: https://issues.apache.org/jira/browse/MAHOUT-1408
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Angad Singh
Assignee: Dmitriy Lyubimov
Priority: Minor
 Attachments: BtJob.java.patch


 The error is:
 java.lang.IllegalArgumentException: Unexpected file name, unable to deduce 
 partition 
 #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1)
   at java.util.Arrays.mergeSort(Arrays.java:1270)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.mergeSort(Arrays.java:1281)
   at java.util.Arrays.sort(Arrays.java:1210)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112)
   at 
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94)
   at 
 org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
   at org.apache.hadoop.mapred.Child.main(Child.java:260)
 The bug is @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java,
  near line 220.
 and  @ 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
  near line 144.
 SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache 
 will have a particular pattern whereas we have jar files in our distributed 
 cache which causes the above exception.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1397:
-

Resolution: Invalid
Status: Resolved  (was: Patch Available)

 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
Assignee: Dmitriy Lyubimov
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13876145#comment-13876145
 ] 

Dmitriy Lyubimov commented on MAHOUT-1397:
--

On a side note -- I spent more than a decade with Eclipse. It was the Scala and 
Maven support in Eclipse (or, rather, the lack thereof) that finally forced my 
hand to switch.

 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
Assignee: Dmitriy Lyubimov
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875085#comment-13875085
 ] 

Dmitriy Lyubimov commented on MAHOUT-1397:
--

You sure? I trust what IDEA prompts me with. OK, let me check.

 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875089#comment-13875089
 ] 

Dmitriy Lyubimov commented on MAHOUT-1397:
--

Hm. Either IntelliJ is wrong, or you.

 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-17 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov reassigned MAHOUT-1397:


Assignee: Dmitriy Lyubimov

 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
Assignee: Dmitriy Lyubimov
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable

2014-01-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875091#comment-13875091
 ] 

Dmitriy Lyubimov commented on MAHOUT-1397:
--

also: http://scala-tools.org/mvnsites/maven-scala-plugin/usage.html

{code:title=correct usage example}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <scalaVersion>${scala.version}</scalaVersion>
  </configuration>
</plugin>
{code}


Sorry, I think you need to try to be more convincing.

-d




 mahaout-math-scala/pom.xml not readable
 ---

 Key: MAHOUT-1397
 URL: https://issues.apache.org/jira/browse/MAHOUT-1397
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Windows 7 Professional 64 bit
 Eclipse:
 Version: Kepler Service Release 1
 Build id: 20130919-0819
 maven 3.0.5
 Java: jdk1.6.0_45
Reporter: Maruf Aytekin
  Labels: maven
 Fix For: 1.0


 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
 {code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
 {code}
 Error displayed:
 {quote}
 Multiple annotations found at this line:
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: 
 compile)
   - Plugin execution not covered by lifecycle configuration: 
 org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, 
 phase: test-compile)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1305) Rework the wiki

2014-01-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861764#comment-13861764
 ] 

Dmitriy Lyubimov commented on MAHOUT-1305:
--

By all means I agree.

I also still owe the migration of the Scala bindings for Mahout's math per M-1297
(I guess last time I was thrown off by some CMS issues).


On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA)



 Rework the wiki
 ---

 Key: MAHOUT-1305
 URL: https://issues.apache.org/jira/browse/MAHOUT-1305
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 0.9

 Attachments: MAHOUT-221213-1315-15716.pdf


 We should think about completely redoing our wiki. At the moment, we're 
 listing lots of algorithms that we either never implemented or already 
 removed. I also have the impression that a lot of stuff is outdated.
 It would be awesome if we had an up-to-date documentation of the code with 
 instructions on how to get into using mahout quickly.
 We should also have examples for all our 3 C's.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1305) Rework the wiki

2014-01-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861764#comment-13861764
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1305 at 1/3/14 6:50 PM:
--

By all means I agree.

I also still owe migration for scala bindings of Mahout's math per M-1297
(I guess last time i was thrown off by some CMS issues)


On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA)




was (Author: dlyubimov):
By all means I agree.

I also still owe migration for scala bindings of Mahout's math per M-1297
(I guess last time i was thrown off by same CMS issues


On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA)



 Rework the wiki
 ---

 Key: MAHOUT-1305
 URL: https://issues.apache.org/jira/browse/MAHOUT-1305
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 0.9

 Attachments: MAHOUT-221213-1315-15716.pdf


 We should think about completely redoing our wiki. At the moment, we're 
 listing lots of algorithms that we either never implemented or already 
 removed. I also have the impression that a lot of stuff is outdated.
 It would be awesome if we had an up-to-date documentation of the code with 
 instructions on how to get into using mahout quickly.
 We should also have examples for all our 3 C's.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1305) Rework the wiki

2014-01-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860342#comment-13860342
 ] 

Dmitriy Lyubimov commented on MAHOUT-1305:
--

The SSVD pages 
https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
along with all attachments and references must be retained. I spent a lot of 
time writing instructions and explanations there.

In fact, it is the flagship method now for dimensionality reduction, PCA and 
LSA-like problems. I inserted references to this method, as a first thing to try 
ahead of Lanczos, into the PCA, dimensionality reduction and SVD pages, and I see 
these references are also gone now in the public version.

 Rework the wiki
 ---

 Key: MAHOUT-1305
 URL: https://issues.apache.org/jira/browse/MAHOUT-1305
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 0.9

 Attachments: MAHOUT-221213-1315-15716.pdf


 We should think about completely redoing our wiki. At the moment, we're 
 listing lots of algorithms that we either never implemented or already 
 removed. I also have the impression that a lot of stuff is outdated.
 It would be awesome if we had an up-to-date documentation of the code with 
 instructions on how to get into using mahout quickly.
 We should also have examples for all our 3 C's.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-27 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-27 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

Updating; fixed minor errors in the pdf doc.

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Description: 
Given preference P and confidence C distributed sparse matrices, compute ALS-WR 
solution for implicit feedback (Spark Bagel version).

Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
build the C matrix), with a parameterized test for convergence.

The computational scheme follows the ALS-WR method (which should be slightly 
more efficient for sparser inputs). 

The best performance will be achieved if non-sparse anomalies are prefiltered 
(eliminated), such as an anomalously active user who doesn't represent a 
typical user anyway.

the work is going here 
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting 
away our (A1) implementation so there are a few issues associated with that.

  was:
Given preference P and confidence C distributed sparse matrices, compute ALS-WR 
solution for implicit feedback (Spark Bagel version).

Following Hu-Koren-Volynsky method (stripping off any concrete methodology to 
build C matrix), with parameterized test for convergence.

The computational scheme is followsing ALS-WR method (which should be slightly 
more efficient for sparser inputs). 

The best performance will be achieved if non-sparse anomalies prefilitered 
(eliminated) (such as an anomalously active user which doesn't represent 
typical user anyway).

the work is going here 
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting 
away our (A1) implementation so there are a few issues associated with that.


 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832110#comment-13832110
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1365 at 11/26/13 9:15 PM:


Oh, one thing to mention is that the confidence matrix C is not sparse per se. 
But if there's a base confidence c_0 such that subtracting it from each element 
of C turns it into a sparse matrix C', then we can use that matrix as an input 
(along with the c_0 parameter). This is further clarified in the attachment 
(which is basically just a conspectus of both papers, for my own sake). See 
attached.


was (Author: dlyubimov):
Oh. One thing to mention is that the confidence matrix C is not sparse per se. 
but if there's a base confidence c_0 such that subtracting it from each element 
of C turns it into sparse matrix C', then we can use that matrix as an input 
(along with c_0 parameter). This is further clarified in the attachment (which 
is basically just a conspect of both papers for my own sake.) See attached.
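
For reference, the computational payoff of that split is the standard identity 
from the implicit-feedback paper: with Y the item-factor matrix, C_u the diagonal 
confidence matrix of user u, and C'_u = C_u - c_0 I its sparse part,

{code}
Y^\top C_u Y \;=\; c_0\, Y^\top Y \;+\; Y^\top C'_u Y
{code}

so the dense c_0 Y'Y term is computed once per sweep, while only the sparse C'_u 
entries contribute per-user work.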

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to 
 build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)

2013-11-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13833264#comment-13833264
 ] 

Dmitriy Lyubimov commented on MAHOUT-1297:
--

Yes. Spark 0.8 still has a Scala 2.9.3 dependency. Since this issue is really a 
dependency crutch for the Spark-distributed issues (at this point, anyway), hence 
the version choice.

 New module for linear algebra scala DSL (in-core operators support only to 
 start with)
 --

 Key: MAHOUT-1297
 URL: https://issues.apache.org/jira/browse/MAHOUT-1297
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 0.9


 See initial set of in-core R-like operations here 
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html.
 A separate DSL for matlab-like syntax is being developed. The differences 
 here are about replacing R-like %*% with * and finding another way to express 
 elementwise * and /.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MAHOUT-1363) Rebase packages in mahout-scala

2013-11-25 Thread Dmitriy Lyubimov (JIRA)
Dmitriy Lyubimov created MAHOUT-1363:


 Summary: Rebase packages in mahout-scala
 Key: MAHOUT-1363
 URL: https://issues.apache.org/jira/browse/MAHOUT-1363
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Priority: Minor
 Fix For: 0.9


It has occurred to me that in my commit of the mahout-scala stuff, I haven't 
rebased the packages onto o.a.m... as had been discussed. 

It has also occurred to me that putting that stuff into o.a.m.math in this case 
may create unwelcome interference between the Java and Scala stuff. 

So I am moving the Scala math DSL stuff into the o.a.m.math.scalabindings package. 
It is awfully awkward compared to the plain mahout.math Scala-style package it 
bears now, but I guess modern IDE tools make it no problem to import. 





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1363) Rebase packages in mahout-scala

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1363:
-

Status: Patch Available  (was: Open)

 Rebase packages in mahout-scala
 ---

 Key: MAHOUT-1363
 URL: https://issues.apache.org/jira/browse/MAHOUT-1363
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Priority: Minor
 Fix For: 0.9


 It has occurred to me that in my commit of the mahout-scala stuff, I haven't 
 rebased packages onto o.a.m... as has been discussed. 
 It has also occurred to me that putting that stuff into o.a.m.math in this 
 case may create unwelcome interference between the Java and Scala stuff. 
 So I am moving the Scala math DSL stuff into the o.a.m.math.scalabindings 
 package. It is awfully awkward compared to the plain mahout.math Scala-style 
 package it bears now, but I guess modern IDE tools make it no problem to import.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1363) Rebase packages in mahout-scala

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1363:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Rebase packages in mahout-scala
 ---

 Key: MAHOUT-1363
 URL: https://issues.apache.org/jira/browse/MAHOUT-1363
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Priority: Minor
 Fix For: 0.9


 It has occurred to me that in my commit of the mahout-scala stuff, I haven't 
 rebased packages onto o.a.m... as has been discussed. 
 It has also occurred to me that putting that stuff into o.a.m.math in this 
 case may create unwelcome interference between the Java and Scala stuff. 
 So I am moving the Scala math DSL stuff into the o.a.m.math.scalabindings 
 package. It is awfully awkward compared to the plain mahout.math Scala-style 
 package it bears now, but I guess modern IDE tools make it no problem to import.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)
Dmitriy Lyubimov created MAHOUT-1365:


 Summary: Weighted ALS-WR iterator for Spark
 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


Given preference P and confidence C distributed sparse matrices, compute the 
ALS-WR solution for implicit feedback (Spark Bagel version).

Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
to build the C matrix), with a parameterized test for convergence.

The computational scheme follows the ALS-WR method (which should be slightly 
more efficient for sparser inputs). 

The best performance will be achieved if non-sparse anomalies are prefiltered 
(eliminated), such as an anomalously active user who doesn't represent a 
typical user anyway.

The work is going on here: 
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting 
away our (A1) implementation, so there are a few issues associated with that.
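
For orientation, the per-user solve in the Hu-Koren-Volinsky formulation, 
written with the base-confidence split discussed in the comments (a gloss of 
mine, not part of the issue text):

{code}
x_u = (Y' C_u Y + lambda * I)^-1 * Y' C_u p(u)

// with C_u = c_0 * I + C'_u, the expensive Gram term decomposes as
Y' C_u Y = c_0 * (Y' Y) + Y' C'_u Y
// Y'Y is computed once per sweep; Y' C'_u Y touches only the nonzeros of the
// sparse C'_u, which is what keeps the weighted iteration tractable.
{code}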



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

Oh, the confidence matrix C is not sparse per se, but if there's a base 
confidence c_0 such that subtracting it from each element of C turns it into 
a sparse matrix C', then we can use that matrix as an input (along with the 
c_0 parameter). This is further clarified in the attachment (which is 
basically just a synopsis of both papers, for my own sake). See attached.

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832110#comment-13832110
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1365 at 11/26/13 12:26 AM:
-

Oh. One thing to mention is that the confidence matrix C is not sparse per se, 
but if there's a base confidence c_0 such that subtracting it from each element 
of C turns it into a sparse matrix C', then we can use that matrix as an input 
(along with the c_0 parameter). This is further clarified in the attachment 
(which is basically just a synopsis of both papers, for my own sake). See attached.


was (Author: dlyubimov):
Oh, the confidence matrix C is not sparse per se, but if there's a base 
confidence c_0 such that subtracting it from each element of C turns it into 
a sparse matrix C', then we can use that matrix as an input (along with the 
c_0 parameter). This is further clarified in the attachment (which is 
basically just a synopsis of both papers, for my own sake). See attached.

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.lyx

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.lyx)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Attachment: (was: distributed-als-with-confidence.pdf)

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-25 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832148#comment-13832148
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

There's obviously some stuff that needs tidying up for the sake of the public. 
Some of it (like the RMSE function) looks outwardly cryptic even to me, now 
that some time has passed since I did this.

 Weighted ALS-WR iterator for Spark
 --

 Key: MAHOUT-1365
 URL: https://issues.apache.org/jira/browse/MAHOUT-1365
 Project: Mahout
  Issue Type: Task
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog

 Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute the 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following the Hu-Koren-Volinsky method (stripping off any concrete 
 methodology to build the C matrix), with a parameterized test for convergence.
 The computational scheme follows the ALS-WR method (which should be slightly 
 more efficient for sparser inputs). 
 The best performance will be achieved if non-sparse anomalies are prefiltered 
 (eliminated), such as an anomalously active user who doesn't represent a 
 typical user anyway.
 The work is going on here: 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation, so there are a few issues associated 
 with that.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1361) Online algorithm for computing accurate Quantiles using 1-D clustering

2013-11-22 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13830479#comment-13830479
 ] 

Dmitriy Lyubimov commented on MAHOUT-1361:
--

Ted,

It's my understanding the current code works on double values (integers).

Do you think it is possible to adapt it to a lexicographical order of unlimited 
values? Thank you.

 Online algorithm for computing accurate Quantiles using 1-D clustering
 --

 Key: MAHOUT-1361
 URL: https://issues.apache.org/jira/browse/MAHOUT-1361
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.9
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.9

 Attachments: MAHOUT-1361.patch


 Implementation of Ted Dunning's paper and initial work on this subject. See 
 https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
  for the paper.
 An on-line algorithm for computing approximations of rank-based statistics 
 that allows controllable accuracy. This algorithm can also be used to compute 
 hybrid statistics such as trimmed means in addition to computing arbitrary 
 quantiles.
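
A hedged usage sketch against the t-digest code linked above (class and method 
names are taken from that repo and should be treated as assumptions):

{code}
import com.tdunning.math.stats.TDigest

val digest  = TDigest.createDigest(100.0)  // compression: accuracy/size knob
val samples = Seq(1.0, 2.0, 5.0, 8.0, 13.0)
samples.foreach(x => digest.add(x))        // one online pass over the stream
val median  = digest.quantile(0.5)         // interpolated rank statistic
val p99     = digest.quantile(0.99)
{code}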



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1361) Online algorithm for computing accurate Quantiles using 1-D clustering

2013-11-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828249#comment-13828249
 ] 

Dmitriy Lyubimov commented on MAHOUT-1361:
--

Interesting. 

I've been using count-min sketch quantiles with fairly OK results. How does it 
compare in effort/precision to count-min sketch and similar sketch-like stuff?

 Online algorithm for computing accurate Quantiles using 1-D clustering
 --

 Key: MAHOUT-1361
 URL: https://issues.apache.org/jira/browse/MAHOUT-1361
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.9
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.9

 Attachments: MAHOUT-1361.patch


 Implementation of Ted Dunning's paper and initial work on this subject. See 
 https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
  for the paper.
 An on-line algorithm for computing approximations of rank-based statistics 
 that allows controllable accuracy. This algorithm can also be used to compute 
 hybrid statistics such as trimmed means in addition to computing arbitrary 
 quantiles.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1317) Clarify some of the messages in Preconditions.checkArgument

2013-11-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828276#comment-13828276
 ] 

Dmitriy Lyubimov commented on MAHOUT-1317:
--

Seems useful. 

I think I saw inconsistent indentation here and there, but it seems it is 
notoriously difficult to agree on things like function parameter indentation 
style, etc.

 Clarify some of the messages in Preconditions.checkArgument
 ---

 Key: MAHOUT-1317
 URL: https://issues.apache.org/jira/browse/MAHOUT-1317
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: BFL
Assignee: Sebastian Schelter
Priority: Minor
 Fix For: 0.9

 Attachments: MAHOUT-1317.patch


 In experimenting with things, I was getting some errors from 
 RowSimilarityJob that, in looking at the source, I realized were a little 
 incomplete as to what the true issue was. In this case, they were of the 
 form:
 Preconditions.checkArgument(maxSimilaritiesPerRow > 0, "Incorrect maximum 
 number of similarities per row!");
 Here, it is known that the actual issue is that the parameter is zero 
 (or negative), not just that it's incorrect, and a (trivial) change to the 
 error message might save some folks some time... especially newbies like 
 myself.
 A quick grep of the code showed a few more cases like that across the code 
 base that would be (apparently) easy to fix and maybe save folks time when 
 they get the relevant error.
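
A hedged sketch of the kind of message change proposed (Scala syntax for 
brevity; the Java call in RowSimilarityJob is analogous, using Guava's 
checkArgument(boolean, String, Object...) template overload):

{code}
import com.google.common.base.Preconditions

val maxSimilaritiesPerRow = 0 // an offending value, for illustration

// Before: says only that the value is "incorrect".
//   Preconditions.checkArgument(maxSimilaritiesPerRow > 0,
//     "Incorrect maximum number of similarities per row!")

// After: states the constraint and echoes the offending value
// (throws IllegalArgumentException with the clearer message).
Preconditions.checkArgument(maxSimilaritiesPerRow > 0,
  "Maximum number of similarities per row must be greater than 0, got %s",
  Integer.valueOf(maxSimilaritiesPerRow))
{code}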



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters over 
the user's preference. Suppose we want to explore what's worth what. The 
natural way to do it is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover paraboloid properties along some parameter axes, in which case it 
would mean we got the preference wrong, so we flip the preference mapping 
(i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) and 
re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy
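
A hedged sketch of what "fit the data to paraboloids" could look like along one 
parameter axis: plain least squares on (1, x, x^2) features via the 3x3 normal 
equations (everything here is illustrative, not from the branch):

{code}
// Fit y ~ a + b*x + c*x^2; a negative c caps the response (a maximum exists),
// while a positive c along an axis is the "flip the preference mapping"
// signal discussed above.
def fitParabola(xs: Array[Double], ys: Array[Double]): (Double, Double, Double) = {
  require(xs.length == ys.length && xs.length >= 3)
  val n = xs.length.toDouble
  def powSum(k: Int) = xs.map(math.pow(_, k)).sum
  val (s1, s2, s3, s4) = (powSum(1), powSum(2), powSum(3), powSum(4))
  val t = Array(
    ys.sum,
    xs.zip(ys).map { case (x, y) => x * y }.sum,
    xs.zip(ys).map { case (x, y) => x * x * y }.sum)
  val m = Array(Array(n, s1, s2), Array(s1, s2, s3), Array(s2, s3, s4))
  def det(a: Array[Array[Double]]): Double =
    a(0)(0) * (a(1)(1) * a(2)(2) - a(1)(2) * a(2)(1)) -
      a(0)(1) * (a(1)(0) * a(2)(2) - a(1)(2) * a(2)(0)) +
      a(0)(2) * (a(1)(0) * a(2)(1) - a(1)(1) * a(2)(0))
  val d = det(m)
  def cramer(i: Int): Double = {              // Cramer's rule, column i
    val a = m.map(_.clone)
    for (r <- 0 until 3) a(r)(i) = t(r)
    det(a) / d
  }
  (cramer(0), cramer(1), cramer(2))           // (a, b, c)
}
{code}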

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps some humble beginnings of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...
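
A hedged sketch of the intended DSL surface for a few of the items above 
(package names follow the sparkbindings layout discussed in this thread; the 
reader helper and collect() are assumptions, not settled API):

{code}
import org.apache.mahout.sparkbindings._
import org.apache.mahout.sparkbindings.drm._
import RLikeDrmOps._

// assumes an implicit Mahout distributed (Spark) context is in scope
val drmA = drmDfsRead("/data/A")     // (4) read a Mahout DRM off HDFS (hypothetical helper)
val drmB = drmDfsRead("/data/B")
val ata  = (drmA.t %*% drmA).collect // (2) slim A'A, gathered in-core
val atb  = drmA.t %*% drmB           // (5) distributed A'B, lazy plan
{code}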



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:21 PM:


https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters over 
the user's preference. Suppose we want to explore what's worth what. The 
natural way to do it is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover hyperbolic paraboloid properties along some parameter axes, in 
which case it would mean we got the preference wrong, so we flip the preference 
mapping (i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) 
and re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy


was (Author: dlyubimov):
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters over 
the user's preference. Suppose we want to explore what's worth what. The 
natural way to do it is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover paraboloid properties along some parameter axes, in which case it 
would mean we got the preference wrong, so we flip the preference mapping 
(i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) and 
re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346

[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820483#comment-13820483
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

P.S. I am kind of dubious that a step-recorded search would be of sufficient 
efficiency either. First, we should not assume we are running on a good convex 
landscape. Second, I assume a step-recorded search may take fairly long.

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps some humble beginnings of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:28 PM:


https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters to form 
confidence of the user's preference. I.e., it is reasonable to assume that, 
e.g., since every transaction is preceded by add2cart, add2cart signifies a 
positive preference in general (we are just far less confident about that). 
Then again, an abandoned cart may also signify a negative preference, or 
nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover hyperbolic paraboloid properties along some parameter axes, in 
which case it would mean we got the preference wrong, so we flip the preference 
mapping (i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) 
and re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy


was (Author: dlyubimov):
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters over 
the user's preference. Suppose we want to explore what's worth what. The 
natural way to do it is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover hyperbolic paraboloid properties along some parameter axes, in 
which case it would mean we got the preference wrong, so we flip the preference 
mapping (i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) 
and re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here 

[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:31 PM:


https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters to form 
confidence of the user's preference. I.e., it is reasonable to assume that, 
e.g., since every transaction is preceded by add2cart, add2cart signifies a 
positive preference in general (we are just far less confident about that). 
Then again, an abandoned cart may also signify a negative preference, or 
nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation. Posing such a problem presents a whole 
new look at Big Data ML problems: now we are using distributed processing not 
just because the input might be so big, but also because we have a lot of 
parameter-space exploration to do (even if the one-iteration problem is not so 
big), and thus produce more interesting analytical results.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough. Of course we 
may discover hyperbolic paraboloid properties along some parameter axes, in 
which case it would mean we got the preference wrong, so we flip the preference 
mapping (i.e., click => (P=1, C=0.5) would flip into click => (P=0, C=0...)) 
and re-validate again. This is kind of a multidimensional variation of the 
one-parameter second-degree polynomial fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy


was (Author: dlyubimov):
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters to form 
confidence of the user's preference. I.e., it is reasonable to assume that, 
e.g., since every transaction is preceded by add2cart, add2cart signifies a 
positive preference in general (we are just far less confident about that). 
Then again, an abandoned cart may also signify a negative preference, or 
nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation.

However, since there are many parameters, the task becomes fairly less 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe 

[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:50 PM:


https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters to form 
confidence of the user's preference. I.e., it is reasonable to assume that, 
e.g., since every transaction is preceded by add2cart, add2cart signifies a 
positive preference in general (we are just far less confident about that). 
Then again, an abandoned cart may also signify a negative preference, or 
nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation. Posing such a problem presents a whole 
new look at Big Data ML problems: now we are using distributed processing not 
just because the input might be so big, but also because we have a lot of 
parameter-space exploration to do (even if the one-iteration problem is not so 
big), and thus produce more interesting analytical results.

However, since there are many parameters, the task becomes fairly more 
interesting. Since there is not so much test data (we still should assume we 
will have just a handful of cross-validation runs), various online convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e., second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough to get a 
general sense of where the global maximum may be, even on inputs of a fairly 
small size. Of course we may discover hyperbolic paraboloid properties along 
some parameter axes, in which case it would mean we got the preference wrong, 
so we flip the preference mapping (i.e., click => (P=1, C=0.5) would flip into 
click => (P=0, C=0...)) and re-validate again. This is kind of a 
multidimensional variation of the one-parameter second-degree polynomial 
fitting that Raphael referred to once. 

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's something better to 
automate that search? 

Thanks. 
-Dmitriy


was (Author: dlyubimov):
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved). 

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. 
A common problem I have: suppose you take the implicit feedback approach. Then 
you reformulate it in terms of preference (P) and confidence (C) inputs. The 
original paper speaks of a specific scheme of forming C that includes one 
parameter they want to fit. 

A more interesting question is, what if we have more than one parameter? I.e., 
what if we have a bunch of user behaviors, say an item search, browse, click, 
add2cart, and finally, acquisition. That's a whole bunch of parameters to form 
confidence of the user's preference. I.e., it is reasonable to assume that, 
e.g., since every transaction is preceded by add2cart, add2cart signifies a 
positive preference in general (we are just far less confident about that). 
Then again, an abandoned cart may also signify a negative preference, or 
nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation. Posing such a problem presents a whole 
new look at Big Data ML problems: now we are using distributed processing not 
just because the input might be so big, but also because we have a lot of 
parameter-space exploration to do (even 

[jira] [Updated] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1297:
-

Status: Patch Available  (was: Open)

 New module for linear algebra scala DSL (in-core operators support only to 
 start with)
 --

 Key: MAHOUT-1297
 URL: https://issues.apache.org/jira/browse/MAHOUT-1297
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 0.9


 See the initial set of in-core R-like operations here: 
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html.
 A separate DSL for matlab-like syntax is being developed. The differences 
 there are about replacing the R-like %*% with * and finding another way to 
 express elementwise * and /.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2013-11-12 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820705#comment-13820705
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

Can that context be part of Mahout? Or would that be way off?

 Spark Bindings (DRM)
 

 Key: MAHOUT-1346
 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: Backlog


 Spark bindings for Mahout DRM. 
 DRM DSL. 
 Disclaimer. This will all be experimental at this point.
 The idea is to wrap DRM in a Spark RDD with support for some basic 
 functionality, perhaps some humble beginnings of a cost-based optimizer: 
 (0) Spark serialization support for Vector, Matrix 
 (1) Bagel transposition 
 (2) slim X'X
 (2a) not-so-slim X'X
 (3) blockify() (compose RDD containing vertical blocks of original input)
 (4) read/write Mahout DRM off HDFS
 (5) A'B
 ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (MAHOUT-1299) Add optimized versions of timesLeft(), timesRight() to SparseRow~,SparseColMatrices and binary times() operation in general

2013-11-05 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1299.
--

Resolution: Won't Fix

Probably needs to be considered in the broader light of matrix-related 
optimizations.

 Add optimized versions of timesLeft(), timesRight() to 
 SparseRow~,SparseColMatrices and binary times() operation in general
 ---

 Key: MAHOUT-1299
 URL: https://issues.apache.org/jira/browse/MAHOUT-1299
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Priority: Minor
 Fix For: 0.9






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)

2013-11-05 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814337#comment-13814337
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1297 at 11/5/13 10:31 PM:


Separating this into its own branch from the Scala head work. 

It is now tracked in 
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1297

I will be committing this to the trunk within a week.


was (Author: dlyubimov):
Separating this into its own branch from the Scala head work. 

It is now tracked in 
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1297

I will be committing this to the trunk in the next week.

 New module for linear algebra scala DSL (in-core operators support only to 
 start with)
 --

 Key: MAHOUT-1297
 URL: https://issues.apache.org/jira/browse/MAHOUT-1297
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.8
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 0.9


 See the initial set of in-core R-like operations here: 
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html.
 A separate DSL for matlab-like syntax is being developed. The differences 
 there are about replacing the R-like %*% with * and finding another way to 
 express elementwise * and /.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

