[jira] [Commented] (MAHOUT-1489) Interactive Scala Spark Bindings Shell Script processor
[ https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950290#comment-13950290 ] Dmitriy Lyubimov commented on MAHOUT-1489: -- yeah. Realistically, functionality-wise there is not that much we need to add here. It is the basic Spark shell + (1) the Mahout classpath of mahout-spark and its transitives added in addition to the Spark stuff; (2) importing our standard things automatically (i.e. o.a.m.sparkbindings._, o.a.m.sparkbindings.drm._, RLikeDrmOps._ etc. per the manual -- make that a default package-import list that is easy to add to as we add e.g. the data frames dsl). This is not that much; no fundamental hacks are required. In fact, i have done (2)-like things a lot with the standard scala interpreter. In our case we of course cannot use the standard scala interpreter, because we need Spark to sync whatever new closures we put into the script with the backend for us. But we can probably just inherit from the Spark interpreter and then modify its automatic imports. The classpath issues should be handled by the mahout.sh script. Interactive Scala Spark Bindings Shell Script processor --- Key: MAHOUT-1489 URL: https://issues.apache.org/jira/browse/MAHOUT-1489 Project: Mahout Issue Type: New Feature Affects Versions: 1.0 Reporter: Saikat Kanjilal Assignee: Dmitriy Lyubimov Fix For: 1.0 Build an interactive shell /scripting (just like spark shell). Something very similar in R interactive/script runner mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
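The automatic-import idea in (2) amounts to feeding a fixed list of import statements to the inherited interpreter at startup. A minimal sketch of that list, assuming the package names from the comment; the object name and the `asCommands` helper are hypothetical, not the final shell's API:

```scala
// Hypothetical sketch: the standard imports a Mahout Spark shell could issue
// automatically at startup. The concrete list is an assumption based on the
// comment above; a Spark-interpreter subclass would interpret each command
// on initialization.
object MahoutShellImports {
  val defaultImports: Seq[String] = Seq(
    "org.apache.mahout.sparkbindings._",
    "org.apache.mahout.sparkbindings.drm._",
    "org.apache.mahout.math.drm.RLikeDrmOps._"
  )

  // Render as interpreter commands, one per import, easy to extend later
  // (e.g. when a data frames dsl lands).
  def asCommands: Seq[String] = defaultImports.map("import " + _)
}
```

Keeping the list in one place makes "easy to add to" concrete: adding a data-frames DSL would be one more string, not another interpreter patch.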
[jira] [Comment Edited] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950283#comment-13950283 ] Dmitriy Lyubimov edited comment on MAHOUT-1493 at 3/28/14 2:30 AM: --- I don't think you meant run() to return Unit. Also I am not sure using a class is justified. In most cases, i would favor dropping classes in favor of functions, albeit with a fairly long parameter list populated with default values. The pattern i am following is to create a pithy and expressive name (such as ssvd()) for a function (in this case it could be something like trainNB) inside a scala object (singleton) and then re-translate that as a top-level package function so one can say something like {code} import decompositions._ val nbmodel = trainNB(...) ... {code} was (Author: dlyubimov): I don't think you meant run() to return Unit. Also I am not sure using a class is justified. In most cases, i would favor dropping classes in favor of functions, albeit with a fairly long parameter list populated with default values. The pattern i am following is (1) to create a pithy and expressive name (such as ssvd()) for a function (in this case it could be something like trainNB) inside a scala object (singleton) and then re-translate that as a top-level package function so one can say something like {code} import decompositions._ val nbmodel = trainNB(...) ... {code} Port Naive Bayes to the Spark DSL - Key: MAHOUT-1493 URL: https://issues.apache.org/jira/browse/MAHOUT-1493 Project: Mahout Issue Type: Bug Components: Classification Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1493.patch Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code. -- This message was sent by Atlassian JIRA (v6.2#6252)
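The singleton-plus-re-exported-function pattern described above can be sketched as follows; NBModel, the parameter list, and the object names are illustrative assumptions, not the actual Mahout API:

```scala
// Hedged sketch of the pattern: a singleton object holds the training
// function with a long-but-defaulted parameter list; a short forwarding
// def re-exports it so call sites stay terse and R-like.
case class NBModel(alpha: Double, complementary: Boolean)

object NBTrainer {
  // hypothetical parameters; defaults spare the caller from most of them
  def trainNB(alpha: Double = 1.0, complementary: Boolean = false): NBModel =
    NBModel(alpha, complementary)
}

object decompositions {
  // re-translated as a "top-level" function (a package object would serve
  // the same purpose in a real package)
  def trainNB(alpha: Double = 1.0, complementary: Boolean = false): NBModel =
    NBTrainer.trainNB(alpha, complementary)
}

object Example {
  import decompositions._
  val nbmodel = trainNB()            // all defaults
  val tuned   = trainNB(alpha = 0.5) // override one parameter by name
}
```

Named arguments plus defaults give the "long parameter list populated with default values" ergonomics without a builder class or any mutable state.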
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950293#comment-13950293 ] Dmitriy Lyubimov commented on MAHOUT-1493: -- PS. it's an R naming style. R almost never exposes an api as classes (and, frankly, R classes -- even the latest generation -- are an embarrassment compared to everything else in existence). Classes are usually needed if there's state, and we already have that state as the bayes model object, don't we? Port Naive Bayes to the Spark DSL - Key: MAHOUT-1493 URL: https://issues.apache.org/jira/browse/MAHOUT-1493 Project: Mahout Issue Type: Bug Components: Classification Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1493.patch Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAHOUT-1489) Interactive linear algebra shell script processor
[ https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov reassigned MAHOUT-1489: Assignee: Dmitriy Lyubimov Interactive linear algebra shell script processor --- Key: MAHOUT-1489 URL: https://issues.apache.org/jira/browse/MAHOUT-1489 Project: Mahout Issue Type: New Feature Affects Versions: 1.0 Reporter: Saikat Kanjilal Assignee: Dmitriy Lyubimov Fix For: 1.0 Build an interactive shell /scripting (just like spark shell). Something very similar in R interactive/script runner mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1489) Interactive linear algebra shell script processor
[ https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13948276#comment-13948276 ] Dmitriy Lyubimov commented on MAHOUT-1489: -- I cannot assign to a non-committer, so i will be watching it with the assumption that the patch is coming from Saikat. (That was a condition of creating a new Jira.) Interactive linear algebra shell script processor --- Key: MAHOUT-1489 URL: https://issues.apache.org/jira/browse/MAHOUT-1489 Project: Mahout Issue Type: New Feature Affects Versions: 1.0 Reporter: Saikat Kanjilal Fix For: 1.0 Build an interactive shell /scripting (just like spark shell). Something very similar in R interactive/script runner mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings
[ https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13948280#comment-13948280 ] Dmitriy Lyubimov commented on MAHOUT-1490: -- Very good. I guess we need a DSL proposal from someone intimately familiar with R data frames. (I guess i qualify for that but i am probably not going to have enough time). Data frame R-like bindings -- Key: MAHOUT-1490 URL: https://issues.apache.org/jira/browse/MAHOUT-1490 Project: Mahout Issue Type: New Feature Reporter: Saikat Kanjilal Original Estimate: 20h Remaining Estimate: 20h Create Data frame R-like bindings for spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAHOUT-1490) Data frame R-like bindings
[ https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov reassigned MAHOUT-1490: Assignee: Dmitriy Lyubimov Data frame R-like bindings -- Key: MAHOUT-1490 URL: https://issues.apache.org/jira/browse/MAHOUT-1490 Project: Mahout Issue Type: New Feature Reporter: Saikat Kanjilal Assignee: Dmitriy Lyubimov Original Estimate: 20h Remaining Estimate: 20h Create Data frame R-like bindings for spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1489) Interactive Scala Spark Bindings Shell Script processor
[ https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1489: - Summary: Interactive Scala Spark Bindings Shell Script processor (was: Interactive linear algebra shell script processor) Interactive Scala Spark Bindings Shell Script processor --- Key: MAHOUT-1489 URL: https://issues.apache.org/jira/browse/MAHOUT-1489 Project: Mahout Issue Type: New Feature Affects Versions: 1.0 Reporter: Saikat Kanjilal Assignee: Dmitriy Lyubimov Fix For: 1.0 Build an interactive shell /scripting (just like spark shell). Something very similar in R interactive/script runner mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944755#comment-13944755 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 values but that seems smallish. Can't say when I'll get to it but it's on my list. If someone can jump in quicker--have at it. @Sebastian, actually matrix squaring is incredibly expensive -- size ^1.5 for the flops alone. Did your original version also use matrix squaring? How did it fare? Also, since the flops grow power-law w.r.t. input size (it is a problem for ssvd, too) we may need to contemplate a technique that creates finer splits for such computations based on input size. It may very well be the case that the original hdfs splits turn out to be too large for adequate load redistribution. Technically, it is extremely simple -- we'd just have to insert a physical operator tweaking RDD splits via a shuffle-less coalesce(), which also costs nothing in Spark. However, i am not sure what would be a sensible API for this -- automatic, semi-automatic, cost-based... I guess one brainless thing to do is to parameterize drmContext with the desired parallelism (~cluster task capacity) and have the optimizer insert physical operators that vary the # of partitions and do an automatic shuffle-less coalesce if the number is too low. Any thoughts? RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. 
This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
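The cost-based repartitioning idea from the comment reduces to a small heuristic: estimate flops from input size (the size^1.5 growth is from the comment; everything else here, including the per-task flop budget and the function name, is an assumption):

```scala
// Hypothetical split heuristic for the optimizer: flops for matrix squaring
// grow roughly as size^1.5, so hdfs-sized splits can leave individual tasks
// overloaded. Pick a partition count from the estimated cost, floored at the
// cluster's task capacity (the "~cluster task capacity" parameterization
// suggested above).
def desiredPartitions(inputSize: Long,
                      clusterTaskCapacity: Int,
                      flopsPerTask: Double = 1e9): Int = {
  val estFlops = math.pow(inputSize.toDouble, 1.5) // power-law cost estimate
  val byCost   = math.ceil(estFlops / flopsPerTask).toInt
  math.max(byCost, clusterTaskCapacity)            // never under-parallelize
}
```

A physical operator could then compare this number to the RDD's current partition count and insert the (cheap, shuffle-less where possible) coalesce automatically.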
[jira] [Comment Edited] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944755#comment-13944755 ] Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/24/14 5:10 AM: --- bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 values but that seems smallish. Can't say when I'll get to it but it's on my list. If someone can jump in quicker--have at it. @Sebastian, actually matrix squaring is incredibly expensive -- size ^1.5 for the flops alone. Did your original version also use matrix squaring? How did it fare? Also, since the flops grow power-law w.r.t. input size (it is a problem for ssvd, too) we may need to contemplate a technique that creates finer splits for such computations based on input size. It may very well be the case that the original hdfs splits turn out to be too large for adequate load redistribution. Technically, it is extremely simple -- we'd just have to insert a physical operator tweaking RDD splits via a shuffle-less coalesce(), which also costs nothing in Spark. However, i am not sure what would be a sensible API for this -- automatic, semi-automatic, cost-based... I guess one brainless thing to do is to parameterize the optimizer context with the desired parallelism (~cluster task capacity) and have the optimizer insert physical operators that vary the # of partitions and do an automatic shuffle-less coalesce if the number is too low. Any thoughts? was (Author: dlyubimov): bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 values but that seems smallish. Can't say when I'll get to it but it's on my list. If someone can jump in quicker--have at it. @Sebastian, actually matrix squaring is incredibly expensive -- size ^1.5 for the flops alone. Did your original version also use matrix squaring? How did it fare? 
Also, since the flops grow power-law w.r.t. input size (it is a problem for ssvd, too) we may need to contemplate a technique that creates finer splits for such computations based on input size. It may very well be the case that the original hdfs splits turn out to be too large for adequate load redistribution. Technically, it is extremely simple -- we'd just have to insert a physical operator tweaking RDD splits via a shuffle-less coalesce(), which also costs nothing in Spark. However, i am not sure what would be a sensible API for this -- automatic, semi-automatic, cost-based... I guess one brainless thing to do is to parameterize drmContext with the desired parallelism (~cluster task capacity) and have the optimizer insert physical operators that vary the # of partitions and do an automatic shuffle-less coalesce if the number is too low. Any thoughts? RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941511#comment-13941511 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- yeah .. those views.. i think they create at least 2 objects interim... not so cool for mass iterations. Oh well. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941517#comment-13941517 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- I have non-slim A'A. Of course the slim operator implementation is upper-triangular, which cuts the outer-product computation cost in half in comparison... A significantly wide A'A, on the other hand, cannot really apply the same cut, since it needs to form rows in a distributed way. Not surprisingly, the slim test takes 17 seconds and the fat one takes 21 seconds on my fairly ancient computer for squaring a 400x550 matrix (single thread). Actually, i expected a somewhat more significant gap. I wonder if there's a more interesting way to do this other than forming outer-product vertical blocks. Maybe I need to use square blocks. In this case i can reuse roughly half of them -- but then there will be significantly more objects (albeit smaller in size), and then i will still have to have an extra shuffle operation to form the lower-triangular part of the matrix. Anyway, i think i will commit what i have. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
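The upper-triangular trick behind the slim operator can be illustrated with a minimal in-core sketch; this uses plain arrays rather than Mahout's distributed matrix types, and the function name is hypothetical:

```scala
// Sketch of "slim" A'A: each row contributes its outer product to the
// Gramian, but only the upper triangle is accumulated, roughly halving the
// flops; the lower triangle is then filled in by symmetry.
def slimAtA(rows: Seq[Array[Double]], n: Int): Array[Array[Double]] = {
  val c = Array.ofDim[Double](n, n)
  for (row <- rows; i <- 0 until n if row(i) != 0.0; j <- i until n)
    c(i)(j) += row(i) * row(j)          // upper triangle only
  for (i <- 0 until n; j <- 0 until i)
    c(i)(j) = c(j)(i)                   // mirror down by symmetry
  c
}
```

The distributed wide case loses this shortcut precisely because the mirrored entries live in different vertical blocks on different machines, so forming them requires communication rather than a local copy.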
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941520#comment-13941520 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- For that very reason, i almost always use SRM and almost never SM. What i would really love is a sparse row and column block (a hash hanging from a hash); this seems like a recurring issue in blocking calculations such as ALS. SRM almost does that, except it uses a full-size vector to hang the sparse vectors off. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
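The "hash hanging from a hash" structure wished for above might look like the following sketch; this is not an existing Mahout class, and the name and API are made up for illustration:

```scala
import scala.collection.mutable

// Hypothetical sparse row-and-column block: an outer hash from row index to
// an inner hash from column index to value, so BOTH dimensions stay sparse.
// (SRM, by contrast, hangs sparse row vectors off a full-size dense array of
// row slots.)
class SparseHashBlock {
  private val rows = mutable.HashMap.empty[Long, mutable.HashMap[Long, Double]]

  def update(i: Long, j: Long, v: Double): Unit =
    rows.getOrElseUpdate(i, mutable.HashMap.empty[Long, Double]).update(j, v)

  def apply(i: Long, j: Long): Double =
    rows.get(i).flatMap(_.get(j)).getOrElse(0.0)  // absent cells read as 0
}
```

For blocking calculations like ALS, where a block may touch only a small subset of row indices, the memory saving over a dense row-slot array can be substantial.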
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941531#comment-13941531 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- No, i think blockify is fine. It probably can run a bit faster than it does, but oh well. And mapblock doesn't trigger it (or, rather, it is evaluated lazily; and if the previous operator already produced blocks, then blockify is not used). What i was saying is along the lines of the A'A computation. There's a structure that is used to fuse operators, which is sort of an either/or of the DrmRdd and BlockifiedDrmRdd types. I came to the conclusion that there are operators that are an absolute pain to implement on blocks, and there are some that would be a pain to implement on row-vector bags. But blocks can be presented as row bags via viewing, so conversion to blocks happens only if a subsequent operator requires it. What's more, usually a block operator outputs blocks as well, and vice versa, so realistically blockify happens not so often at all. Another caveat is that one has to be careful with map blocks that have side effects on the RDD of origin. Even though Spark says all RDDs are immutable, side effects will stay visible to parent RDDs if they are cached as MEMORY_ONLY or MEMORY_AND_DISK (i.e. without the mandatory clone-via-serialization in the block manager) and are then subsequently used as a source again. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. 
Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
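The side-effect caveat about cached RDDs suggests a defensive pattern: deep-copy a block before mutating it inside a mapBlock-style closure. A plain-Scala sketch of that pattern (not the actual Mahout mapBlock API; the function name is hypothetical):

```scala
// If the source RDD is cached MEMORY_ONLY / MEMORY_AND_DISK, Spark hands the
// closure the SAME object the block manager holds, so in-place mutation
// would silently corrupt the cached parent for any later re-read.
// Defensive pattern: clone first, mutate the copy, return the copy.
def mapBlockSafely(block: Array[Array[Double]],
                   f: Double => Double): Array[Array[Double]] = {
  val copy = block.map(_.clone())              // deep copy: parent stays intact
  for (row <- copy; i <- row.indices)
    row(i) = f(row(i))                         // mutate only the private copy
  copy
}
```

The copy costs one allocation per block, which is usually cheap insurance against a corruption bug that only appears when the same cached RDD is read twice.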
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941541#comment-13941541 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- Oh, you mean in the case of sparse row vectors. You are probably right. Indeed, there's currently a SparseMatrix there in this case; I think it should be SparseRowMatrix of course. Most of the cases should benefit from it. The problem is, like i said, mapblock doesn't really form it; nor does any other physical operator have any knowledge of what formed it. It is possible to optimize the entire operator-fusion chain based on the subsequent operator's preferred type; that's actually a very neat idea for in-core speed optimization, but i have no capacity to pursue this technique at the moment. It needs some digestion anyway (at least on my end). It requires experiments with in-core operations. At first glance, most non-multiplicative operators would be ok with a row-wise matrix, as well as with deblockifying views. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941545#comment-13941545 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- If anything, i at least see a non-negligible speed-up in the blockification itself once i use the row matrix. I think i will commit that. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941354#comment-13941354 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- Actually, non-slim A'A operator is practically A'B without need for a zip... So we are almost done, the biggest work here is the test I suppose. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941355#comment-13941355 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- Actually, non-slim A'A operator is practically A'B without need for a zip... So we are almost done, the biggest work here is the test I suppose. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939451#comment-13939451 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- That's what i normally do, yes. The scalabindings issue points to a branch on github. Then there's the commit-squash method (described in my blog) i do when pushing to svn. Hopefully we'll see direct git pushes for mahout sooner rather than later. However, seeing a combined (squashed) patch is pretty useful too, as opposed to tons of individual commits. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939468#comment-13939468 ] Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/18/14 4:56 PM: --- @[~ssc] Looking nice. I guess we want non-skinny version of operator A'A still, i may be able to look into it. was (Author: dlyubimov): [~ssc] Looking nice. I guess we want non-skinny version of operator A'A still, i may be able to look into it. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939468#comment-13939468 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- [~ssc] Looking nice. I guess we want non-skinny version of operator A'A still, i may be able to look into it. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939513#comment-13939513 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- http://weatheringthrutechdays.blogspot.com/2011/04/git-github-and-committing-to-asf-svn.html RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937951#comment-13937951 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- I only ever ran spark code with an hdfs cluster of cdh 4. The Mapreduce api is irrelevant (which is where most of the 2.0-vs-1.0 differences happen); only hdfs matters, since spark doesn't need an mr cluster. Spark can also run under yarn supervision, which would imply 2.0, but i would strongly recommend against that and use mesos plus zookeeper instead. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937954#comment-13937954 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- Ps spark module has cdh4 maven profile. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: (was: ScalaSparkBindings.pdf) Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: ScalaSparkBindings.pdf updating docs to reflect latest committed state. Brought in distributed and in-core stochastic PCA scripts, colmeans, colsums, drm-vector multiplication, more tests etc.etc. see the doc. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938799#comment-13938799 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- What's the best way to share the PDF source? i can put it on the site so committers can re-generate it. otherwise, its source is right now in my github doc branch here, and a pull request is definitely a possible way to collaborate too: https://github.com/dlyubimov/mahout-commits/tree/ssvd-docs RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938805#comment-13938805 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- 1. {code} val C = A.t %*% A {code} I don't remember if i actually put in the physical operator for non-skinny A. There are two distinct algorithms to deal with it. The skinny one (n = 5000 or something) uses an upper-triangular, vector-backed accumulator to combine stuff right in map. Of course, if the accumulator does not realistically fit in memory, then another algorithm has to be plugged in for A-squared. See AtA.scala, def at_a_nongraph(). It currently throws UnsupportedOperation (but everything i have done so far only uses slim A'A) 2. when using partial functions with mapBlock, you actually do not have to use ({...}) but just { }: {code} drmBt = drmBt.mapBlock() { case (keys, block) => //... keys -> block } {code} RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
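For what it's worth, the bare `{ case ... }` shorthand works because Scala accepts a pattern-matching anonymous function wherever a function literal is expected. A minimal self-contained sketch; the `mapBlock` here is a toy stand-in with an invented signature, not the actual Mahout operator:

```scala
// Toy stand-in for mapBlock: takes a key/block pair plus a function over it.
// Illustrative only -- not the real Mahout DRM API.
def mapBlock[K, B, R](kb: (K, B))(f: ((K, B)) => R): R = f(kb)

// A `{ case ... }` block is a pattern-matching anonymous function, so it can
// be passed directly in the second parameter list without extra parentheses:
val res = mapBlock((Seq(0, 1), Array(1.0, 2.0))) {
  case (keys, block) =>
    keys -> block.map(_ * 2) // return the (keys, transformed block) pair
}
```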
[jira] [Commented] (MAHOUT-1464) RowSimilarityJob on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938810#comment-13938810 ] Dmitriy Lyubimov commented on MAHOUT-1464: -- Also, just FYI, much as i love to use single-letter capitals for matrices, it turns out Scala does not really accept them in all situations. For example, {code} val (U, V, s) = ssvd(...) {code} doesn't compile. So i ended up using, perhaps verbosely, drmA and inCoreA notations. Perhaps we can agree on what's reasonable. RowSimilarityJob on Spark - Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Environment: hadoop, spark Reporter: Pat Ferrel Labels: performance Fix For: 0.9 Attachments: MAHOUT-1464.patch Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422 which is a feature request for RSJ on two inputs to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
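The reason, for the record: in a Scala pattern, an identifier starting with a capital letter is read as a reference to an existing stable value to match against, not as a fresh binding. A small sketch with toy tuple values (no actual `ssvd` involved):

```scala
// In patterns, capitalized names refer to existing stable identifiers:
//   val (U, V, s) = result   // does not compile here: no values U, V in scope
// Lowercase-led names (drmU, inCoreA, ...) bind fresh variables instead.
val result = (1.0, 2.0, 3.0)
val (drmU, drmV, s) = result
```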
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Affects Version/s: (was: 0.8) 0.9 Status: Patch Available (was: Open) Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: ScalaSparkBindings.pdf Ok this is finally done. SSVD is working, notes updated. I will commit it later tonight after additional review for misc stuff. Please look at the final pdf api, and the source if needed. This will also contain a fix for the CholeskyDecomposition bug that always reports a degenerate matrix. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923389#comment-13923389 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- Most of this code is not distributed-tested. Assumption is that we will have to continue working stuff out and gauge bottlenecks of concrete implementations. It is possible additional tuning parameters will be required, esp. for stuff that does blocking etc. So it should be marked as evolving . Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923389#comment-13923389 ] Dmitriy Lyubimov edited comment on MAHOUT-1346 at 3/7/14 5:08 AM: -- Most of this code is not distributed-tested. Unit tests do due diligence and ensure matrices are produced with more than a trivial single partition, and i also verified some stuff on a live single-node spark, but i haven't tried any significant datasets in a real-life cluster. The assumption is that we will have to continue working stuff out and gauge bottlenecks of concrete implementations. It is possible additional tuning parameters will be required, esp. for stuff that does blocking etc. So it should be marked as evolving. was (Author: dlyubimov): Most of this code is not distributed-tested. Assumption is that we will have to continue working stuff out and gauge bottlenecks of concrete implementations. It is possible additional tuning parameters will be required, esp. for stuff that does blocking etc. So it should be marked as evolving . Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: (was: ScalaSparkBindings.pdf) Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: ScalaSparkBindings.pdf update Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: (was: ScalaSparkBindings.pdf) Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: ScalaSparkBindings.pdf @Sebastian (et al.) could you please review, if not the code, then at least the API pdf (attached)? At this point i have all the functional components to do distributed SSVD in the dsl, so it is really on the verge of commit, but i wouldn't want to do that without any review at all (given how relatively big and conceptual this thing is). Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Attachment: ScalaSparkBindings.pdf WIP manual and working notes Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: ScalaSparkBindings.pdf Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906741#comment-13906741 ] Dmitriy Lyubimov commented on MAHOUT-1365: -- Yeah. I am not sure what they are doing there. Last time i looked at it, MLLib did not have any form of weighted ALS. Now this example seems to include trainImplicit, which works on the rating matrix only. In the original formulation of the implicit feedback problem there were two values: a preference, and a confidence in that preference. So i am not sure what they do there, since the input is obviously one sparse matrix. My generalization of the problem includes a formulation where any confidence level could be attached to either 0 or 1 as a preference, plus a baseline. I also assume that the model may have more than one parameter to form confidence, which requires fitting as well (simply speaking, what is the level of consumption if a user clicks on it vs. add-2-cart, if any, etc.). Similarly, there could be different levels of confidence of ignoring stuff depending on the situation. So 0 preferences do not always have to have the baseline confidence either. Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is following ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). 
the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
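For reference, the standard Hu-Koren-Volinsky implicit-feedback objective that this work generalizes (standard form from the paper; the comment above replaces the paper's fixed confidence rule with arbitrary per-cell confidences, so treat this as background, not the exact formulation being implemented):

```latex
\min_{X,Y}\; \sum_{u,i} c_{ui}\left(p_{ui} - x_u^{\top} y_i\right)^2
  \;+\; \lambda\left(\sum_u \lVert x_u\rVert^2 + \sum_i \lVert y_i\rVert^2\right)
```

where $p_{ui}$ is the (0/1) preference, $c_{ui}$ the confidence attached to it, and $x_u$, $y_i$ the user and item factor vectors.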
[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906752#comment-13906752 ] Dmitriy Lyubimov commented on MAHOUT-1365: -- that's reasonable encoding i suppose. Good idea. Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is following ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies prefilitered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode
[ https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov resolved MAHOUT-1408. -- Resolution: Won't Fix Don't see a reason to do anything. Distributed cache file matching bug while running SSVD in broadcast mode Key: MAHOUT-1408 URL: https://issues.apache.org/jira/browse/MAHOUT-1408 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.8 Reporter: Angad Singh Assignee: Dmitriy Lyubimov Priority: Minor Attachments: BtJob.java.patch The error is: java.lang.IllegalArgumentException: Unexpected file name, unable to deduce partition #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154) at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1) at java.util.Arrays.mergeSort(Arrays.java:1270) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.sort(Arrays.java:1210) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94) at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at 
org.apache.hadoop.mapred.Child.main(Child.java:260) The bug is @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java, near line 220. and @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java near line 144. SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache will have a particular pattern whereas we have jar files in our distributed cache which causes the above exception. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1346: - Fix Version/s: (was: Backlog) 1.0 Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906699#comment-13906699 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- This is now tracked here https://github.com/dlyubimov/mahout-commits/tree/dev-1.0-spark new module spark. I have been rewriting certain things anew. Concepts: (a) Logical operators (including DRM sources) are expressed as the DRMLike trait. (b) taking a note from the spark book, DRM operators (such as %*% or t) form an operator lineage. The operator lineage does not get optimized into an RDD until an action is applied (spark terminology used). (c) Unlike in spark, an action doesn't really cause any execution but rather (1) forms the optimized RDD sequence and (2) produces a checkpointed DRM. Consequently, a checkpointed DRM has an RDD lineage attached to it, which is also marked for caching. Subsequent lineages starting out of a checkpointed DRM will not be able to optimize beyond this checkpoint. (d) there's a super action on a checkpointed RDD - such as collection or persistence to HDFS - that triggers, if necessary, the optimization checkpoint and the Spark action. E.g. {code} val A = drmParallelize(...) // doesn't do anything; gives the operator lineage an opportunity to grow further before being optimized val squaredA = A.t %*% A // we may trigger optimizer and RDD lineage generation and caching explicitly by: squaredA.checkpoint() // Or, we can call a superAction directly. This will trigger checkpoint() implicitly if not yet done val inCoreSquaredA = squaredA.collect() {code} Generally, i have support for only very few things -- I actually dropped all previously implemented Bagel algorithms. So in fact i have less support now than in the 0.9 branch. i have kryo support for Mahout vectors and matrix blocks. I have hdfs read/write of Mahout's DRM into the DRMLike trait. 
I have some DSL defined, such as: A %*% B; A %*% inCoreB; inCoreA %*%: B; A.t; inCoreA = A.collect; A.blockify (coalesces split records into an RDD of vertical blocks -- sort of a paradigm similar to MLI's MatrixSubmatrix, except I implemented it before MLI was announced for the first time :) so no MLI influence here in fact). So now i need to reimplement what Bagel used to be doing, plus optimizer rules for choosing a distributed algorithm based on cost rules. In fact i came to the conclusion there was 0 benefit in using Bagel in the first place, since it just maps all its primitives into shuffle-and-hash group-by RDD operations, so there is no actual operational benefit to using it. I probably will reconstitute the algorithms at the first iteration using regular spark primitives (groupBy and cartesian for multiplication blocks). Once i plug in the missing pieces (e.g. slim matrix multiplication), I bet i will be able to fit the distributed SSVD version in 40 lines, just like the in-core one :) Weighted ALS will still look less elegant because of some lacking features in linear algebra. For example, it seems like sparse block support is needed (i.e. a bunch of sparse row or column vectors hanging off a very small hash map instead of a full-size array as in SparseRow(Column)Matrix today), while still keeping things mostly R-like scripted as far as working with matrix blocks and decompositions. So at this point i'd be willing to hear input on these ideas and direction. Perhaps some suggestions. Thanks. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. 
The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
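The lazy lineage idea in points (b)-(d) of the MAHOUT-1346 comment above can be sketched in a few lines. Everything here is an illustrative toy (invented names, no optimizer, no Spark), not the actual DRMLike API:

```scala
// Operators only build a tree ("lineage"); nothing runs until an action.
sealed trait Drm {
  def t: Drm = OpT(this)
  def %*%(that: Drm): Drm = OpMul(this, that)
  // toy "super action": only here would optimization + Spark execution fire
  def collect(run: Drm => Int): Int = run(this)
}
case class Source(name: String) extends Drm
case class OpT(a: Drm) extends Drm
case class OpMul(a: Drm, b: Drm) extends Drm

var executed = 0
val a = Source("A")
val squaredA = a.t %*% a // builds lineage only; `executed` is still 0 here
val result = squaredA.collect { plan => executed += 1; 42 }
```

The point of the sketch: `squaredA` is just the tree `OpMul(OpT(Source("A")), Source("A"))`, available in full to an optimizer before anything executes.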
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906710#comment-13906710 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- a few obvious optimizer rules: A.t %*% A is obviously detected as a family of unary algorithms rather than a binary multiplication algorithm. Geometry and the non-zero element estimate play a role in the selection of the type of algorithm. The biggest multiplication via group-by will have to deal, obviously, with the cartesian operator and will apply to (A * B'). Obvious rewrites: A'*B' = (B * A)' (transposition push-up, including elementwise operators too); (A')' = A (transposition merge); cost-based grouping of (A*B)*C versus A*(B*C); special distributed algorithm versions for in-core operands and diagonal matrices. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
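The transposition rewrites mentioned in that comment are easy to express over a logical operator tree. A toy single-pass sketch, not Mahout's actual optimizer, and with no cost model:

```scala
// Minimal logical operator tree for matrix expressions.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class T(a: Expr) extends Expr            // A'
case class Mul(a: Expr, b: Expr) extends Expr // A %*% B

// One top-down rewrite pass over the tree.
def rewrite(e: Expr): Expr = e match {
  case T(T(a))         => rewrite(a)            // (A')' = A  (transposition merge)
  case Mul(T(a), T(b)) => T(rewrite(Mul(b, a))) // A'*B' = (B*A)'  (push-up)
  case Mul(a, b)       => Mul(rewrite(a), rewrite(b))
  case T(a)            => T(rewrite(a))
  case leaf            => leaf
}
```

For example, `rewrite(Mul(T(Leaf("A")), T(Leaf("B"))))` produces `T(Mul(Leaf("B"), Leaf("A")))`, i.e. a single transpose pushed above one multiplication.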
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Fix Version/s: (was: Backlog) 1.0 Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version), following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated), such as an anomalously active user which doesn't represent a typical user anyway. The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906725#comment-13906725 ] Dmitriy Lyubimov edited comment on MAHOUT-1365 at 2/20/14 7:54 AM: --- quite possibly could be. The only thing that I do differently here is the merge of the approaches of the implicit feedback and weighted regularization papers, but that's minor. See the pdf. was (Author: dlyubimov): quite possibly could be. The only thing that I do differently here is the merge of the approaches of the implicit feedback and weighted regularization papers, but that's minor. Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version), following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated), such as an anomalously active user which doesn't represent a typical user anyway. The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906725#comment-13906725 ] Dmitriy Lyubimov commented on MAHOUT-1365: -- quite possibly could be. The only thing that I do differently here is the merge of the approaches of the implicit feedback and weighted regularization papers, but that's minor. Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version), following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated), such as an anomalously active user which doesn't represent a typical user anyway. The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906729#comment-13906729 ] Dmitriy Lyubimov commented on MAHOUT-1365: -- Oh, and the implicit feedback paper doesn't generalize the search for confidence parameters, of course. I ignore that formulation here completely, but eventually there should be an outer procedure to search for the optimum. My particular problem was including multiple events with generally unknown confidence weights, unlike the original implicit feedback work. Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version), following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated), such as an anomalously active user which doesn't represent a typical user anyway. The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
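For reference, the per-user solve in the confidence-weighted formulation being discussed is x_u = (Y'C_uY + lam*I)^-1 Y'C_u p_u, where C_u is the diagonal confidence matrix for user u and p_u the binary preference vector. A minimal NumPy sketch of that single half-step; the function name and shapes are illustrative assumptions, not the attached pdf's notation:

```python
import numpy as np

def als_user_update(Y, c_u, p_u, lam):
    """One confidence-weighted ALS half-step for a single user:
    solve (Y' C_u Y + lam*I) x_u = Y' C_u p_u for the user's factors.

    Y    : item-factor matrix, shape (n_items, k)
    c_u  : confidence weights for this user, shape (n_items,)
    p_u  : 0/1 preference vector, shape (n_items,)
    lam  : ridge regularization (the "WR" weighting hooks in here)
    """
    k = Y.shape[1]
    Cu = np.diag(c_u)
    A = Y.T @ Cu @ Y + lam * np.eye(k)
    b = Y.T @ Cu @ p_u
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))              # 6 items, rank-2 factors
c_u = 1.0 + rng.random(6)                # confidences, larger where events observed
p_u = np.array([1.0, 0, 1, 0, 0, 1])     # binary preferences
x_u = als_user_update(Y, c_u, p_u, lam=0.1)
```

The item half-step is symmetric; iterating the two until the parameterized convergence test passes gives the full ALS loop the issue describes.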
[jira] [Commented] (MAHOUT-1412) Build warning due to multiple Scala versions
[ https://issues.apache.org/jira/browse/MAHOUT-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13891606#comment-13891606 ] Dmitriy Lyubimov commented on MAHOUT-1412: -- scalatest has not built, and it seems does not build, artifacts for Scala 2.9.3. There are artifacts for 2.9.2 and 2.10; they seem to imply they build only one artifact fit for all of the 2.9.x line. Scala is mostly introduced to build a mixed environment of in-core operations and Spark, so it tracks Spark's versions of Scala and scalatest. Spark just released 0.9.0 and scalatest just released 2.0 -- we can bump to these eventually. Build warning due to multiple Scala versions Key: MAHOUT-1412 URL: https://issues.apache.org/jira/browse/MAHOUT-1412 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Frank Scholten Priority: Minor I see the following build warning: 22:42:07 [WARNING] Expected all dependencies to require Scala version: 2.9.3 22:42:07 [WARNING] org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires scala version: 2.9.3 22:42:07 [WARNING] org.scalatest:scalatest_2.9.2:1.9.1 requires scala version: 2.9.2 22:42:07 [WARNING] Multiple versions of scala libraries detected! Which version should we use? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889842#comment-13889842 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- Aha. Spark 0.9.0 with GraphX is finally released. Time to get hands dirty a bit in this methinks.. Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Spark bindings for Mahout DRM. DRM DSL. Disclaimer. This will all be experimental at this point. The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose RDD containing vertical blocks of original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode
[ https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov reassigned MAHOUT-1408: Assignee: Dmitriy Lyubimov Distributed cache file matching bug while running SSVD in broadcast mode Key: MAHOUT-1408 URL: https://issues.apache.org/jira/browse/MAHOUT-1408 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.8 Reporter: Angad Singh Assignee: Dmitriy Lyubimov Priority: Minor Attachments: BtJob.java.patch The error is: java.lang.IllegalArgumentException: Unexpected file name, unable to deduce partition #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154) at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1) at java.util.Arrays.mergeSort(Arrays.java:1270) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.sort(Arrays.java:1210) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94) at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at 
org.apache.hadoop.mapred.Child.main(Child.java:260) The bug is @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java, near line 220. and @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java near line 144. SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache will have a particular pattern whereas we have jar files in our distributed cache which causes the above exception. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode
[ https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880200#comment-13880200 ] Dmitriy Lyubimov commented on MAHOUT-1408: -- I take it you are trying to use the SSVD solver in some sort of embedded mode, not the pure Mahout CLI? Still, I am not sure why you want to wrest control over MapReduce from the SSVD solver in individual MR steps. Additional jars will not get there (nor are they needed by the SSVD jobs). Mahout architecture in general, and this pipeline in particular, does not assume you get to manipulate individual job settings. This pipeline step legitimately expects to find in the cache the files that the SSVD pipeline has put into it. I would like to place the burden on you to explain why you think the SSVD pipeline should expect someone messing with its MR settings. Assuming, however, that your reasons are valid: this (the BtJob MR) would not be the only MR case where the cache is used in the SSVD pipeline, and this patch would not be sufficient to do this throughout. 
Distributed cache file matching bug while running SSVD in broadcast mode Key: MAHOUT-1408 URL: https://issues.apache.org/jira/browse/MAHOUT-1408 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.8 Reporter: Angad Singh Assignee: Dmitriy Lyubimov Priority: Minor Attachments: BtJob.java.patch The error is: java.lang.IllegalArgumentException: Unexpected file name, unable to deduce partition #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154) at org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1) at java.util.Arrays.mergeSort(Arrays.java:1270) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.mergeSort(Arrays.java:1281) at java.util.Arrays.sort(Arrays.java:1210) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:94) at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260) The bug is @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java, near line 220. 
and @ https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java near line 144. SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache will have a particular pattern whereas we have jar files in our distributed cache which causes the above exception. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
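The described failure — a comparator that assumes every distributed-cache entry is an MR part file, and throws on unrelated entries such as jars — and the direction of a fix (filter before sorting) can be sketched as follows. The regex and function names here are illustrative assumptions, not the actual SSVDHelper code:

```python
import re

# Illustrative pattern for MR part files such as "part-r-00001" or
# "Q-m-00042.deflate"; the real PARTITION_COMPARATOR uses its own regex.
PART_RE = re.compile(r".*-(m|r)-(\d+)(\..*)?$")

def partition_number(path):
    """Deduce the partition # from a part-file name, as the comparator does.
    Raises on names that don't match -- the reported failure mode."""
    m = PART_RE.match(path)
    if m is None:
        raise ValueError("unable to deduce partition #: %s" % path)
    return int(m.group(2))

def sort_part_files(paths):
    """Keep only entries that look like MR part files, then sort by partition #.
    Filtering first means foreign cache entries (e.g. java-launcher.jar) are
    skipped instead of aborting the sort."""
    parts = [p for p in paths if PART_RE.match(p)]
    return sorted(parts, key=partition_number)

files = ["java-launcher.jar", "Q-r-00001.deflate", "Q-r-00000.deflate"]
```

As the comment above notes, BtJob is not the only place the pipeline reads the cache, so a real fix would have to apply this filtering wherever part files are enumerated.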
[jira] [Updated] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1397: - Resolution: Invalid Status: Resolved (was: Patch Available) mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Assignee: Dmitriy Lyubimov Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
{code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
{code}
Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test-compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13876145#comment-13876145 ] Dmitriy Lyubimov commented on MAHOUT-1397: -- on a side note -- i spent more than a decade with eclipse. It was scala and maven support in eclipse (or, rather, lack of thereof) that finally forced my hand to switch. mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Assignee: Dmitriy Lyubimov Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error. {code} plugin groupIdorg.scala-tools/groupId artifactIdmaven-scala-plugin/artifactId executions execution goals goalcompile/goal goaltestCompile/goal /goals /execution /executions configuration sourceDirsrc/main/scala/sourceDir jvmArgs jvmArg-Xms64m/jvmArg jvmArg-Xmx1024m/jvmArg /jvmArgs /configuration /plugin {code} Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test- compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875085#comment-13875085 ] Dmitriy Lyubimov commented on MAHOUT-1397: -- you sure? i trust what idea prompts me. Ok let me check. mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error. {code} plugin groupIdorg.scala-tools/groupId artifactIdmaven-scala-plugin/artifactId executions execution goals goalcompile/goal goaltestCompile/goal /goals /execution /executions configuration sourceDirsrc/main/scala/sourceDir jvmArgs jvmArg-Xms64m/jvmArg jvmArg-Xmx1024m/jvmArg /jvmArgs /configuration /plugin {code} Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test- compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875089#comment-13875089 ] Dmitriy Lyubimov commented on MAHOUT-1397: -- Hm. Either IntelliJ is wrong, or you. mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error. {code} plugin groupIdorg.scala-tools/groupId artifactIdmaven-scala-plugin/artifactId executions execution goals goalcompile/goal goaltestCompile/goal /goals /execution /executions configuration sourceDirsrc/main/scala/sourceDir jvmArgs jvmArg-Xms64m/jvmArg jvmArg-Xmx1024m/jvmArg /jvmArgs /configuration /plugin {code} Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test- compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov reassigned MAHOUT-1397: Assignee: Dmitriy Lyubimov mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Assignee: Dmitriy Lyubimov Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error. {code} plugin groupIdorg.scala-tools/groupId artifactIdmaven-scala-plugin/artifactId executions execution goals goalcompile/goal goaltestCompile/goal /goals /execution /executions configuration sourceDirsrc/main/scala/sourceDir jvmArgs jvmArg-Xms64m/jvmArg jvmArg-Xmx1024m/jvmArg /jvmArgs /configuration /plugin {code} Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test- compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1397) mahaout-math-scala/pom.xml not readable
[ https://issues.apache.org/jira/browse/MAHOUT-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875091#comment-13875091 ] Dmitriy Lyubimov commented on MAHOUT-1397: -- also: http://scala-tools.org/mvnsites/maven-scala-plugin/usage.html
{code:title=correct usage example}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <scalaVersion>${scala.version}</scalaVersion>
  </configuration>
</plugin>
{code}
Sorry, I think you need to try to be more convincing. -d mahaout-math-scala/pom.xml not readable --- Key: MAHOUT-1397 URL: https://issues.apache.org/jira/browse/MAHOUT-1397 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Windows 7 Professional 64 bit Eclipse: Version: Kepler Service Release 1 Build id: 20130919-0819 maven 3.0.5 Java: jdk1.6.0_45 Reporter: Maruf Aytekin Labels: maven Fix For: 1.0 maven-scala-plugin in mahaout-math-scala/pom.xml gives an error.
{code}
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <sourceDir>src/main/scala</sourceDir>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx1024m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
{code}
Error displayed: {quote} Multiple annotations found at this line: - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:compile (execution: default, phase: compile) - Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test-compile) {quote} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1305) Rework the wiki
[ https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861764#comment-13861764 ] Dmitriy Lyubimov commented on MAHOUT-1305: -- By all means I agree. I also still owe migration for scala bindings of Mahout's math per M-1297 (I guess last time i was thrown off by same CMS issues On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA) Rework the wiki --- Key: MAHOUT-1305 URL: https://issues.apache.org/jira/browse/MAHOUT-1305 Project: Mahout Issue Type: Bug Components: Documentation Reporter: Sebastian Schelter Priority: Blocker Fix For: 0.9 Attachments: MAHOUT-221213-1315-15716.pdf We should think about completely redoing our wiki. At the moment, we're listing lots of algorithms that we either never implemented or already removed. I also have the impression that a lot of stuff is outdated. It would be awesome if we had an up-to-date documentation of the code with instructions on how to get into using mahout quickly. We should also have examples for all our 3 C's. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1305) Rework the wiki
[ https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861764#comment-13861764 ] Dmitriy Lyubimov edited comment on MAHOUT-1305 at 1/3/14 6:50 PM: -- By all means I agree. I also still owe migration for scala bindings of Mahout's math per M-1297 (I guess last time i was thrown off by some CMS issues) On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA) was (Author: dlyubimov): By all means I agree. I also still owe migration for scala bindings of Mahout's math per M-1297 (I guess last time i was thrown off by same CMS issues On Thu, Jan 2, 2014 at 2:39 PM, Isabel Drost-Fromm (JIRA) Rework the wiki --- Key: MAHOUT-1305 URL: https://issues.apache.org/jira/browse/MAHOUT-1305 Project: Mahout Issue Type: Bug Components: Documentation Reporter: Sebastian Schelter Priority: Blocker Fix For: 0.9 Attachments: MAHOUT-221213-1315-15716.pdf We should think about completely redoing our wiki. At the moment, we're listing lots of algorithms that we either never implemented or already removed. I also have the impression that a lot of stuff is outdated. It would be awesome if we had an up-to-date documentation of the code with instructions on how to get into using mahout quickly. We should also have examples for all our 3 C's. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1305) Rework the wiki
[ https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860342#comment-13860342 ] Dmitriy Lyubimov commented on MAHOUT-1305: -- SSVD pages https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition along with all attachements and references must be retained. I spent a lot of time writing instructions and explanations there. In fact, it is flagship method now for dimensionality reduction, PCA and LSA-like problems . I inserted references to PCA, dimensionality reduction and SVD pages to this method as a first try over Lanczos and see these references also gone now in the public version. Rework the wiki --- Key: MAHOUT-1305 URL: https://issues.apache.org/jira/browse/MAHOUT-1305 Project: Mahout Issue Type: Bug Components: Documentation Reporter: Sebastian Schelter Priority: Blocker Fix For: 0.9 Attachments: MAHOUT-221213-1315-15716.pdf We should think about completely redoing our wiki. At the moment, we're listing lots of algorithms that we either never implemented or already removed. I also have the impression that a lot of stuff is outdated. It would be awesome if we had an up-to-date documentation of the code with instructions on how to get into using mahout quickly. We should also have examples for all our 3 C's. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf) Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is following ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies prefilitered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf updating, minor errors in pdf doc Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is following ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies prefilitered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf) Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is followsing ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies prefilitered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Description: Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version). Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered out (such as an anomalously active user, which doesn't represent a typical user anyway). The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. was: Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky method (stripping off any concrete methodology to build C matrix), with parameterized test for convergence. The computational scheme is followsing ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies prefilitered (eliminated) (such as an anomalously active user which doesn't represent typical user anyway). the work is going here https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. 
[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832110#comment-13832110 ] Dmitriy Lyubimov edited comment on MAHOUT-1365 at 11/26/13 9:15 PM: Oh. One thing to mention is that the confidence matrix C is not sparse per se. but if there's a base confidence c_0 such that subtracting it from each element of C turns it into sparse matrix C', then we can use that matrix as an input (along with c_0 parameter). This is further clarified in the attachment (which is basically just a conspectus of both papers for my own sake.) See attached. was (Author: dlyubimov): Oh. One thing to mention is that the confidence matrix C is not sparse per se. but if there's a base confidence c_0 such that subtracting it from each element of C turns it into sparse matrix C', then we can use that matrix as an input (along with c_0 parameter). This is further clarified in the attachment (which is basically just a conspect of both papers for my own sake.) See attached.
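The c_0 trick described in the comment above is easy to make concrete. Below is an illustrative Python/NumPy sketch, not Mahout code (the names and numbers are invented for the example): the dense confidence matrix C is represented as a base confidence c_0 plus a sparse residual C' = C - c_0, so only the observed cells are ever materialized.

```python
import numpy as np

# Base confidence c0 everywhere, plus extra confidence only where
# feedback was actually observed.
c0 = 1.0
n_users, n_items = 3, 4

# Sparse residual C' = C - c0, stored as {(user, item): extra_confidence}.
C_prime = {(0, 1): 40.0, (2, 3): 15.0}

def confidence(u, i):
    """Recover any cell of the dense C from c0 and the sparse residual."""
    return c0 + C_prime.get((u, i), 0.0)

# The dense C need never be materialized, but every cell is recoverable:
C_dense = np.full((n_users, n_items), c0)
for (u, i), extra in C_prime.items():
    C_dense[u, i] += extra

assert confidence(0, 1) == C_dense[0, 1] == 41.0
assert confidence(1, 2) == C_dense[1, 2] == c0  # unobserved cell is just c0
```

This is exactly why passing (C', c_0) instead of C keeps the input sparse: the residual has one entry per observed interaction, regardless of the matrix dimensions.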
[jira] [Commented] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)
[ https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13833264#comment-13833264 ] Dmitriy Lyubimov commented on MAHOUT-1297: -- yes. Spark 0.8 still has a Scala 2.9.3 dependency, and since this issue is really a dependency crutch for the Spark-distributed issues (at this point anyway), that constraint carries over. New module for linear algebra scala DSL (in-core operators support only to start with) -- Key: MAHOUT-1297 URL: https://issues.apache.org/jira/browse/MAHOUT-1297 Project: Mahout Issue Type: New Feature Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 0.9 See the initial set of in-core R-like operations here: http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html. A separate DSL for matlab-like syntax is being developed. The differences here are about replacing the R-like %*% with * and finding another way to express elementwise * and /. -- This message was sent by Atlassian JIRA (v6.1#6144)
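On the operator question in the issue description (R-like %*% versus a matlab-like *): the two products the DSLs must keep distinct can be shown with NumPy as a neutral stand-in; this is not the Mahout DSL itself, just an illustration of the semantics.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# The R-like %*% of the scala DSL is a true matrix product (NumPy's @).
matrix_product = A @ B

# A matlab-like DSL would spell the matrix product as plain *, and then
# needs another spelling for this elementwise (Hadamard) product.
hadamard = A * B

assert np.array_equal(matrix_product, np.array([[19.0, 22.0],
                                                [43.0, 50.0]]))
assert np.array_equal(hadamard, np.array([[5.0, 12.0],
                                          [21.0, 32.0]]))
```

The design tension is just this: one symbol, two incompatible meanings, so whichever product gets * forces a new notation for the other.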
[jira] [Created] (MAHOUT-1363) Rebase packages in mahout-scala
Dmitriy Lyubimov created MAHOUT-1363: Summary: Rebase packages in mahout-scala Key: MAHOUT-1363 URL: https://issues.apache.org/jira/browse/MAHOUT-1363 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Minor Fix For: 0.9 It has occurred to me that in my commit of the mahout-scala stuff, i haven't rebased packages onto o.a.m... as has been discussed. It also has occurred to me that putting that stuff into o.a.m.math in this case may create unwelcome interference between the java and scala stuff. So I am moving the scala math DSL stuff into the o.a.m.math.scalabindings package. It is awfully awkward compared to the plain mahout.math scala-style package it bears now, but i guess modern IDE tools make it no problem to import. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1363) Rebase packages in mahout-scala
[ https://issues.apache.org/jira/browse/MAHOUT-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1363: - Status: Patch Available (was: Open)
[jira] [Updated] (MAHOUT-1363) Rebase packages in mahout-scala
[ https://issues.apache.org/jira/browse/MAHOUT-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1363: - Resolution: Fixed Status: Resolved (was: Patch Available)
[jira] [Created] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
Dmitriy Lyubimov created MAHOUT-1365: Summary: Weighted ALS-WR iterator for Spark Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version). Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered out (such as an anomalously active user, which doesn't represent a typical user anyway). The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1#6144)
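The confidence-weighted update at the heart of the Hu-Koren-Volinsky scheme can be illustrated numerically. The following is a hedged Python/NumPy sketch of a single user-factor solve, not the Spark/Bagel implementation; for brevity the regularizer is a plain ridge term λ, whereas ALS-WR scales it by the user's rating count (λ·n_u). All data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k, lam = 5, 3, 0.1

Y = rng.normal(size=(n_items, k))            # current item-factor matrix
p_u = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # binarized preferences of one user
c_u = np.array([41.0, 1.0, 16.0, 1.0, 1.0])  # per-item confidence (c0 + extra)

# Weighted ridge solve: x_u = (Y^T C_u Y + lam I)^{-1} Y^T C_u p_u
Cu = np.diag(c_u)
A = Y.T @ Cu @ Y + lam * np.eye(k)
b = Y.T @ Cu @ p_u
x_u = np.linalg.solve(A, b)

# Sanity check: x_u minimizes the weighted objective
#   sum_i c_i (p_i - y_i^T x)^2 + lam ||x||^2,
# so its gradient should vanish at the solution.
grad = Y.T @ Cu @ (Y @ x_u - p_u) + lam * x_u
assert np.allclose(grad, 0.0)
```

One full iteration alternates this solve over all users (items fixed) and then over all items (users fixed); the c_0 decomposition matters because Y^T C_u Y can be computed as Y^T (c_0 I) Y plus a correction over only the user's observed items.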
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf Oh. the confidence matrix C is not sparse per se. but if there's a base confidence c_0 such that subtracting it from each element of C turns it into sparse matrix C', then we can use that matrix as an input (along with c_0 parameter). This is further clarified in the attachment (which is basically just a conspect of both papers for my own sake.) See attached.
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832110#comment-13832110 ] Dmitriy Lyubimov edited comment on MAHOUT-1365 at 11/26/13 12:26 AM: - Oh. One thing to mention is that the confidence matrix C is not sparse per se. but if there's a base confidence c_0 such that subtracting it from each element of C turns it into sparse matrix C', then we can use that matrix as an input (along with c_0 parameter). This is further clarified in the attachment (which is basically just a conspect of both papers for my own sake.) See attached. was (Author: dlyubimov): Oh. the confidence matrix C is not sparse per se. but if there's a base confidence c_0 such that subtracting it from each element of C turns it into sparse matrix C', then we can use that matrix as an input (along with c_0 parameter). This is further clarified in the attachment (which is basically just a conspect of both papers for my own sake.) See attached.
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.lyx
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.lyx)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf)
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf
[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: (was: distributed-als-with-confidence.pdf)
[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832148#comment-13832148 ] Dmitriy Lyubimov commented on MAHOUT-1365: -- there's obviously some stuff that needs tidying up for the sake of the public. Some stuff (like the RMSE function) looks outwardly cryptic even to me, now that some time has passed since i did this.
[jira] [Commented] (MAHOUT-1361) Online algorithm for computing accurate Quantiles using 1-D clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13830479#comment-13830479 ] Dmitriy Lyubimov commented on MAHOUT-1361: -- Ted, it's my understanding the current code works on double values (integers). Do you think it is possible to adapt it to a lexicographic ordering of unbounded values? Thank you. Online algorithm for computing accurate Quantiles using 1-D clustering -- Key: MAHOUT-1361 URL: https://issues.apache.org/jira/browse/MAHOUT-1361 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.9 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 0.9 Attachments: MAHOUT-1361.patch Implementation of Ted Dunning's paper and initial work on this subject. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for the paper. An on-line algorithm for computing approximations of rank-based statistics that allows controllable accuracy. This algorithm can also be used to compute hybrid statistics, such as trimmed means, in addition to computing arbitrary quantiles. -- This message was sent by Atlassian JIRA (v6.1#6144)
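For intuition only, here is a toy Python sketch of the 1-D clustering idea behind the issue: keep a bounded set of centroids over the stream and read quantiles off cumulative counts. This is NOT the t-digest algorithm from the linked paper (t-digest sizes clusters by quantile so tail quantiles stay accurate); the class and its nearest-gap merge rule are invented for the illustration.

```python
import random

class TinyDigest:
    """Toy bounded-memory quantile sketch via 1-D clustering (illustrative only)."""

    def __init__(self, max_centroids=50):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, count]

    def add(self, x):
        self.centroids.append([x, 1])
        self.centroids.sort(key=lambda c: c[0])
        # stay under the memory budget by merging the closest pair of centroids
        while len(self.centroids) > self.max_centroids:
            i = min(range(len(self.centroids) - 1),
                    key=lambda j: self.centroids[j + 1][0] - self.centroids[j][0])
            (m1, c1), (m2, c2) = self.centroids[i], self.centroids[i + 1]
            self.centroids[i:i + 2] = [[(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]]

    def quantile(self, q):
        # walk cumulative counts until the q-th fraction of the mass is covered
        total = sum(c for _, c in self.centroids)
        target, seen = q * total, 0.0
        for mean, count in self.centroids:
            seen += count
            if seen >= target:
                return mean
        return self.centroids[-1][0]
```

A real implementation would interpolate within centroids and bound cluster sizes by quantile position; this toy only shows why a small set of counted centroids suffices for approximate rank statistics.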
[jira] [Commented] (MAHOUT-1361) Online algorithm for computing accurate Quantiles using 1-D clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828249#comment-13828249 ] Dmitriy Lyubimov commented on MAHOUT-1361: -- Interesting. I've been using minCountSketch quantiles with fairly OK results. How does it compare in effort/precision to minCountSketch and similar sketch-like stuff? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1317) Clarify some of the messages in Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/MAHOUT-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828276#comment-13828276 ] Dmitriy Lyubimov commented on MAHOUT-1317: -- Seems useful. I think I saw inconsistent indentation here and there, but it seems it is notoriously difficult to agree on things like function-parameter indentation style, etc. Clarify some of the messages in Preconditions.checkArgument --- Key: MAHOUT-1317 URL: https://issues.apache.org/jira/browse/MAHOUT-1317 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: BFL Assignee: Sebastian Schelter Priority: Minor Fix For: 0.9 Attachments: MAHOUT-1317.patch In experimenting with things, I was getting some errors from RowSimilarityJob that, in looking at the source, I realized were a little incomplete as to what the true issue was. In this case, they were of the form: Preconditions.checkArgument(maxSimilaritiesPerRow > 0, "Incorrect maximum number of similarities per row!"); Here, it is known that the actual issue is that the parameter was zero (or negative), not just that it's incorrect, and a (trivial) change to the error message might save some folks some time... especially newbies like myself. A quick grep of the code showed a few more cases like that across the code base that would be (apparently) easy to fix and maybe save folks time when they get the relevant error. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala I started moving some things there. In particular, ALS is still not there (still haven't hashed it out with my boss), but there are some initial matrix algorithms to be picked up (even transposition can be blockified and improved). Anyone wanting to give me a hand on this? Please don't pick weighted ALS-WR so far; I still hope to finish porting it. There are more interesting questions there, like parameter validation and fitting. A common problem I have: suppose you take the implicit feedback approach. Then you reformulate it in terms of preference (P) and confidence (C) inputs. The original paper speaks of a specific scheme of forming C that includes one parameter they want to fit. The more interesting question is, what if we have more than one parameter? I.e., what if we have a bunch of user behaviors, say item search, browse, click, add2cart, and finally, acquisition? That's a whole bunch of parameters over the user's preference. Suppose we want to explore what's worth what. The natural way to do it is, again, through cross-validation. However, since there are many parameters, and there is not so much test data (we should assume we will have just a handful of cross-validation runs), various online convex search techniques like SGD or BFGS are not going to be very viable. What I was thinking: maybe we can start running parallel tries and fit the data to paraboloids (i.e. second-degree polynomial regression without interaction terms). That might be a big assumption, but it would be enough. Of course we may discover hyperbolic-paraboloid properties along some parameter axes,
in which case it would mean we got the preference wrong, so we flip the preference mapping (i.e. click = (P=1, C=0.5) would flip into click = (P=0, C=0...)) and re-validate again. This is a kind of multidimensional variation of the one-parameter second-degree polynomial fitting that Raphael referred to once. We are taking on a lot of assumptions here (parameter independence, existence of a good global maximum, etc.). Perhaps there's something better to automate that search? Thanks. -Dmitriy Spark Bindings (DRM) Key: MAHOUT-1346 URL: https://issues.apache.org/jira/browse/MAHOUT-1346 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Spark bindings for Mahout DRM. DRM DSL. Disclaimer: this will all be experimental at this point. The idea is to wrap DRM in a Spark RDD with support for some basic functionality, perhaps the humble beginning of a cost-based optimizer: (0) Spark serialization support for Vector, Matrix (1) Bagel transposition (2) slim X'X (2a) not-so-slim X'X (3) blockify() (compose an RDD containing vertical blocks of the original input) (4) read/write Mahout DRM off HDFS (5) A'B ... -- This message was sent by Atlassian JIRA (v6.1#6144)
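The "fit cross-validation results to paraboloids" idea above (second-degree polynomial regression without interaction terms) can be sketched in a few lines. This is a hedged NumPy illustration with invented names; a positive quadratic coefficient along an axis plays the role of the hyperbolic-paraboloid symptom the comment treats as "got the preference wrong, flip the mapping".

```python
import numpy as np

def fit_axis_aligned_paraboloid(params, scores):
    """Fit score ~ a + sum_i (b_i p_i + c_i p_i^2): second-degree polynomial
    regression WITHOUT interaction terms, as a cheap surrogate for
    cross-validation scores over a hyperparameter grid (illustrative sketch).

    params : (n_runs, d) parameter settings tried
    scores : (n_runs,) cross-validation score for each run
    Returns (a, b, c) with b, c each of length d.
    """
    n, d = params.shape
    # design matrix: [1, p_1..p_d, p_1^2..p_d^2] -- no cross terms
    Phi = np.hstack([np.ones((n, 1)), params, params**2])
    coef, *_ = np.linalg.lstsq(Phi, scores, rcond=None)
    return coef[0], coef[1:1 + d], coef[1 + d:]

def argmax_per_axis(b, c):
    """Vertex p_i = -b_i / (2 c_i) per axis. A non-negative c_i means the fit
    is not concave along that axis (the hyperbolic-paraboloid case), so no
    interior maximum exists there and the mapping is a flip candidate."""
    return [(-bi / (2 * ci)) if ci < 0 else None for bi, ci in zip(b, c)]
```

With a handful of parallel cross-validation tries as `params`/`scores`, the per-axis vertices give a rough location of the global maximum under the stated independence and concavity assumptions.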
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820483#comment-13820483 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- P.S. I am kind of dubious a step-recorded search would be of sufficient efficiency either. First, we should not assume we are running on a nicely convex landscape. Second, I assume a step-recorded search may take fairly long. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820475#comment-13820475 ] Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:50 PM: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala I started moving some things there. In particular, ALS is still not there (still haven't hashed it out with my boss), but there are some initial matrix algorithms to be picked up (even transposition can be blockified and improved). Anyone wanting to give me a hand on this? Please don't pick weighted ALS-WR so far; I still hope to finish porting it. There are more interesting questions there, like parameter validation and fitting. A common problem I have: suppose you take the implicit feedback approach. Then you reformulate it in terms of preference (P) and confidence (C) inputs. The original paper speaks of a specific scheme of forming C that includes one parameter they want to fit. The more interesting question is, what if we have more than one parameter? I.e., what if we have a bunch of user behaviors, say item search, browse, click, add2cart, and finally, acquisition? That's a whole bunch of parameters to form confidence of the user's preference. I.e., it is reasonable to assume that, since every transaction is preceded by add2cart, add2cart signifies a positive preference in general (we are just far less confident about it); then again, an abandoned cart may also signify a negative preference, or nothing at all. Anyway, suppose we want to explore what's worth what. The natural way to do it is, again, through cross-validation. Posing such a problem presents a whole new look at Big Data ML problems: now we are using distributed processing not just because the input might be so big, but also because we have a lot of parameter-space exploration to do (even if the one-iteration problem is not so big), and thus produce more interesting analytical results. However, since there are many parameters, the task becomes fairly more interesting. Since there is not so much test data (we should assume we will have just a handful of cross-validation runs), various online convex search techniques like SGD or BFGS are not going to be very viable. What I was thinking: maybe we can start running parallel tries and fit the data to paraboloids (i.e. second-degree polynomial regression without interaction terms). That might be a big assumption, but it would be enough to get a general sense of where the global maximum may be, even on inputs of a fairly small size. Of course we may discover hyperbolic-paraboloid properties along some parameter axes, in which case it would mean we got the preference wrong, so we flip the preference mapping (i.e. click = (P=1, C=0.5) would flip into click = (P=0, C=0...)) and re-validate again. This is a kind of multidimensional variation of the one-parameter second-degree polynomial fitting that Raphael referred to once. We are taking on a lot of assumptions here (parameter independence, existence of a good global maximum, etc.). Perhaps there's something better to automate that search? Thanks. -Dmitriy
[jira] [Updated] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)
[ https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1297: - Status: Patch Available (was: Open) New module for linear algebra scala DSL (in-core operators support only to start with) -- Key: MAHOUT-1297 URL: https://issues.apache.org/jira/browse/MAHOUT-1297 Project: Mahout Issue Type: New Feature Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 0.9 See initial set of in-core R-like operations here http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html. A separate DSL for matlab-like syntax is being developed. The differences here are about replacing R-like %*% with * and finding another way to express elementwise * and /. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820705#comment-13820705 ] Dmitriy Lyubimov commented on MAHOUT-1346: -- Can that context be part of Mahout? Or would that be way off? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (MAHOUT-1299) Add optimized versions of timesLeft(), timesRight() to SparseRow~,SparseColMatrices and binary times() operation in general
[ https://issues.apache.org/jira/browse/MAHOUT-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov resolved MAHOUT-1299. -- Resolution: Won't Fix Probably needs to be considered in a broader light of matrix related optimizations. Add optimized versions of timesLeft(), timesRight() to SparseRow~,SparseColMatrices and binary times() operation in general --- Key: MAHOUT-1299 URL: https://issues.apache.org/jira/browse/MAHOUT-1299 Project: Mahout Issue Type: New Feature Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Minor Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (MAHOUT-1297) New module for linear algebra scala DSL (in-core operators support only to start with)
[ https://issues.apache.org/jira/browse/MAHOUT-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814337#comment-13814337 ] Dmitriy Lyubimov edited comment on MAHOUT-1297 at 11/5/13 10:31 PM: Separating this into its own branch from the scala head work. It is now tracked in https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1297 I will be committing this to the trunk within a week. -- This message was sent by Atlassian JIRA (v6.1#6144)