[jira] Resolved: (MAHOUT-393) Distributed item similarity functions

2010-05-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-393. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Done, I committed with only

[jira] Commented: (MAHOUT-393) Distributed item similarity functions

2010-05-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12865599#action_12865599 ] Sean Owen commented on MAHOUT-393: -- Unless I missed something, and the unit tests don't

Re: meaning of isSequentialAccess?

2010-05-08 Thread Sean Owen
It returns 'true' since it can be iterated in order efficiently -- really it also implies that the iterators iterate in order. For purposes of serialization, that bit of information is unused since it is encoded with a dense representation. I can quantify the meaning of these flags more in the

[jira] Commented: (MAHOUT-392) Test cases for logGamma, Distribution.normal and Distribution.beta, fix for Distribution.normal

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12865424#action_12865424 ] Sean Owen commented on MAHOUT-392: -- Tiny comments: - You could inline those b0, b1, etc

[jira] Updated: (MAHOUT-376) Implement Map-reduce version of stochastic SVD

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-376: - Issue Type: Improvement (was: Bug) Assignee: Ted Dunning Fix Version/s: 0.4

[jira] Commented: (MAHOUT-392) Test cases for logGamma, Distribution.normal and Distribution.beta, fix for Distribution.normal

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12865429#action_12865429 ] Sean Owen commented on MAHOUT-392: -- I won't argue about it, since it's tiny

Re: Build failed in Hudson: Mahout Trunk #615

2010-05-07 Thread Sean Owen
Something remains screwed up here and it's ultimately my fault. My utils/ is not building. I am on it though, will address this pronto. On Fri, May 7, 2010 at 10:43 AM, Apache Hudson Server hud...@hudson.zones.apache.org wrote: See

[jira] Resolved: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-302. -- Resolution: Fixed Change tests to use temp directories instead of output, testdata

[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864812#action_12864812 ] Sean Owen commented on MAHOUT-389: -- I'm happy to commit this since it looks fine, suits

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864851#action_12864851 ] Sean Owen commented on MAHOUT-391: -- Hmm, I got similar results from a crude test

[jira] Resolved: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-389. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Committed patch #3 with some

Re: BUILD FAILURE

2010-05-06 Thread Sean Owen
Blast, I'll take a look. I knew it was too easy. I was not seeing such failures but from the stack trace maybe I can figure out what's up. On Thu, May 6, 2010 at 8:45 PM, Tamas Jambor jambo...@googlemail.com wrote: just updated the SVN to get Sean's implementation, but now it fails to build the

Re: [jira] Commented: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-05 Thread Sean Owen
I actually didn't intend to introduce variables used only once. I could have made an error but usually it was because some expression needed to be a Path, and was used at least twice. Sometimes I might have done it for parallel style consistency across several methods. So at least we agree there.

[jira] Updated: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-391: - Attachment: MAHOUT-391.patch Make vector more space efficient with variable-length encoding, et al

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864303#action_12864303 ] Sean Owen commented on MAHOUT-391: -- Bleh, that's not much of a difference at all

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864308#action_12864308 ] Sean Owen commented on MAHOUT-391: -- Oh I get it. My other outstanding patch for MAHOUT-302

[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-04 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-389: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) I should add you could emulate

[jira] Updated: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-04 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-302: - Attachment: MAHOUT-302.patch Holy moly this took a lot of work. What I attempt to do is centralize all

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
Not surprising indeed, that won't scale at some point. What is the stage that needs everything in memory? Maybe describing that would help imagine solutions. The typical reason for this, in my experience back in the day, was needing to look up data infrequently in a key-value way. Side-loading off

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
How about this for the first phase? I think you can imagine how the rest goes, more later... Mapper 1A. map() input: one canopy. map() output: canopy ID -> canopy. Mapper 1B. (Has in memory all canopy IDs, read at startup.) map() input: one point. map() output: for each canopy ID, canopy ID -> point

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
As I said, you can imagine how the rest goes -- this is a taste of how you might distribute the key piece of the computation you asked about, and certainly does that correctly. It is not the whole algorithm of course -- up to you. On Sun, May 2, 2010 at 1:52 PM, Robin Anil robin.a...@gmail.com

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
What's the specific improvement idea? Size and speed improvements would be good. The Hadoop serialization mechanism is already pretty low-level, dealing directly in bytes (as opposed to fancier stuff like Avro). It's if anything fast and lean but quite manual. The latest Writable updates squeezed

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That's the one! I actually didn't know this was how PBs did the variable-length encoding, but it makes sense; it's about the most efficient thing I can imagine. Values up to 16,383 fit in two bytes, which is less than a 4-byte int and the 3 bytes or so it would take the other scheme. Could add up over
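The two-byte claim above can be checked with a minimal sketch of base-128 variable-length encoding, the general technique used by Protocol Buffers varints. The class and method names here are hypothetical; this is not Mahout's or Protocol Buffers' actual API, only an illustration under the assumption of 7 payload bits per byte with the high bit as a continuation flag.

```java
// Sketch of base-128 variable-length encoding for 32-bit ints.
// All names are hypothetical, chosen only for this illustration.
final class VarIntSketch {

  /** Encodes value with 7 payload bits per byte; a set high bit means "more bytes follow". */
  static byte[] encode(int value) {
    byte[] buffer = new byte[5];            // a 32-bit int needs at most 5 bytes
    int length = 0;
    while ((value & ~0x7F) != 0) {          // more than 7 significant bits remain
      buffer[length++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;                         // unsigned shift, so negatives terminate
    }
    buffer[length++] = (byte) value;        // final byte, high bit clear
    byte[] result = new byte[length];
    System.arraycopy(buffer, 0, result, 0, length);
    return result;
  }

  /** Decodes bytes produced by encode(). */
  static int decode(byte[] bytes) {
    int value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value;
  }

  public static void main(String[] args) {
    for (int v : new int[] {0, 127, 128, 16383, 16384}) {
      System.out.println(v + " takes " + encode(v).length + " byte(s)");
    }
  }
}
```

Two bytes carry 14 payload bits, which is why 16,383 is the largest value that fits in two bytes. A negative int has its top bit set, so under this scheme it always takes the full 5 bytes.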

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That much is expected, right? Since it stores a 4-byte index along with each 8-byte double value, the sparse representation is bigger when over 8/(4+8), or about 67%, of the values are non-default / non-zero. But variable-encoding the index value trims a byte or more per element depending on your
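The break-even arithmetic above can be worked through in a few lines. This is a rough sketch that assumes a dense vector costs 8 bytes per element, a sparse vector costs (index bytes + 8) per non-default element, and any fixed header is ignored; the names are hypothetical and not taken from Mahout.

```java
// Back-of-the-envelope size comparison of dense vs. sparse vector encodings.
// Assumes 8-byte doubles and ignores headers; all names are hypothetical.
final class SparseDenseBreakEven {

  /** Bytes for a dense encoding: one double per element. */
  static long denseBytes(int cardinality) {
    return 8L * cardinality;
  }

  /** Bytes for a sparse encoding: one index plus one double per non-default element. */
  static long sparseBytes(int nonDefault, int indexBytes) {
    return (long) nonDefault * (indexBytes + 8);
  }

  /** Fraction of non-default elements above which sparse is larger than dense. */
  static double breakEvenFill(int indexBytes) {
    return 8.0 / (indexBytes + 8);
  }

  public static void main(String[] args) {
    System.out.println("4-byte index break-even: " + breakEvenFill(4));
    System.out.println("2-byte index break-even: " + breakEvenFill(2));
  }
}
```

With a fixed 4-byte index the break-even fill is 8/12. If variable-length encoding brings the typical index down to 2 bytes, it rises to 8/10, so the sparse form stays smaller for denser data.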

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
It's the same approach to variable-length encoding, yes. Zig-zag is a trick to make negative numbers compatible with this encoding. Because two's-complement negative numbers start with a bunch of 1s, their representation is terrible under this variable-length encoding -- always of maximum length.
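A minimal sketch of the zig-zag trick follows, assuming the usual mapping 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so that values of small magnitude become small unsigned values before variable-length encoding. The helper `varIntLength` is a hypothetical stand-in for a base-128 encoder and only counts bytes; none of these names come from Mahout.

```java
// Zig-zag mapping for 32-bit ints, plus a byte counter for base-128 encoding.
// All names are hypothetical, chosen only for this illustration.
final class ZigZagSketch {

  /** Interleaves negatives and positives: 0->0, -1->1, 1->2, -2->3, ... */
  static int encode(int n) {
    return (n << 1) ^ (n >> 31);            // arithmetic shift spreads the sign bit
  }

  /** Inverse of encode(). */
  static int decode(int z) {
    return (z >>> 1) ^ -(z & 1);
  }

  /** Number of bytes a base-128 variable-length encoding would need. */
  static int varIntLength(int value) {
    int count = 1;
    while ((value & ~0x7F) != 0) {
      count++;
      value >>>= 7;
    }
    return count;
  }

  public static void main(String[] args) {
    int n = -1;
    System.out.println("raw:     " + varIntLength(n) + " byte(s)");
    System.out.println("zig-zag: " + varIntLength(encode(n)) + " byte(s)");
  }
}
```

Without the mapping, -1 is all 1 bits and needs the maximum 5 bytes; after the mapping it becomes 1 and fits in a single byte.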

Re: Unsubscribe to MAHOUT

2010-04-30 Thread Sean Owen
http://www.apache.org/foundation/mailinglists.html To get off a list, send a message to list-unsubscr...@apache.org So you need to mail mahout-dev-unsubscr...@apache.org. There is nobody who can manually answer your request.

Re: Similarity Tests Failing since 939074?

2010-04-29 Thread Sean Owen
Sorry, that's essentially an elaborate typo, which made something that Can't Possibly Change Behavior, Change Behavior. On Thu, Apr 29, 2010 at 4:12 AM, Jeff Eastman j...@windwardsolutions.com wrote: Failed tests:

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
(I can easily make the fix and add a test, but is the right thing to return 0, or instead proceed in the method with the value -sqrt(-llr) when llr is negative?) On Thu, Apr 29, 2010 at 12:44 PM, Shashikant Kore shashik...@gmail.com wrote: Root LLR calculation has a minor bug. When LLR score is

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
What about Shashikant's example? Unless my brain's not in gear, that seems like a legit example, but does indeed produce a negative LLR.

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
Ah yeah that's it. So... is the better change to cap the result of logLikelihoodRatio() at 0.0? On Thu, Apr 29, 2010 at 5:11 PM, Ted Dunning ted.dunn...@gmail.com wrote: I suspect round-off error.  In R I get this for the raw LLR: llr(matrix(c(6,7567, 1924, 2426487), nrow=2)) [1]
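The capping idea discussed in this thread can be sketched as follows. This computes the log-likelihood ratio for a 2x2 contingency table in the entropy form 2 * (rowEntropy + columnEntropy - matrixEntropy) and clamps the result at 0.0, on the assumption that a negative value can only come from floating-point round-off. It is an illustration, not the Mahout code under discussion; the class and method names are hypothetical.

```java
// Log-likelihood ratio for a 2x2 table, capped at 0.0 to absorb round-off.
// A sketch only; names are hypothetical and this is not Mahout's implementation.
final class LlrSketch {

  /** x * ln(x), with the usual convention that 0 * ln(0) = 0. */
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  /** Unnormalized entropy: sum*ln(sum) minus the sum of each count*ln(count). */
  static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      sum += c;
      parts += xLogX(c);
    }
    return xLogX(sum) - parts;
  }

  /** LLR of the table [[k11, k12], [k21, k22]], never below 0.0. */
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    double llr = 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    return Math.max(0.0, llr);              // the true value is never negative
  }

  public static void main(String[] args) {
    // Counts from the R call quoted above (column-major order in R).
    System.out.println(logLikelihoodRatio(6, 1924, 7567, 2426487));
  }
}
```

The quoted table is very close to independent, so the three entropy terms nearly cancel and the difference is dominated by round-off, which is where a small negative result can appear without the cap.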

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
optimization. Shall I commit something like that, but also cap the LLR at 0 anyhow? That fixes the original issue for sure. On Thu, Apr 29, 2010 at 5:28 PM, Sean Owen sro...@gmail.com wrote: Ah yeah that's it. So... is the better change to cap the result of logLikelihoodRatio() at 0.0? On Thu

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
You mean sum * Math.log(sum)? That's nice, I'll go with that. javac definitely isn't allowed to do that kind of transformation -- it actually can't do much of anything. ProGuard might -- it's actually a dynamite byte code optimizer and I've been itching to get it re-integrated into the build for

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
I could sure be wrong about this (or perhaps out of date). It makes sense in theory. But I can't find it in the JLS and in the bytecode I still see it calling Math.log(), which calls StrictMath.log(), FWIW. I would actually believe a JIT would do something with this. But I still find myself always

Re: Intermittant Test Failure: testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)

2010-04-29 Thread Sean Owen
I had taken on MAHOUT-302 which is basically about overhauling how temp data is handled for tests. I think we can indeed handle it more cleanly and in a way such that collisions never happen. I'm still in the middle of it. On Thu, Apr 29, 2010 at 11:38 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
scale here as the values are fixed, so I thought that a centering of the data would not be necessary. Regards, Sebastian Sean Owen (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
question. Sean On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen sro...@gmail.com wrote: Actually scratch that patch I sent over. I see the trick now that makes the existing approach quite good. I think I can make a version that preserves that trick and still streamlines the processing. I

[jira] Commented: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861276#action_12861276 ] Sean Owen commented on MAHOUT-297: -- If I may nit-pick: new RandomAccessSparseVector

[jira] Resolved: (MAHOUT-386) org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob breaks when no usersFile is supplied

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-386. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Yeah looks like the old

[jira] Resolved: (MAHOUT-329) Implement some recommendation ideas used by the Netflix top teams to boost the recommenders package

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-329. -- Assignee: Robin Anil Resolution: Later Shelving this as GSoC projects are set and if something

[jira] Resolved: (MAHOUT-354) make the output of RecommenderJob more readable

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-354. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed make the output

[jira] Resolved: (MAHOUT-359) org.apache.mahout.cf.taste.hadoop.item.RecommenderJob for Boolean recommendation

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-359. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed

[jira] Assigned: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned MAHOUT-302: Assignee: Sean Owen Change tests to use temp directories instead of output, testdata

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861381#action_12861381 ] Sean Owen commented on MAHOUT-305: -- I see, fair enough. Even for this simplistic initial

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-387: - Status: Resolved (was: Patch Available) Assignee: Sean Owen Fix Version/s: 0.3

[jira] Resolved: (MAHOUT-385) Unify Vector Writables

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-385. -- Assignee: Sean Owen Resolution: Fixed I think this is uncontroversial enough to commit. It means

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860914#action_12860914 ] Sean Owen commented on MAHOUT-305: -- OK, I think I get the (item1,item2) -> (item2,count

[jira] Created: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)
Unify Vector Writables -- Key: MAHOUT-385 URL: https://issues.apache.org/jira/browse/MAHOUT-385 Project: Mahout Issue Type: Improvement Components: Math Affects Versions: 0.3 Reporter: Sean Owen

[jira] Updated: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-385: - Attachment: MAHOUT-385.patch Unify Vector Writables -- Key

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861087#action_12861087 ] Sean Owen commented on MAHOUT-371: -- Looks like this was accepted to GSoC, nice. Let

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861095#action_12861095 ] Sean Owen commented on MAHOUT-305: -- I'm about to commit another pass at this since it's

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861144#action_12861144 ] Sean Owen commented on MAHOUT-305: -- Ted says he likes LLR, and doesn't like throwing out

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861154#action_12861154 ] Sean Owen commented on MAHOUT-371: -- Your schedule maps it out well. In the next month, get

Re: Clean checkout Test broken

2010-04-25 Thread Sean Owen
I'm not seeing it in my client, hmm. While I'd tend to guess my change broke it, I don't see the direct link... this code writes TreeID -> MapredOutput in its test and then tries to read exactly that. I don't yet see how the SequenceFile.Reader expects anything related to VectorWritable nor why it

Re: How to tackle Vector-NamedVector and back conversion

2010-04-25 Thread Sean Owen
PS let's see a patch to keep discussing, I'm seeing ideas on lots of good topics here and want to take the opportunity to strike while the iron is hot and continue overhauling this. But things like making everything a named vector is sort of stepping backwards to where we just agreed to move from

Re: How to tackle Vector-NamedVector and back conversion

2010-04-25 Thread Sean Owen
Yes, I think if we can convince ourselves that there won't be that many different possibilities for representing a vector, then a simple boolean might unify everything. This approach doesn't 'scale' but I don't know there are other representations we must have. The issue of named vectors is

Re: How to tackle Vector-NamedVector and back conversion

2010-04-25 Thread Sean Owen
I agree that it'd be good to kind of finalize the Vector stuff. I don't think it's reasonable for users to expect data output by 0.3 to be compatible with 0.4 though, so wouldn't worry about that. I think we're on the verge of wanting a proper serialization system like Avro for vectors here --

Re: How to tackle Vector-NamedVector and back conversion

2010-04-25 Thread Sean Owen
Where, though? I just deleted all the methods to try it and every test passes. On Sun, Apr 25, 2010 at 7:51 PM, Robin Anil robin.a...@gmail.com wrote: It's used in clustering to generate clusterid -> point id. Also to be used in classification (by end of this summer) to keep class labels.

Re: How to tackle Vector-NamedVector and back conversion

2010-04-24 Thread Sean Owen
NamedVectorWritable already extends VectorWritable, though honestly I don't like that and kept it to minimize disruption. Serialized vector formats aren't exactly polymorphic. I can't read an X vector with the code intended to deserialize something that extends X. So, really the Writables

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-23 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860223#action_12860223 ] Sean Owen commented on MAHOUT-305: -- And now more thoughts: Yes all the code is checked

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-23 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860284#action_12860284 ] Sean Owen commented on MAHOUT-305: -- What do you mean about the secondary sort

Re: Mahout In Action

2010-04-23 Thread Sean Owen
I think the goal is that the book is completely up to date with the code as of the day we have to send it to press. That will be right about 0.4, which I assume happens at the end of the summer after GSoC is digested. I just submitted changes today to match my changes this morning. If any of you

[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

2010-04-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859703#action_12859703 ] Sean Owen commented on MAHOUT-384: -- Let's also think about where it fits into the project

[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

2010-04-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859731#action_12859731 ] Sean Owen commented on MAHOUT-384: -- What do others think of 'outlier' -- is this a concept

[jira] Resolved: (MAHOUT-316) CardinalityException and IndexException should remove the default constructor, and always construct with arguments saying what the error was

2010-04-21 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-316. -- Assignee: Sean Owen Resolution: Fixed Good idea, I made this happen. CardinalityException

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-21 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Status: Resolved (was: Patch Available) Fix Version/s: 0.4 (was: 0.3

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-20 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12858807#action_12858807 ] Sean Owen commented on MAHOUT-379: -- I'd like to commit this patch as it addresses a couple

Re: SnowballAnalyzer

2010-04-20 Thread Sean Owen
Yes, you can discover the available constructors and their parameters. But I don't think that it makes sense in general to just pass null / 0 to parameters or guess at dummy values. It'd be as likely to cause even subtler errors. I think what you have to do here is extend SnowballAnalyzer, where

[jira] Resolved: (MAHOUT-356) ClassNotFoundException: org.apache.mahout.math.function.IntDoubleProcedure

2010-04-19 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-356. -- Assignee: Sean Owen Resolution: Cannot Reproduce ClassNotFoundException

Re: AbstractVector.minus(Vector)

2010-04-19 Thread Sean Owen
On Mon, Apr 19, 2010 at 5:33 PM, Jake Mannix jake.man...@gmail.com wrote: result.times(-1.0) with result.assign(Functions.negate) Cool, good one. The efficiency points are twofold: number of nonzero elements, and the impl: you don't want to iterate over a vector of any type while

[jira] Resolved: (MAHOUT-381) org.apache.mahout.cf.taste.hadoop.item is more misleading

2010-04-19 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-381. -- Fix Version/s: 0.3 Resolution: Not A Problem While I think the package name is fine, and do

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Yeah why don't I have a crack at this. The change as it stands is already too big for what it is (though I believe they're good changes.) Then we look at more changes, and sounds like there are several ideas for streamlining vectors, which is a great thing to think about at this early stage. On

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I mean wrapping Vector in a NamedVector. It seems like a good step forward, even as I agree that it probably isn't even needed. Since I'm the one ripping up the floor-boards here to do some plumbing, seems like it should fall on me to put things back into a similar working state with NamedVector.

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On keeping 'name': sure, I don't mind being conservative. I would like to keep name in the form of NamedVector. As it happens, name is actually barely used right now -- if you can wade through the patch you can see there are just a few instances, the ones in mind now. Making NamedVector is, it

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
: What would be the Writable hierarchy with this NamedVector proposal? On Apr 18, 2010 11:05 AM, Sean Owen sro...@gmail.com wrote: On keeping 'name': sure, I ... On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix jake.man...@gmail.com wrote: Ok this is a good con...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I guess I'm suggesting the polymorphism pain need not be very painful. (No doubt it's all nicer with Avro, but that much can be separate.) VectorWritable is the one Writable used in all cases. We have *Writable decorators, corresponding to *Vector, in a similar hierarchy. We have NamedVector

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix jake.man...@gmail.com wrote: VectorWritable currently is a proper decorator, right?  It doesn't even implement Vector at all. Yeah, the other *Writable classes should be as well. NamedVector should both be a Vector and decorate a Vector too. Its

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Status: Patch Available (was: Open) Assignee: Sean Owen Here's another patch, which builds

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Attachment: MAHOUT-379.patch Hmm my second patch didn't attach SequentialAccessSparseVector.equals

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
Yeah, that's what I changed -- now the key is point.asFormatString(). And it almost works, except the serialized state in this format string includes lengthSquared, and a mismatch there before/after makes this fail. It may fail more significantly in the real world versus tests and we should be

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
At the moment I'm already overreaching on the way to fix MAHOUT-379 with this patch, as I've expanded to address some mildly related issues (equals, iterators). So I personally am not trying to change serialization formats in MAHOUT-379 / my current patch, no. The issue uncovered by removing name

[jira] Resolved: (MAHOUT-380) IllegalArgumentException from AbstractJDBCDataModel constructor which is extended by AbstractBooleanPrefJDBCDataModel

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-380. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Oops! fixed

Re: c# porting of mahout

2010-04-16 Thread Sean Owen
None that I'm aware of, and I might suggest it would be hard at the moment for several reasons: - The code is changing very rapidly - The code depends heavily on Java libraries, notably Hadoop, which makes porting difficult On Fri, Apr 16, 2010 at 10:31 AM, pedram salehpoor

Re: c# porting of mahout

2010-04-16 Thread Sean Owen
Lots of both -- I imagine it will be changing rapidly for the rest of the year. On Fri, Apr 16, 2010 at 10:48 AM, pedram salehpoor pedram.salehp...@gmail.com wrote: For Hadoop I was thinking about making them assemblies usable for c#. But ever changing code is a problem. Do currently new

Re: Having some trouble with SequentialAccessSparseVector.DenseVector

2010-04-16 Thread Sean Owen
Actually it does all work. I wrote some tests that verify it. I think my first question about index and cur works out because both are set to 0 -- and 0 is correct as the starting value of an array offset and index. And in the other case I believe it's intended that the two values are the current

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Attachment: MAHOUT-379.patch This is a pre-patch, per discussion on the mailing list. Is this too much

Re: mahout/solr integration

2010-04-16 Thread Sean Owen
Clojure isn't my cup of tea but that's not important. It's an interesting question, how much belongs under the Mahout tent? There's a tradeoff between excluding useful extensions to the project on the one hand, and becoming a spare parts bin of code of varying levels of maturity and support. I'm

Re: mahout/solr integration

2010-04-16 Thread Sean Owen
On Fri, Apr 16, 2010 at 7:39 PM, Jake Mannix jake.man...@gmail.com wrote: I will start playing around with Anthony's github-based stuff, and see where a patch can be made.  The question is where it would go?  It's a fully functioning project already over on its own. I suppose that's my

Having some trouble with SequentialAccessSparseVector.DenseVector

2010-04-15 Thread Sean Owen
Along the way to a patch for MAHOUT-379, I'm having some trouble figuring out SequentialAccessSparseVector.DenseVector. I think it can be simplified, but unless I'm misunderstanding there are several bugs here. I'd like to find my mistake or else simplify/fix this along the way. get() uses offset

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856812#action_12856812 ] Sean Owen commented on MAHOUT-379: -- Yeah let's take some time to get this right

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen
On Wed, Apr 14, 2010 at 3:28 PM, Jake Mannix jake.man...@gmail.com wrote: What is the transitivity problem?  If (a instanceof VClassA), (b instanceof VClassB) and (c instanceof VClassC), if all three equals() methods compare the same things (ie values, names, not implementation), then

Re: VOTE: take 2: mahout-collections-1.0

2010-04-12 Thread Sean Owen
Sure +1 for same reason. On Mon, Apr 12, 2010 at 4:50 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 (on trust, really)

[jira] Commented: (MAHOUT-377) Clean up javadoc errors in collections

2010-04-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855896#action_12855896 ] Sean Owen commented on MAHOUT-377: -- I committed a fix for these immediate issues. Clean

Re: Digest for google-summer-of-code-mentors-l...@googlegroups.com - 25 Messages in 2 Topics

2010-04-10 Thread Sean Owen
+mahout-dev I think at this point I could be misremembering (there's that word again Grant) but are we not supposed to sign on to mentor more than 1 person without having talked it over on code-awards? Seems like a lot of grumbling about gaming the system and such from past years, which seems

[jira] Resolved: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-372. -- Resolution: Fixed Fix Version/s: 0.4 Assignee: Sean Owen Yes, sure there's

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

2010-04-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854992#action_12854992 ] Sean Owen commented on MAHOUT-369: -- If the patch amounts to making that loop

Re: VOTE: release mahout-collections-codegen 1.0

2010-04-07 Thread Sean Owen
+1

[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-07 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854371#action_12854371 ] Sean Owen commented on MAHOUT-358: -- Yes, I know how the method works. That's as I expected

[jira] Commented: (MAHOUT-366) Error: Java heap space

2010-04-07 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854426#action_12854426 ] Sean Owen commented on MAHOUT-366: -- Try HADOOP_HEAPSIZE=2000 in order to increase

[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-07 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854432#action_12854432 ] Sean Owen commented on MAHOUT-358: -- The output is helpful. I am more confused, we may

[jira] Resolved: (MAHOUT-366) Error: Java heap space

2010-04-07 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-366. -- Resolution: Not A Problem Assignee: Sean Owen Error: Java heap space
