[ 
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615324#action_12615324
 ] 

Steven Handerson commented on MAHOUT-60:
----------------------------------------

I'll add one more thing so you see where I'm coming from.

Map-reduce is basically a hash join (a database term),
and hash joins are actually slow in a database in my experience
(because of the space allocation required, but also because they don't use an index).
Map-reduce makes hash joins relatively fast
by spreading the work across multiple processors and the network.
You could do other kinds of joins in map-reduce: use indexes, use ordered data,
or use things like "partitions" in database parlance.
Or you could just use a database.
The problem with databases is that the round-trip to the database
can make your application much slower than necessary
when it issues lots of individual queries in sequence (random access,
even when each lookup uses an index).
Databases are good at streaming out the results of a join (which uses indexes).
Of course, some applications (web-based ones, for example) accept being slower in
aggregate in order to answer individual queries relatively quickly
(faster round-trip time).
There may be similar trade-offs in map-reduce,
but you can see there's a kind of connection between what
databases do and what map-reduce does: join data sources on some field or computed
value.
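
To make the comparison concrete, here is a minimal single-process sketch of a
hash join written in the shape of a map-reduce job. It is only an illustration,
not Hadoop or Mahout code: the class name HashJoinSketch, the join method, and
the {key, value} record layout are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Hadoop or Mahout.
public class HashJoinSketch {

    /** Joins two lists of {key, value} records on the key. */
    static List<String> join(List<String[]> left, List<String[]> right) {
        // "Map" and "shuffle": bucket every left-side record by its join key.
        // In a real map-reduce job the framework does this grouping across machines.
        Map<String, List<String>> buckets = new HashMap<>();
        for (String[] rec : left) {
            buckets.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        }

        // "Reduce": for each right-side record, look up the matching bucket
        // and emit one joined row per pair of matching values.
        List<String> out = new ArrayList<>();
        for (String[] rec : right) {
            List<String> matches = buckets.getOrDefault(rec[0], Collections.<String>emptyList());
            for (String leftValue : matches) {
                out.add(rec[0] + "\t" + leftValue + "\t" + rec[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = new ArrayList<>();
        left.add(new String[] {"a", "1"});
        left.add(new String[] {"b", "2"});
        left.add(new String[] {"a", "3"});

        List<String[]> right = new ArrayList<>();
        right.add(new String[] {"a", "x"});
        right.add(new String[] {"c", "y"});

        for (String row : join(left, right)) {
            System.out.println(row);
        }
    }
}
```

The whole hash table lives in one heap here, which is the space-allocation cost
mentioned above; map-reduce spreads those buckets over many machines instead.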

Hmm, also, it's not that the data is too large (yet).
My model is about 1.5 Gig, and for now I'm trying to run the code
in a single process rather than in Hadoop. But maybe the model
size isn't the maximum size the process needs (-Xmx4g didn't work), so I'm trying larger and
larger -Xmx values
(I do have 64-bit Java available; right now I'm trying 8 Gig).
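
One way to check whether the -Xmx setting is actually taking effect, and how
much heap the loaded model really occupies, is to ask the JVM directly. This is
a small sketch using the standard java.lang.Runtime calls; the class name
HeapCheck and its helper methods are made up for this example. A model held in
Java objects can plausibly need noticeably more heap than its size on disk.

```java
// Hypothetical helper; only the java.lang.Runtime calls are standard API.
public class HeapCheck {

    private static final long MIB = 1024L * 1024L;

    /** Upper bound the JVM will try to use for the heap (reflects -Xmx). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use: allocated heap minus the free part of it. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Running it as, say, "java -Xmx8g HeapCheck" shows the limit the process really
got; calling usedHeapMiB() before and after loading the model gives a rough
idea of its in-memory footprint.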



> Complementary Naive Bayes
> -------------------------
>
>                 Key: MAHOUT-60
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
>             Project: Mahout
>          Issue Type: Sub-task
>          Components: Classification
>            Reporter: Robin Anil
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, 
> twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper 
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
