[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Jake Mannix (JIRA) Tue, 09 Feb 2010 02:27:52 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831396#action_12831396
 ]


Jake Mannix commented on MAHOUT-237:
------------------------------------

{code}
    RandomAccessSparseVector vector =
        new RandomAccessSparseVector(key.toString(), Integer.MAX_VALUE,
            valueString.length() / 5); // guess at initial size
{code}

This whole Integer.MAX_VALUE thing is killing me whenever I try to move back 
and forth between sparse and dense vectors (which is necessary for performance 
in the DistributedLanczos I'm working on).  Ugh.  

We really need to have a vector flag which says "I'm infinite dimensional, I 
just return 0 whenever you ask me about dimensions I don't know about", so we 
don't have to have this hack of Integer.MAX_VALUE as the dimension.  I've 
suggested that hack to people myself, but it's such a baaaaad hack.
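
To make the "infinite dimensional" flag concrete, here is a minimal sketch of what such a vector could look like. It is not Mahout code: the class name UnboundedSparseVector, the UNBOUNDED sentinel, and the methods isUnbounded(), minimalDenseSize(), and toDense() are all hypothetical names invented for illustration, and the HashMap backing store is only an assumption that keeps the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; this class does not exist in Mahout. */
final class UnboundedSparseVector {
    /** Assumed sentinel meaning "no fixed dimension", replacing Integer.MAX_VALUE. */
    static final int UNBOUNDED = -1;

    private final Map<Integer, Double> values = new HashMap<>();
    private final int cardinality;

    UnboundedSparseVector() {
        this(UNBOUNDED);
    }

    UnboundedSparseVector(int cardinality) {
        if (cardinality < 0 && cardinality != UNBOUNDED) {
            throw new IllegalArgumentException("cardinality must be >= 0 or UNBOUNDED");
        }
        this.cardinality = cardinality;
    }

    boolean isUnbounded() {
        return cardinality == UNBOUNDED;
    }

    /** Returns 0 for any index that was never set, however large it is. */
    double get(int index) {
        checkIndex(index);
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index); // keep the map sparse
        } else {
            values.put(index, value);
        }
    }

    /** Smallest dense length that still holds every stored entry. */
    int minimalDenseSize() {
        int max = -1;
        for (int i : values.keySet()) {
            max = Math.max(max, i);
        }
        return Math.addExact(max, 1); // throws instead of silently overflowing
    }

    /** Sparse-to-dense conversion that never allocates Integer.MAX_VALUE slots. */
    double[] toDense() {
        double[] dense = new double[isUnbounded() ? minimalDenseSize() : cardinality];
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }

    private void checkIndex(int index) {
        if (index < 0 || (!isUnbounded() && index >= cardinality)) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        UnboundedSparseVector v = new UnboundedSparseVector();
        v.set(3, 2.5);
        System.out.println(v.get(1_000_000));
        System.out.println(v.toDense().length);
    }
}
```

The point of the sketch is that a dense copy can size itself from the entries actually stored, so moving between sparse and dense forms no longer depends on a fake dimension.
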

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, 
> SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which maps features into Vectors 
> of fixed size and sums them up to get the document Vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce. 
> The input documents are in a SequenceFile<Text,Text> with key = docid, value = 
> content.
> First: Map/Reduce over the document collection to generate the feature counts.
> Second: a sequential pass reads the output of that Map/Reduce and converts it 
> to a SequenceFile<Text, LongWritable> where key = feature, value = unique id.
>     The second stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard to create 
> partial SparseVectors (containing only the features of the given shard).
> Fourth: Map/Reduce over the partial vectors, grouping by docid, to create the 
> full document Vector.
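
The four stages in the description above can be sketched in plain Java, in memory and without Hadoop, to show how the data flows. This is an illustration under stated assumptions, not the code in the attached patches: the class DictionaryVectorizerSketch and all of its method names are hypothetical, whitespace tokenization with lowercasing is assumed, ids are ints rather than LongWritable, and ordinary maps stand in for the SequenceFiles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of the four stages; not the patch's code. */
final class DictionaryVectorizerSketch {

    /** Stage 1: count how often each feature occurs across all documents. */
    static Map<String, Long> countFeatures(Map<String, String> docs) {
        Map<String, Long> counts = new TreeMap<>();
        for (String content : docs.values()) {
            for (String token : tokenize(content)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Stage 2: sequential pass assigning unique ids, split into shards of at most shardSize features. */
    static List<Map<String, Integer>> buildShards(Map<String, Long> counts, int shardSize) {
        List<Map<String, Integer>> shards = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        int nextId = 0;
        for (String feature : counts.keySet()) {
            if (current.size() == shardSize) {
                shards.add(current);
                current = new LinkedHashMap<>();
            }
            current.put(feature, nextId++);
        }
        if (!current.isEmpty()) {
            shards.add(current);
        }
        return shards;
    }

    /** Stage 3: partial vector holding only the features found in one shard. */
    static Map<Integer, Double> partialVector(String content, Map<String, Integer> shard) {
        Map<Integer, Double> partial = new TreeMap<>();
        for (String token : tokenize(content)) {
            Integer id = shard.get(token);
            if (id != null) {
                partial.merge(id, 1.0, Double::sum); // raw term frequency
            }
        }
        return partial;
    }

    /** Stage 4: group the partial vectors by docid and merge them into full vectors. */
    static Map<String, Map<Integer, Double>> vectorize(Map<String, String> docs, int shardSize) {
        List<Map<String, Integer>> shards = buildShards(countFeatures(docs), shardSize);
        Map<String, Map<Integer, Double>> vectors = new TreeMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                // Shards have disjoint ids, so putAll never overwrites an entry.
                vectors.computeIfAbsent(doc.getKey(), k -> new TreeMap<>())
                       .putAll(partialVector(doc.getValue(), shard));
            }
        }
        return vectors;
    }

    /** Assumed tokenizer: lowercase, split on whitespace. */
    static String[] tokenize(String content) {
        String trimmed = content.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "a b a");
        docs.put("d2", "b c");
        System.out.println(vectorize(docs, 2));
    }
}
```

Sharding matters because each pass over the documents only needs one shard of the dictionary in memory; the final grouping by docid then stitches the partial vectors back together.
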

