That was my first thought as well.

But I think a better answer is to mark the vector as stretchy so that it
reports the high-water mark as its actual size, but if you insert a non-zero
element above that size, it will report the new high-water mark thereafter.

This keeps the code simple and clear.  The only change needed is to soften
the out-of-bounds checks for put.
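To make the idea concrete, here is a minimal sketch of the stretchy behavior. The class name and the plain HashMap backing are illustrative only, not Mahout's actual SparseVector implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: size() reports the high-water mark, and setting a
// non-zero value above the current size raises the mark instead of throwing
// an out-of-bounds exception (the "softened" put check).
class StretchySparseVector {
    private final Map<Integer, Double> values = new HashMap<>();
    private int size; // current high-water mark

    StretchySparseVector(int initialSize) {
        this.size = initialSize;
    }

    void set(int index, double value) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value != 0.0) {
            values.put(index, value);
            if (index >= size) {
                size = index + 1; // stretch to the new high-water mark
            }
        } else {
            // storing an explicit zero neither stretches nor stores anything
            values.remove(index);
        }
    }

    double get(int index) {
        return values.getOrDefault(index, 0.0);
    }

    int size() {
        return size;
    }
}
```

With this, a caller never needs a separate getHighestNonZeroIndex(): size() already tracks it implicitly once any non-zero element lands past the declared size.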

On Tue, Feb 9, 2010 at 5:57 AM, Sean Owen (JIRA) <j...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831457#action_12831457]
>
> Sean Owen commented on MAHOUT-237:
> ----------------------------------
>
> Sounds like what you really need (and what I could use) is something like
> getHighestNonZeroIndex() ?
>
> > Map/Reduce Implementation of Document Vectorizer
> > ------------------------------------------------
> >
> >                 Key: MAHOUT-237
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
> >             Project: Mahout
> >          Issue Type: New Feature
> >    Affects Versions: 0.3
> >            Reporter: Robin Anil
> >            Assignee: Robin Anil
> >             Fix For: 0.3
> >
> >         Attachments: DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch,
> SparseVector-VIntWritable.patch
> >
> >
> > Current Vectorizer uses Lucene Index to convert documents into
> SparseVectors
> > Ted is working on a Hash based Vectorizer which can map features into
> Vectors of fixed size and sum it up to get the document Vector
> > This is a pure bag-of-words based Vectorizer written in Map/Reduce.
> > The input document is in SequenceFile<Text,Text>, with key = docid,
> value = content
> > First Map/Reduce over the document collection and generate the feature
> counts.
> > Second Sequential pass reads the output of the map/reduce and converts
> them to SequenceFile<Text, LongWritable> where key=feature, value = unique
> id
> >     Second stage should create shards of features of a given split size
> > Third Map/Reduce over the document collection, using each shard and
> create Partial(containing the features of the given shard) SparseVectors
> > Fourth Map/Reduce over partial shard, group by docid, create full
> document Vector
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Ted Dunning, CTO
DeepDyve
