[ https://issues.apache.org/jira/browse/MAHOUT-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075277#comment-14075277 ]
ASF GitHub Bot commented on MAHOUT-1598:
----------------------------------------
Github user tdunning commented on the pull request:
https://github.com/apache/mahout/pull/34#issuecomment-50225044
This looks pretty good. I have a few questions:
a) is this backwards compatible? Is there a test to demonstrate this?
b) what documentation impact is there?
> extend seq2sparse to handle multiple text blocks of same document
> -----------------------------------------------------------------
>
> Key: MAHOUT-1598
> URL: https://issues.apache.org/jira/browse/MAHOUT-1598
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 1.0, 0.9
> Reporter: Wolfgang Buchner
>
> Currently, seq2sparse, or more specifically
> org.apache.mahout.vectorizer.DictionaryVectorizer, expects exactly one text
> block per document as input.
> I stumbled on this because I have a use case where one document represents a
> ticket that can contain several text blocks in different languages.
> My idea is that org.apache.mahout.vectorizer.DocumentProcessor should
> tokenize each text block separately, so that I can use language-specific
> features in our Lucene Analyzer.
> Unfortunately the current implementation doesn't support this, but it can be
> made possible with only minor changes.
> The only thing that needs to change is
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer: it should handle
> all values of the iterable, not just the first one (see the sketch after
> this description).
> An alternative would be to change this Reducer to a Mapper; I don't
> understand why it was implemented as a reducer in the first place. Is there
> any benefit to this?
> I will provide a PR via GitHub.
> Please have a look at this and tell me if any of my assumptions are wrong.
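For reference, here is a minimal sketch of the change proposed above for TFPartialVectorReducer: fold every StringTuple value of a document key into a single term-frequency vector instead of reading only the first one. The class name MultiBlockTFReducer and the fields (dictionary, dimension, sequentialAccess, namedVector) are illustrative assumptions modeled on the existing reducer, not the actual patch in the PR.
{code:java}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.map.OpenObjectIntHashMap;

/**
 * Hypothetical sketch of the proposed change: accumulate term frequencies over
 * ALL StringTuple values of a document key, not just the first one. Field names
 * mirror the existing TFPartialVectorReducer but are assumptions here; in the
 * real reducer they would be populated in setup() (omitted).
 */
public class MultiBlockTFReducer extends Reducer<Text, StringTuple, Text, VectorWritable> {

  private final OpenObjectIntHashMap<String> dictionary = new OpenObjectIntHashMap<String>();
  private int dimension;
  private boolean sequentialAccess;
  private boolean namedVector;

  @Override
  protected void reduce(Text key, Iterable<StringTuple> values, Context context)
      throws IOException, InterruptedException {
    Vector vector = new RandomAccessSparseVector(dimension, 10);
    boolean sawValue = false;
    for (StringTuple value : values) {           // previously: only the first value was read
      sawValue = true;
      for (String term : value.getEntries()) {   // count every token of every text block
        if (!term.isEmpty() && dictionary.containsKey(term)) {
          int termId = dictionary.get(term);
          vector.setQuick(termId, vector.getQuick(termId) + 1);
        }
      }
    }
    if (!sawValue) {
      return;                                    // no text blocks emitted for this key
    }
    if (sequentialAccess) {
      vector = new SequentialAccessSparseVector(vector);
    }
    if (namedVector) {
      vector = new NamedVector(vector, key.toString());
    }
    context.write(key, new VectorWritable(vector));
  }
}
{code}
Note that the reduce step is also what makes the multi-block case possible in the first place: the reduce call is where all tokenized blocks that share a document key are brought together, whereas a mapper would only ever see one record at a time.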
--
This message was sent by Atlassian JIRA
(v6.2#6252)