[jira] [Commented] (MAHOUT-1598) extend seq2sparse to handle multiple text blocks of same document

ASF GitHub Bot (JIRA) Fri, 25 Jul 2014 04:33:50 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074305#comment-14074305
 ]


ASF GitHub Bot commented on MAHOUT-1598:
----------------------------------------

GitHub user wobu opened a pull request:

    https://github.com/apache/mahout/pull/34

    MAHOUT-1598

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wobu/mahout mahout-1598

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/34.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #34
    
----
commit ccaec1b267d768e890695dd36c49a34d621e1e73
Author: wobu <[email protected]>
Date:   2014-07-25T11:01:10Z

    extended TFPartialVectorReducer.java to handle multiple text blocks of one document

----


> extend seq2sparse to handle multiple text blocks of same document
> -----------------------------------------------------------------
>
>                 Key: MAHOUT-1598
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1598
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 1.0, 0.9
>            Reporter: Wolfgang Buchner
>
> Currently seq2sparse, or more precisely the 
> org.apache.mahout.vectorizer.DictionaryVectorizer, needs exactly one text 
> block per document as input.
> I stumbled on this because I have a use case where one document 
> represents a ticket, which can have several text blocks in different 
> languages.
> My idea was that the org.apache.mahout.vectorizer.DocumentProcessor should 
> tokenize each text block separately, so that I can use language-specific 
> features in our Lucene Analyzer.
> Unfortunately, the current implementation doesn't support this, 
> but with just minor changes it can be made possible.
> The only thing that would have to change is the 
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer, which should 
> handle all values of the iterable (not just the 1st one >.<).
> An alternative would be to change this Reducer to a Mapper. I don't see why 
> this was implemented as a reducer in the first place. Is there any benefit 
> to this?
> I will provide a PR via GitHub.
> Please have a look at this and tell me if I am assuming anything wrong.
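
The change described above (consume every value of the reducer's iterable rather than only the first) can be illustrated with a minimal plain-Java sketch. This is not the actual TFPartialVectorReducer code and it uses no Hadoop or Mahout classes; the class name MultiBlockTermCounts and the method countTerms are hypothetical, and each token block is assumed to arrive as a List<String>.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiBlockTermCounts {

    // Hypothetical helper, not Mahout code: counts terms across ALL token
    // blocks of one document instead of stopping after the first block.
    public static Map<String, Integer> countTerms(Iterable<List<String>> blocks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> block : blocks) {       // every block, not just the 1st
            for (String term : block) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // One "ticket" document with two text blocks in different languages.
        List<List<String>> ticket = Arrays.asList(
            Arrays.asList("printer", "broken"),
            Arrays.asList("drucker", "defekt", "printer"));
        System.out.println(countTerms(ticket));
    }
}
```

In this sketch a term that occurs in several blocks ("printer") is counted once per occurrence, so the resulting term frequencies cover the whole document regardless of how many blocks it was split into.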



--
This message was sent by Atlassian JIRA
(v6.2#6252)
