[ 
https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919361#comment-13919361
 ] 

Suneel Marthi commented on MAHOUT-1252:
---------------------------------------

Hi Drew,

I have some starter code for this too, though it's been a while since I looked 
at it. Glad to discuss options.

1. Yes, this would be a third dictionary output format, in addition to the Text 
and SequenceFile output formats already present.
2. That's correct, the idea is to avoid having to go through a one-shot 
seq2sparse for new document additions. 
3. term index, df, tf-idf ???

While on this topic, below is an earlier email with Ted's comments from when 
the question of running multiple document corpora through seq2sparse came up:

{Code}

SVFSF really is designed for a one-shot sort of processing.

The issues arise with all of the corpus frequency cutoffs and such.  N-gram
detection, frequency cutoffs and so on are all going to be problems with
piecewise conversion.

If all you use it for is tokenizing, then there isn't a problem.

If you are interested in a more incremental architecture, I expect that it
would be best to

a) switch to a more incremental sort of dictionary so that new tokens can
be added easily

b) not use Strings so much in the tokenization (could result in substantial
speedups)

c) define an intermediate format for token and n-gram counts

d) write code that supports combination of sub-corpora.

The other very interesting option would be to simply create Lucene indices
as your document repository format.  These would satisfy requirements a
through d quite easily.

{Code}
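
To make points (a), (c) and (d) above a bit more concrete, here is a rough 
sketch of what an incremental, mergeable dictionary could look like. This is 
not Mahout code: the class name IncrementalDictionary and all of its methods 
are hypothetical, everything is kept in memory, and n-gram detection and 
frequency cutoffs are left out entirely. The idf formula is the plain textbook 
log(N/df), not necessarily what seq2sparse computes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; not part of Mahout's API.
final class IncrementalDictionary {
    // term -> dense integer index, assigned in order of first appearance
    private final Map<String, Integer> termIndex = new LinkedHashMap<>();
    // term -> number of documents that contain the term at least once
    private final Map<String, Long> docFreq = new HashMap<>();
    private long numDocs = 0;

    /** Adds one tokenized document; unseen terms get the next free index. */
    void addDocument(List<String> tokens) {
        for (String term : tokens) {
            termIndex.putIfAbsent(term, termIndex.size());
        }
        // Document frequency counts each term once per document.
        Set<String> distinct = new HashSet<>(tokens);
        for (String term : distinct) {
            docFreq.merge(term, 1L, Long::sum);
        }
        numDocs++;
    }

    /**
     * Folds the counts of another sub-corpus into this one. Indices already
     * assigned here stay stable; terms new to this dictionary are appended.
     * Vectors built against the other dictionary's indices would still need
     * to be remapped, which this sketch does not do.
     */
    void merge(IncrementalDictionary other) {
        for (String term : other.termIndex.keySet()) {
            termIndex.putIfAbsent(term, termIndex.size());
        }
        for (Map.Entry<String, Long> e : other.docFreq.entrySet()) {
            docFreq.merge(e.getKey(), e.getValue(), Long::sum);
        }
        numDocs += other.numDocs;
    }

    /** Returns the index of a term, or -1 if the term is unknown. */
    int indexOf(String term) {
        return termIndex.getOrDefault(term, -1);
    }

    long documentFrequency(String term) {
        return docFreq.getOrDefault(term, 0L);
    }

    long numDocs() {
        return numDocs;
    }

    /** Textbook idf, log(N / df); returns 0.0 for unknown terms. */
    double idf(String term) {
        long df = documentFrequency(term);
        return df == 0 ? 0.0 : Math.log((double) numDocs / df);
    }
}
```

Because the dictionary stores raw counts rather than finished weights, 
corpus-level statistics such as idf can be recomputed after every merge. Any 
frequency cutoffs would have to be applied at vectorization time rather than 
baked into the dictionary.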

If we were to redo all of this, could we leverage Lucene 4.7's in-memory term 
vectors? I haven't looked at Lucene 4.7 closely yet.



> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType. This 
> should result in an order of magnitude speedup of seq2sparse.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
