LuceneIndexToSequenceFiles (lucene2seq) utility
-----------------------------------------------

                 Key: MAHOUT-944
                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
             Project: Mahout
          Issue Type: New Feature
          Components: Integration
    Affects Versions: 0.7
            Reporter: Frank Scholten
            Priority: Minor
             Fix For: 0.6, 0.5


Here is a lucene2seq tool I used in a project. It creates sequence files from 
the stored fields of a Lucene index.

The output from this tool can then be fed into seq2sparse, and from there you 
can do text clustering.

It comes with a Java bean configuration.
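
As a rough illustration of the idea (not the code in the patch), the sketch 
below models how the stored fields of one document could be turned into one 
key/value record for a sequence file, driven by a bean-style configuration. 
It deliberately uses plain Java collections instead of the Lucene and Hadoop 
APIs so that it runs on its own. The class name, the idField and fields 
properties, and the choice to join field values with a space are all 
assumptions made for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model only: a Map stands in for a document's stored fields,
// and a Map.Entry stands in for one (key, value) sequence file record.
final class Lucene2SeqSketch {

    // Hypothetical bean-style configuration; property names are assumptions.
    static final class Config {
        private String idField = "id";
        private List<String> fields = new ArrayList<>();

        public String getIdField() { return idField; }
        public void setIdField(String idField) { this.idField = idField; }
        public List<String> getFields() { return fields; }
        public void setFields(List<String> fields) { this.fields = fields; }
    }

    // Returns (id, joined field text), or null when the document has no id.
    static Map.Entry<String, String> toRecord(Map<String, String> storedFields,
                                              Config config) {
        String key = storedFields.get(config.getIdField());
        if (key == null) {
            return null; // nothing to key the record on, so skip the document
        }
        StringBuilder value = new StringBuilder();
        for (String field : config.getFields()) {
            String text = storedFields.get(field);
            if (text == null) {
                continue; // field not stored for this document
            }
            if (value.length() > 0) {
                value.append(' ');
            }
            value.append(text);
        }
        return new AbstractMap.SimpleEntry<>(key, value.toString());
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.setIdField("id");
        config.setFields(Arrays.asList("title", "body"));

        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("title", "lucene2seq");
        doc.put("body", "Converts stored fields");

        Map.Entry<String, String> record = toRecord(doc, config);
        System.out.println(record.getKey() + " => " + record.getValue());
    }
}
```

In a real run the map would be replaced by documents read from the index and 
the record would be appended to a sequence file writer; the mapping step 
itself would stay about this small.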

Let me know what you think. Some CLI code can be added later on. I used this 
for a small-scale project of roughly 100,000 docs. Would a MapReduce version 
be useful, or is that overkill?

See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
review comments from Simon Willnauer (thanks, Simon!), or see the attached 
patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
