[
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900755#action_12900755
]
Stanislaw Osinski commented on SOLR-1804:
-----------------------------------------
Hi Robert,
Some initial work on tighter integration with Solr should be possible after
applying the patch from this issue. The patch contains a Solr-specific
implementation of Carrot2's
[ILanguageModel|http://download.carrot2.org/stable/javadoc/org/carrot2/text/linguistic/ILanguageModel.html]
interface. My rough guess is that the implementation of that interface could
be further tweaked to create IStemmer and ITokenizer implementations based on
the schema.xml settings. It could also implement the isCommonWord() method
based on Solr's resources. A few notes though:
* Carrot2 is slightly different from typical IR in the sense that it doesn't
completely discard stop words -- the tokenizer does not remove them from the
token stream. The reason for this is that the cluster labels are taken
literally from the input text and if we discard stop words, the labels won't be
as readable.
* The ILanguageModel#isStopLabel() method is another Carrot2-specific thing.
It's a more fine-grained method of removing useless labels, especially useful
for domain-specific content. Carrot2's default implementation is based on
regular expressions similar to
[this|https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/core/carrot2-util-text/src-resources/stoplabels.en].
I'm not sure if there's a corresponding resource in Solr though.
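To make the two notes above more concrete, here is a minimal, self-contained sketch. It does not use the real Carrot2 or Solr APIs: the class, the Token type, the word list and the patterns are all hypothetical stand-ins. It only illustrates the idea that stop words are flagged rather than dropped from the token stream, and that stop labels are rejected by matching the whole label against regular expressions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: none of these types are Carrot2 or Solr classes.
final class LabelFilterSketch {

    // Hypothetical stop word list; a real integration might load Solr's resources.
    static final Set<String> COMMON_WORDS =
        new HashSet<>(Arrays.asList("a", "and", "in", "of", "the"));

    // Hypothetical stop label patterns, each matched against the whole label.
    static final List<Pattern> STOP_LABELS = Arrays.asList(
        Pattern.compile("click here", Pattern.CASE_INSENSITIVE),
        Pattern.compile("(more|further) info(rmation)?", Pattern.CASE_INSENSITIVE));

    // A token that remembers whether it is a common word instead of being removed.
    static final class Token {
        final String text;
        final boolean common;

        Token(String text, boolean common) {
            this.text = text;
            this.common = common;
        }
    }

    // Splits on whitespace and keeps every word, marking the common ones.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            boolean common = COMMON_WORDS.contains(word.toLowerCase(Locale.ROOT));
            tokens.add(new Token(word, common));
        }
        return tokens;
    }

    // Returns true if the candidate label fully matches any stop label pattern.
    static boolean isStopLabel(String label) {
        for (Pattern pattern : STOP_LABELS) {
            if (pattern.matcher(label).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("State of the Art")) {
            System.out.println(token.text + " common=" + token.common);
        }
        System.out.println(isStopLabel("Click here"));
    }
}
```

Because the common words stay in the stream, a label such as "State of the Art" can still be taken literally from the input, while the flag lets the clustering step ignore those words where needed.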
We're thinking of restructuring Carrot2's language model a bit in one of the
next releases, so it's a good chance to include some Solr-inspired improvements
as well.
S.
> Upgrade Carrot2 to 3.2.0
> ------------------------
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Clustering
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch,
> SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804-carrot2-3.4.0-libs.zip,
> SOLR-1804.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL free, which means we should be able to bundle the binary!