Re: Twitter analyser

Jack Krupansky Tue, 05 Nov 2013 08:42:40 -0800

You can specify custom character types with the word delimiter filter, so
you could define "@" and "#" as "digit" and set SPLIT_ON_NUMERICS. This
would cause "@foo" to tokenize as two adjacent terms ("@" and "foo"),
ditto for "#foo". Unfortunately, a user name or tag that starts with a
digit would not tokenize as desired, but that seems uncommon. "foo" would
match all three, since the "@" or "#" would tokenize as a separate term.


Use:


public WordDelimiterFilter(TokenStream in,
                           byte[] charTypeTable,
                           int configurationFlags,
                           CharArraySet protWords)

See:
http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
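
As a rough sketch of the charTypeTable idea above, something like the
following might do. The class and method names (MentionCharTypes,
buildTable) are made up for illustration. The type constants are copied in
so the snippet compiles without Lucene; they are assumed to match the
WordDelimiterFilter constants, so real code should reference those
directly. The fallback classification imitates what I believe the default
table does, which is also an assumption. The filter construction is shown
only in a comment and is untested.

```java
// Sketch only: MentionCharTypes and buildTable are hypothetical names.
final class MentionCharTypes {
    // Assumed to mirror the WordDelimiterFilter type constants in Lucene 4.5;
    // in real code, use WordDelimiterFilter.LOWER, UPPER, DIGIT, SUBWORD_DELIM.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;

    /** Builds a 256-entry type table in which '@' and '#' are classed as digits. */
    static byte[] buildTable() {
        byte[] table = new byte[256];
        for (int c = 0; c < table.length; c++) {
            if (Character.isLowerCase(c)) {
                table[c] = LOWER;
            } else if (Character.isUpperCase(c)) {
                table[c] = UPPER;
            } else if (Character.isDigit(c)) {
                table[c] = DIGIT;
            } else {
                table[c] = SUBWORD_DELIM;
            }
        }
        // The custom part: treat the mention and hashtag markers as digits so
        // that SPLIT_ON_NUMERICS separates them from the letters that follow.
        table['@'] = DIGIT;
        table['#'] = DIGIT;
        return table;
    }

    // Intended use with Lucene on the classpath (not compiled here; passing
    // null for protWords assumes there are no protected words):
    //
    //   int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
    //             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
    //             | WordDelimiterFilter.SPLIT_ON_NUMERICS;
    //   TokenStream out =
    //       new WordDelimiterFilter(in, MentionCharTypes.buildTable(), flags, null);
}
```

Whether the resulting terms give exactly the matching behaviour asked for
also depends on how the query side is analyzed, so it is worth checking the
token output on a few sample tweets before relying on it.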

-- Jack Krupansky

-----Original Message-----
From: Stéphane Nicoll
Sent: Tuesday, November 05, 2013 2:40 AM
To: java-user@lucene.apache.org
Subject: Twitter analyser

Hi,

I am building an application that indexes tweets and offers some basic
search facilities on them.

I am trying to find a combination where the following would work:

* foo matches the foo word, a mention (@foo) or the hashtag (#foo)
* @foo only matches the mention
* #foo matches only the hashtag

It should match complete words, so I used the WhitespaceAnalyzer for
indexing.


Any recommendation for this use case?

Thanks !
S.

Sent from my iPhone

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-user-h...@lucene.apache.org

