Hi Sid,

One way is to use WhitespaceTokenizer together with WordDelimiterFilter.


In some cases you might want to adjust how WordDelimiterFilter splits on a 
per-character basis. To do this, you can supply a configuration file with the 
"types" attribute that specifies custom character categories. An example file 
is in subversion here. This is especially useful to add "hashtag or currency" 
searches.

Please see: 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
https://issues.apache.org/jira/browse/SOLR-2059

@ => ALPHA
\# => ALPHA

(The # is escaped because a line starting with # in the types file is read as 
a comment.)
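
As a rough sketch only, a field type wired up along these lines might look 
like the fragment below. The field type name "text_social" and the file name 
"wdfftypes.txt" are assumptions for illustration; the file would hold the two 
mappings above and live in the core's conf directory. The flag values are just 
one plausible starting point, not a tested recommendation.

```xml
<!-- Hypothetical field type; the name and the types file name are assumptions. -->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so # and @ reach the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- types points at the file with the custom character categories. -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdfftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this only stops # and @ from being treated as delimiters; it does 
not by itself drop tokens that are neither hashtags nor mentions. The Analysis 
screen in the Solr admin UI is a convenient place to check what the chain 
actually emits.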

P.S. Maintaining a custom tokenizer will be a burden. Such tokenizers are 
written as *.jflex grammar files blended with Java files.
Please see ClassicTokenizerImpl.jflex in the source tree for an example.

Ahmet




On Monday, September 28, 2015 1:58 AM, Siddhartha Singh Sandhu 
<[email protected]> wrote:



Hi Ahmet,

I want primarily 3 things. 

1. To include # and @ as part of the string which is tokenized by the standard 
tokenizer, which generally strips them off.
2. When a string is tokenized, I just want to keep tokens which are #tags and 
@mentions.
3. I understand there is PatternTokenizer, but I wanted to leverage the 
twitter-text GitHub project because I trust their regex more than my own.

Beyond the three points above, I also need to control which special characters 
are stripped from my string during tokenization.

Please let me know your views.

Regards,

Sid.


On Sun, Sep 27, 2015 at 5:21 PM, Ahmet Arslan <[email protected]> wrote:

>Hi Sid,
>
>Can you provide us more details?
>
>Usually you can get away without a custom tokenizer; there may be other tricks 
>to achieve your requirements.
>
>Ahmet
>
>
>
>
>On Sunday, September 27, 2015 11:29 PM, Siddhartha Singh Sandhu 
><[email protected]> wrote:
>
>
>
>Hi Everyone,
>
>I want to write a custom tokenizer and would like some general direction and 
>guidance on how I should go about it.
>
>Your input will be much appreciated.
>
>Regards,
>
>Sid.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [email protected]
>For additional commands, e-mail: [email protected]
>
>
