Hi Sid,

One way is to use WhitespaceTokenizer and WordDelimiterFilter.

In some cases you might want to adjust how WordDelimiterFilter splits on a per-character basis. To do this, you can supply a configuration file with the "types" attribute that specifies custom character categories. An example file is available in the Subversion repository. This is especially useful for adding "hashtag or currency" searches.

Please see:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
https://issues.apache.org/jira/browse/SOLR-2059

Example mappings:

@ => ALPHA
# => ALPHA

P.S. Maintaining a custom tokenizer will be a burden. It is done with *.jflex files combined with Java files. Please see ClassicTokenizerImpl.jflex in the source tree for an example.

Ahmet


On Monday, September 28, 2015 1:58 AM, Siddhartha Singh Sandhu <[email protected]> wrote:

Hi Ahmet,

I want primarily 3 things.

1. To keep # and @ as part of the tokens; the standard tokenizer generally strips them off.
2. When a string is tokenized, I want to keep only the tokens that are #tags and @mentions.
3. I understand there is PatternTokenizer, but I wanted to leverage twitter-text from GitHub because I trust their regex more than my own.

Beyond these three, I also need to control which special characters are stripped from my string during tokenization.

Please let me know your views.

Regards,

Sid.


On Sun, Sep 27, 2015 at 5:21 PM, Ahmet Arslan <[email protected]> wrote:

>Hi Sid,
>
>Can you provide us more details?
>
>Usually you can get away without a custom tokenizer; there may be other tricks to achieve your requirements.
>
>Ahmet
>
>
>On Sunday, September 27, 2015 11:29 PM, Siddhartha Singh Sandhu <[email protected]> wrote:
>
>Hi Everyone,
>
>I wanted to write a custom tokenizer and wanted a generic direction and some guidance on how I should go about achieving this goal.
>
>Your input will be much appreciated.
>
>Regards,
>
>Sid.
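
For reference, the WhitespaceTokenizer plus WordDelimiterFilter suggestion in the reply at the top of this thread might look roughly like the schema.xml fragment below. This is only a sketch under assumptions: the field type name "text_social" and the file name "wdfftypes.txt" are hypothetical, and the flag values are one plausible choice rather than a recommendation from the thread. Note also that lines beginning with # are, as far as I know, read as comments in the types file, so the hash sign is written there in escaped form; please verify this against your Solr version. The fragment only keeps # and @ attached to their tokens. It does not drop tokens that are neither #tags nor @mentions.

```xml
<!-- Hypothetical field type: keeps '#' and '@' attached to their tokens. -->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "#solr" and "@sid" survive tokenization. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "types" points at a file of per-character category overrides. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!--
  Assumed contents of conf/wdfftypes.txt (one mapping per line):

    @ => ALPHA
    \u0023 => ALPHA

  \u0023 is the hash sign, escaped because a leading hash is assumed
  to start a comment line in this file format.
-->
```

With both characters treated as ALPHA, WordDelimiterFilter should no longer see them as delimiters, so a token such as "#solr" would be expected to pass through intact. The Analysis screen in the Solr admin UI is a quick way to check this.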

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
