Resolved. I used \u0023 instead of a literal # in wdfftypes.txt, since a line starting with # is read as a comment.

On Mon, Sep 28, 2015 at 10:20 AM, Siddhartha Singh Sandhu <[email protected]> wrote:
> Hi Ahmet,
>
> Worked partly. I might be doing something wrong:
>
> *My wdfftypes.txt is:*
>
> `
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> #
> # the default for any character without a mapping is always computed from
> # Unicode character properties
>
> # Map the $, %, '.', and ',' characters to DIGIT
> # This might be useful for financial data.
>
> # => ALPHA
> @ => ALPHA
> `
>
> The problem is that *#* is used for comments in this file. The @ sign works
> perfectly when I analyze it, but the # does not show the same behavior:
>
> [image: Inline image 1]
>
> My schema.xml has the following field type corresponding to this analysis:
>
> `
> <fieldtype name="subword_twit" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0" catenateWords="0"
>             catenateNumbers="0" catenateAll="0" preserveOriginal="1"
>             types="wdfftypes.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0" catenateWords="0"
>             catenateNumbers="0" catenateAll="0" preserveOriginal="1"
>             types="wdfftypes.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldtype>
> `
>
> Regards,
>
> Sid.
>
> On Sun, Sep 27, 2015 at 9:23 PM, Ahmet Arslan <[email protected]>
> wrote:
>
>> Hi Sid,
>>
>> One way is to use WhitespaceTokenizer and WordDelimiterFilter.
>>
>> In some cases you might want to adjust how WordDelimiterFilter splits on
>> a per-character basis. To do this, you can supply a configuration file with
>> the "types" attribute that specifies custom character categories. An
>> example file is in subversion here. This is especially useful to add
>> "hashtag or currency" searches.
>>
>> Please see:
>>
>> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>> https://issues.apache.org/jira/browse/SOLR-2059
>>
>> @ => ALPHA
>> # => ALPHA
>>
>> P.S. Maintaining a custom tokenizer will be a burden. It is done with
>> *.jflex files blended with Java files.
>> Please see ClassicTokenizerImpl.jflex in the source tree for an example.
>>
>> Ahmet
>>
>>
>> On Monday, September 28, 2015 1:58 AM, Siddhartha Singh Sandhu
>> <[email protected]> wrote:
>>
>> Hi Ahmet,
>>
>> I primarily want three things:
>>
>> 1. To include # and @ as part of the string that is tokenized by the
>> standard tokenizer, which generally strips them off.
>> 2. When a string is tokenized, I just want to keep tokens which are #tags
>> and @mentions.
>> 3. I understand there is PatternTokenizer, but I wanted to leverage
>> twitter-text on GitHub because I trust their regex more than my own.
>>
>> Beyond those three, I also need to control which special
>> characters are stripped from my string while tokenizing.
>>
>> Please let me know your views.
>>
>> Regards,
>>
>> Sid.
>>
>>
>> On Sun, Sep 27, 2015 at 5:21 PM, Ahmet Arslan <[email protected]>
>> wrote:
>>
>> Hi Sid,
>> >
>> >Can you provide us more details?
>> >
>> >Usually you can get away without a custom tokenizer; there may be other
>> >tricks to achieve your requirements.
>> >
>> >Ahmet
>> >
>> >
>> >On Sunday, September 27, 2015 11:29 PM, Siddhartha Singh Sandhu
>> ><[email protected]> wrote:
>> >
>> >Hi Everyone,
>> >
>> >I wanted to write a custom tokenizer and wanted a generic direction and
>> >some guidance on how I should go about achieving this goal.
>> >
>> >Your input will be much appreciated.
>> >
>> >Regards,
>> >
>> >Sid.
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: [email protected]
>> >For additional commands, e-mail: [email protected]
>> >
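
For anyone finding this thread later: the fix reported at the top (writing # as the escape \u0023) would make the mapping part of wdfftypes.txt look roughly like the sketch below. It assumes the rest of the file and the schema.xml field type stay as quoted above; the two comment lines are illustrative additions, not part of the original file.

```
# A literal "#" at the start of a line begins a comment, so the hash
# character is written as its Unicode escape instead.
\u0023 => ALPHA
@ => ALPHA
```

After changing the file, the core probably needs to be reloaded (and existing documents reindexed, since the index-time analyzer uses the same file) before the analysis screen shows # being kept as part of the token.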
