Re: Writing custom Tokenizer

Siddhartha Singh Sandhu Mon, 28 Sep 2015 07:38:25 -0700

Resolved. Used \u0023 instead of #.
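
For anyone finding this thread later, here is a minimal sketch of the changed mapping lines. It assumes the rest of wdfftypes.txt stays as quoted below, and that only the line for the hash character needed to change, since a line starting with # is treated as a comment:

```
# Escaped form of the hash character; a literal "# => ALPHA" is read as a comment
\u0023 => ALPHA
@ => ALPHA
```

The field type has to be re-checked in the analysis screen after reloading the core, because the types file is read when the filter factory is initialized.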

On Mon, Sep 28, 2015 at 10:20 AM, Siddhartha Singh Sandhu <
[email protected]> wrote:


> Hi Ahmet,
>
> Worked partly. I might be doing something wrong:
>
> *My wdfftypes.txt is:*
>
> `
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> #
> # the default for any character without a mapping is always computed from
> # Unicode character properties
>
> # Map the $, %, '.', and ',' characters to DIGIT
> # This might be useful for financial data.
>
> # => ALPHA
> @ => ALPHA
> `
>
> The problem is that *#* is used for comments in this file. The @ sign works
> perfectly when I analyze it, but the # does not show the same behavior:
>
> [image: Inline image 1]
>
> My schema.xml has the following field corresponding to this analysis:
> `
>   <fieldtype name="subword_twit" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory"
>               generateWordParts="0" generateNumberParts="0" catenateWords="0"
>               catenateNumbers="0" catenateAll="0" preserveOriginal="1"
>               types="wdfftypes.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.StopFilterFactory"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory"
>               generateWordParts="0" generateNumberParts="0" catenateWords="0"
>               catenateNumbers="0" catenateAll="0" preserveOriginal="1"
>               types="wdfftypes.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.StopFilterFactory"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>   </fieldtype>
> `
>
> Regards,
>
> Sid.
>
> On Sun, Sep 27, 2015 at 9:23 PM, Ahmet Arslan <[email protected]>
> wrote:
>
>> Hi Sid,
>>
>>
>> One way is to use WhitespaceTokenizer and WordDelimiterFilter.
>>
>>
>> In some cases you might want to adjust how WordDelimiterFilter splits on
>> a per-character basis. To do this, you can supply a configuration file with
>> the "types" attribute that specifies custom character categories. An
>> example file is in subversion here. This is especially useful to add
>> "hashtag or currency" searches.
>>
>> Please see:
>>
>> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>> https://issues.apache.org/jira/browse/SOLR-2059
>>
>> @ => ALPHA
>> # => ALPHA
>>
>> P.S. Maintaining a custom tokenizer will be a burden. It is done with
>> *.jflex files blended with Java files.
>> Please see ClassicTokenizerImpl.jflex in the source tree for an example.
>>
>> Ahmet
>>
>>
>>
>>
>> On Monday, September 28, 2015 1:58 AM, Siddhartha Singh Sandhu <
>> [email protected]> wrote:
>>
>>
>>
>> Hi Ahmet,
>>
>> I want primarily 3 things.
>>
>> 1. To include # and @ as part of the string tokenized by the standard
>> tokenizer, which generally strips them off.
>> 2. When a string is tokenized, I just want to keep the tokens that are
>> #tags and @mentions.
>> 3. I understand there is PatternTokenizer, but I wanted to leverage
>> twitter-text on GitHub because I trust their regex more than my own.
>>
>> Beyond the above three, I also need to control which special
>> characters are stripped from my string while tokenizing.
>>
>> Please let me know of your views.
>>
>> Regards,
>>
>> Sid.
>>
>>
>> On Sun, Sep 27, 2015 at 5:21 PM, Ahmet Arslan <[email protected]>
>> wrote:
>>
>> Hi Sid,
>> >
>> >Can you provide us more details?
>> >
>> >Usually you can get away without a custom tokenizer; there may be other
>> tricks to achieve your requirements.
>> >
>> >Ahmet
>> >
>> >
>> >
>> >
>> >On Sunday, September 27, 2015 11:29 PM, Siddhartha Singh Sandhu <
>> [email protected]> wrote:
>> >
>> >
>> >
>> >Hi Everyone,
>> >
>> >I want to write a custom tokenizer and would like some general direction
>> and guidance on how I should go about achieving this goal.
>> >
>> >Your input will be much appreciated.
>> >
>> >Regards,
>> >
>> >Sid.
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: [email protected]
>> >For additional commands, e-mail: [email protected]
>> >
>> >
>>
>>
>>
>
