Re: Writing custom Tokenizer

Siddhartha Singh Sandhu Mon, 28 Sep 2015 07:21:19 -0700

Hi Ahmet,

It worked partly. I might be doing something wrong:


*My wdfftypes.txt is:*

`
# A customized type mapping for WordDelimiterFilterFactory
# the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM,
# SUBWORD_DELIM
#
# the default for any character without a mapping is always computed from
# Unicode character properties

# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.

# => ALPHA
@ => ALPHA
`

The problem is that *#* is used for comments in this file. The @ sign works
perfectly when I analyze it, but the # does not behave the same way:

[image: Inline image 1]
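
One possible explanation, sketched under assumptions rather than taken from the Solr source: if the loader discards every line that starts with `#` before it parses the `char => TYPE` rules, then `# => ALPHA` never reaches the parser, while `@ => ALPHA` does. The Python model below is hypothetical (`parse_types` and `unescape` are illustrative names, not Solr APIs). It assumes that a backslash escapes the next character and that `\uXXXX` denotes a code point, which would make `\# => ALPHA` or `\u0023 => ALPHA` candidate spellings to try in wdfftypes.txt.

```python
import re

# Simplified, hypothetical model of reading a WordDelimiterFilterFactory
# types file. This is NOT Solr's code. Assumptions: lines starting with '#'
# are dropped as comments, rules look like "<chars> => <TYPE>", and a
# backslash escapes the next character (\uXXXX is a 4-digit hex code point).

RULE = re.compile(r"(.*?)\s*=>\s*(.*?)\s*$")


def unescape(text):
    # Resolve backslash escapes in the left-hand side of a rule.
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "u" and i + 6 <= len(text):
                # \uXXXX: four hex digits naming a code point
                out.append(chr(int(text[i + 2:i + 6], 16)))
                i += 6
                continue
            # Any other escaped character stands for itself, e.g. \# is '#'
            out.append(nxt)
            i += 2
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def parse_types(file_text):
    # Return {character(s): type name}, skipping blank and comment lines.
    mapping = {}
    for raw in file_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # treated as a comment, so "# => ALPHA" is lost here
        match = RULE.match(line)
        if match is None:
            raise ValueError("not a 'char => TYPE' rule: " + line)
        mapping[unescape(match.group(1))] = match.group(2)
    return mapping


unescaped_file = "# => ALPHA\n@ => ALPHA"
escaped_file = "\\# => ALPHA\n@ => ALPHA"

print(parse_types(unescaped_file))  # {'@': 'ALPHA'}
print(parse_types(escaped_file))    # {'#': 'ALPHA', '@': 'ALPHA'}
```

In this model only the escaped form produces a mapping for `#`. Whether the real factory accepts the same escapes should be confirmed against the Solr version in use, for example by re-running the analysis after changing the file.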

My schema.xml has the following field corresponding to this analysis:
`
  <fieldtype name="subword_twit" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="0" generateNumberParts="0" catenateWords="0"
              catenateNumbers="0" catenateAll="0" preserveOriginal="1"
              types="wdfftypes.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="0" generateNumberParts="0" catenateWords="0"
              catenateNumbers="0" catenateAll="0" preserveOriginal="1"
              types="wdfftypes.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldtype>
`

Regards,

Sid.

On Sun, Sep 27, 2015 at 9:23 PM, Ahmet Arslan <[email protected]>
wrote:

> Hi Sid,
>
>
> One way is to use WhitespaceTokenizer and WordDelimiterFilter.
>
>
> In some cases you might want to adjust how WordDelimiterFilter splits on a
> per-character basis. To do this, you can supply a configuration file with
> the "types" attribute that specifies custom character categories. An
> example file is in subversion here. This is especially useful to add
> "hashtag or currency" searches.
>
> Please see:
>
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> https://issues.apache.org/jira/browse/SOLR-2059
>
> @ => ALPHA
> # => ALPHA
>
> P.S. Maintaining a custom tokenizer will be a burden. It is done with
> *.jflex files blended with Java files.
> Please see ClassicTokenizerImpl.jflex in the source tree for an example.
>
> Ahmet
>
>
>
>
> On Monday, September 28, 2015 1:58 AM, Siddhartha Singh Sandhu <
> [email protected]> wrote:
>
>
>
> Hi Ahmet,
>
> I want primarily 3 things.
>
> 1. To include # and @ as part of the tokens produced by the standard
> tokenizer, which generally strips them off.
> 2. When a string is tokenized, I just want to keep tokens which are #tags
> and @mentions.
> 3. I understand there is PatternTokenizer, but I wanted to leverage the
> twitter-text GitHub project because I trust their regex more than my own.
>
> Beyond these three, I also need to control which special
> characters are stripped from my string during tokenization.
>
> Please let me know of your views.
>
> Regards,
>
> Sid.
>
>
> On Sun, Sep 27, 2015 at 5:21 PM, Ahmet Arslan <[email protected]>
> wrote:
>
> >Hi Sid,
> >
> >Can you provide us with more details?
> >
> >Usually you can get away without a custom tokenizer; there may be other
> >tricks to achieve your requirements.
> >
> >Ahmet
> >
> >
> >
> >
> >On Sunday, September 27, 2015 11:29 PM, Siddhartha Singh Sandhu <
> >[email protected]> wrote:
> >
> >
> >
> >Hi Everyone,
> >
> >I wanted to write a custom tokenizer and wanted a generic direction and
> >some guidance on how I should go about achieving this goal.
> >
> >Your input will be much appreciated.
> >
> >Regards,
> >
> >Sid.
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [email protected]
> >For additional commands, e-mail: [email protected]
> >
> >
