Hi everyone,

Looking at the text_ci fieldType in our Solr schema.xml, I'm trying to
figure out why we use a PatternTokenizerFactory:

    <!-- As above, but preserves original tokens and doesn't require a
         phrase match on queries that yield multiple tokens.
         Intended for fuzzier matching of usernames with varied case. -->
    <fieldType name="text_ci" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory"
                   pattern="(\p{Punct}|\p{Space})+" />
        <!-- Case insensitive stop word removal. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

According to the PatternTokenizerFactory documentation, a pattern
specified without a group behaves like String.split(<pattern>), so
this tokenizer just splits the string on runs of punctuation and
whitespace. I think the StandardTokenizer could do this more directly,
and since 3.1 / 4.0 it supports Unicode.

Does anyone see any problems with switching this to the StandardTokenizer?
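
For concreteness, the change I have in mind would look roughly like
this. It is an untested sketch: only the tokenizer line differs from
the current definition, and I have not verified that the
StandardTokenizer splits usernames at the same places as the regex
above does.

    <fieldType name="text_ci" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <!-- Proposed: replaces the solr.PatternTokenizerFactory line. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Case insensitive stop word removal. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>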

(By the way, I'm fairly new to Solr indexing, and wiki.apache.org is
down right now, so sorry if this proposal is fundamentally wrong.)

-- 
Cheers,
Branden
_______________________________________________
oae-dev mailing list
[email protected]
http://collab.sakaiproject.org/mailman/listinfo/oae-dev
