Re: [oae-dev] Lucene PatternTokenizerFactory

Ray Davis Tue, 26 Jun 2012 14:45:15 -0700

That sure looks right to me, Branden. Since I can reproduce this 
reliably now, I'll test your suggestion out.


Best,
Ray

On 6/26/12 2:16 PM, Branden Visser wrote:
> Hi everyone,
>
> Looking at the text_ci fieldType in our Solr schema.xml, I'm trying to
> figure out why we use a PatternTokenizerFactory:
>
>      <!-- As above, but preserves original tokens and doesn't require a
>           phrase match on queries that yield multiple tokens.
>           Intended for fuzzier matching of usernames with varied case. -->
>      <fieldType name="text_ci" class="solr.TextField"
>                 positionIncrementGap="100" autoGeneratePhraseQueries="true">
>        <analyzer>
>          <tokenizer class="solr.PatternTokenizerFactory"
>                     pattern="(\p{Punct}|\p{Space})+" />
>          <!-- Case insensitive stop word removal. -->
>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>                  words="stopwords.txt" enablePositionIncrements="true"/>
>          <filter class="solr.LowerCaseFilterFactory"/>
>          <filter class="solr.KeywordMarkerFilterFactory"
>                  protected="protwords.txt"/>
>          <filter class="solr.PorterStemFilterFactory"/>
>        </analyzer>
>      </fieldType>
>
> Based on the documentation of PatternTokenizerFactory, specifying a
> pattern without a group is equivalent to calling
> String.split(<pattern>), which just splits the string on runs of
> punctuation and whitespace. I think the StandardTokenizer could do this in
> a more direct way, and since 3.1 / 4.0 the tokenizer supports Unicode.
>
> Does anyone see any problems with switching this to the StandardTokenizer?
>
> (By the way, I'm a little clueless with Solr indexing, and
> wiki.apache.org is down right now. Sorry if this proposal is
> fundamentally wrong.)
>
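
For reference, the split behaviour described in the quoted message can be
tried out in plain Java without Solr on the classpath. This is only a rough
sketch: the class name SplitSketch and the helper tokens() are made up for
illustration, the regex is copied from the pattern attribute of text_ci, and
dropping empty pieces is an assumption about how the tokenizer treats a
leading delimiter. None of the filters in the analyzer chain (stop words,
lowercasing, stemming) are applied here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSketch {
    // Same regex as the pattern attribute on the text_ci tokenizer.
    static final Pattern DELIMS = Pattern.compile("(\\p{Punct}|\\p{Space})+");

    // Hypothetical helper: split on runs of punctuation/whitespace and
    // drop empty pieces (assumed here; String.split can yield a leading
    // empty string when the input starts with a delimiter).
    public static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        for (String piece : DELIMS.split(input)) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Ray, Davis, branden, visser, example]
        System.out.println(tokens("Ray.Davis, branden_visser@example"));
    }
}
```

Note that '.', '_' and '@' all count as \p{Punct} in java.util.regex, so a
username containing them is broken into several tokens by this pattern.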


_______________________________________________
oae-dev mailing list
[email protected]
http://collab.sakaiproject.org/mailman/listinfo/oae-dev
