That sure looks right to me, Branden. Since I can reproduce this
reliably now, I'll test your suggestion out.
Best,
Ray
On 6/26/12 2:16 PM, Branden Visser wrote:
> Hi everyone,
>
> Looking at the text_ci fieldType in our Solr schema.xml, I'm trying to
> figure out why we use a PatternTokenizerFactory:
>
> <!-- As above, but preserves original tokens and doesn't require a
> phrase match on queries that yield multiple tokens.
> Intended for fuzzier matching of usernames with varied case. -->
> <fieldType name="text_ci" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
> <analyzer>
> <tokenizer class="solr.PatternTokenizerFactory"
> pattern="(\p{Punct}|\p{Space})+" />
> <!-- Case insensitive stop word removal. -->
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
> Based on the documentation of PatternTokenizerFactory, specifying a
> pattern without a group is equivalent to calling
> String.split(<pattern>), which just splits the string on runs of
> punctuation and whitespace. I think the StandardTokenizer could do this in
> a more direct way, and since 3.1 / 4.0 that tokenizer supports Unicode.
>
> Does anyone see any problems with switching this to the StandardTokenizer?
>
> (By the way, I'm a little clueless with Solr indexing and
> wiki.apache.org is down right now, so sorry if this proposal is
> fundamentally wrong.)
>
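
For reference, the change proposed above might look roughly like the sketch below. It assumes that only the tokenizer is swapped for solr.StandardTokenizerFactory and that the rest of the filter chain stays exactly as quoted; it is untested against this schema. StandardTokenizer follows Unicode word-break rules rather than splitting on every punctuation character, so tokens may differ from the pattern-based split for some inputs (for example, usernames containing underscores or dots), which is worth checking before switching.

```xml
<!-- Sketch only: the text_ci fieldType from the quoted message with the
     tokenizer swapped. Everything except the <tokenizer> element is
     copied unchanged from the original definition. -->
<fieldType name="text_ci" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Assumption: replaces solr.PatternTokenizerFactory with
         pattern="(\p{Punct}|\p{Space})+" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Case insensitive stop word removal. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents would need to be reindexed after a tokenizer change for index-time and query-time tokens to line up.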