[ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491852 ]
Ken Krugler commented on SOLR-211:
----------------------------------
I think we must be working on similar types of projects :)
I did something similar to the above, but in two different ways:
# I extended WhitespaceTokenizerFactory to take optional pattern and replacement
parameters. If both are set, I apply them to the input before the tokenizer
gets called. This lets me take a blob of XML going into a Solr field and strip
out everything except the content of the one element that I want to index.
# I added a CSVTokenizerFactory, which takes an optional split character and an
optional remapping file. This lets me get a field like "Java,Python,C#" and
turn it into "java python csharp", which are the index tokens I need, while
leaving the display text as-is.
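
To make the two cases above concrete, here is a minimal standalone sketch of the string handling involved. It is plain Java and deliberately does not use the Solr or Lucene tokenizer APIs, so it is not the code from my factories. The class name, method names, the example XML, and the in-memory remap table (standing in for the remapping file) are all made up for illustration, and the lowercasing step is an assumption based on the "java python csharp" output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual TokenizerFactory code.
class PreTokenizeSketch {

    // Case 1: apply an optional pattern/replacement, then split on whitespace.
    static List<String> replaceThenWhitespaceTokenize(String input, String pattern, String replacement) {
        String text = input;
        if (pattern != null && replacement != null) {
            // DOTALL so that '.' also matches newlines inside multi-line XML.
            text = Pattern.compile(pattern, Pattern.DOTALL).matcher(text).replaceAll(replacement);
        }
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Case 2: split on a single character, lowercase (assumed), then remap known tokens.
    static List<String> csvTokenize(String input, char splitChar, Map<String, String> remap) {
        List<String> tokens = new ArrayList<>();
        for (String raw : input.split(Pattern.quote(String.valueOf(splitChar)))) {
            String t = raw.trim().toLowerCase(Locale.ROOT);
            if (t.isEmpty()) {
                continue;
            }
            // Tokens without a remap entry pass through unchanged.
            tokens.add(remap.getOrDefault(t, t));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<doc><id>42</id><content>hello solr world</content></doc>";
        // Keep only what is inside <content>...</content>, then tokenize.
        System.out.println(replaceThenWhitespaceTokenize(xml, "^.*<content>(.*)</content>.*$", "$1"));
        // prints [hello, solr, world]

        Map<String, String> remap = new HashMap<>();
        remap.put("c#", "csharp"); // stand-in for one line of the remapping file
        System.out.println(csvTokenize("Java,Python,C#", ',', remap));
        // prints [java, python, csharp]
    }
}
```

In both cases only the index tokens change; the stored field value that gets displayed is left alone.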
I don't know if your new PatternTokenizerFactory could replace either of these,
though. For the first case, I still want whitespace tokenization after
I've stripped out all the junk I don't want. And for the second, I need to be
able to do the remapping.
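
As a rough illustration of the gap, here is what a bare string.split( regex ) (the behavior given in the issue description) produces for inputs like mine. This is plain Java with a made-up class name and made-up sample strings. It says nothing about what options the actual patch supports.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of split-only tokenization.
class RegexSplitSketch {

    // Tokens are exactly the pieces returned by String.split(regex).
    static List<String> splitTokens(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        // Splitting works, but nothing lowercases or remaps "C#" to "csharp".
        System.out.println(splitTokens("Java,Python,C#", ","));
        // prints [Java, Python, C#]

        // Whitespace splitting alone leaves the unwanted markup in the tokens.
        System.out.println(splitTokens("<id>42</id> <content>hello world</content>", "\\s+"));
        // prints [<id>42</id>, <content>hello, world</content>]
    }
}
```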
> regex split() Tokenizer
> -----------------------
>
> Key: SOLR-211
> URL: https://issues.apache.org/jira/browse/SOLR-211
> Project: Solr
> Issue Type: New Feature
> Components: search
> Reporter: Ryan McKinley
> Assigned To: Ryan McKinley
> Attachments: SOLR-211-RegexSplitTokenizer.patch,
> SOLR-211-RegexSplitTokenizer.patch, SOLR-211-RegexSplitTokenizer.patch
>
>
> A TokenizerFactory that makes tokens from:
> string.split( regex );