Ah, thanks! That's very good to know. As it happens, I realized we already have an earlier component where we can handle this (we have custom ICUTokenizer RBBI rules and can just split on "^"). So much flexibility!
-Mike

On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <rcm...@gmail.com> wrote:

> Actually, you can now choose to ignore certain characters by using the
> Unicode filtering mechanism.
>
> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>
> So apply a filter such as [^\^] and the filter will ignore ^.
>
> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcm...@gmail.com> wrote:
> > This cannot be "tweaked" at runtime; it is implemented as custom
> > normalization.
> >
> > You can modify the sources / build your own ruleset or use a different
> > tokenfilter to normalize characters.
> >
> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msoko...@gmail.com>
> > wrote:
> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly
> >> what I want. However, there are some behaviors I'd like to tweak. For
> >> example, it maps "aaa^bbb" to "aaabbb". I am trying to understand why
> >> it does that, and whether there is any way to prevent it.
> >>
> >> I spent a little time with
> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I
> >> guess is the basis for what this filter does (it's referenced in the
> >> javadocs), but that didn't answer my questions. As an aside, it seems
> >> this tech report was withdrawn by the Unicode Consortium? Not sure
> >> what that means, if anything, but it seems ominous.
> >>
> >> Anyway, I would appreciate pointers to more info, and specifically,
> >> whether there are any alternatives to the utr30.nrm data file, or any
> >> possibility to select among the many transformations this filter
> >> applies.
> >>
> >> Thanks!
> >>
> >> Mike S
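
For anyone reading this in the archive, here is a rough sketch of the idea behind the filter Robert mentions: folding is applied only to characters that are in the filter set, and everything outside the set passes through untouched. This is a plain-JDK toy, not Lucene's or ICU's code. The class name FilteredFoldSketch, the methods toyFold and foldFiltered, and the fold rules themselves (lowercase, drop "^") are all made up for illustration and are not the utr30.nrm data.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration only: a stand-in for "apply folding through a
// character filter". None of these names come from Lucene or ICU.
public class FilteredFoldSketch {

    // Toy fold (an assumption, NOT the real utr30 rules): lowercase the
    // code point, and delete '^' to mimic the behavior reported in the thread.
    public static String toyFold(int codePoint) {
        if (codePoint == '^') {
            return "";
        }
        return new String(Character.toChars(Character.toLowerCase(codePoint)));
    }

    // Fold only the code points accepted by the filter; copy the rest as-is.
    public static String foldFiltered(String input, IntPredicate filter) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (filter.test(cp)) {
                out.append(toyFold(cp));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        IntPredicate everything = cp -> true;
        // Same idea as the set [^\^]: every character except '^'.
        IntPredicate allButCaret = cp -> cp != '^';

        System.out.println(foldFiltered("AAA^bbb", everything));   // aaabbb
        System.out.println(foldFiltered("AAA^bbb", allButCaret));  // aaa^bbb
    }
}
```

With the "everything" filter the caret is folded away, as in the original report; with the caret excluded from the set, it survives while the other characters are still folded. How the real filter is wired up is described in LUCENE-8129, which I have not reproduced here.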