That's good to know. If we go this route, we'll definitely either use the factory or follow its example. Thanks again.
-Mike

On Mon, Jun 4, 2018 at 9:12 PM, Robert Muir <rcm...@gmail.com> wrote:

> There may be traps, e.g. if you make such a filter with UnicodeSet,
> I think you really need to call .freeze() before passing it to this
> thing. I have not examined the sources in a while but I think this
> might be similar to "compiling a regexp" in that you'll then get good
> performance when it's later called millions of times.
>
> If you use the factories, it will do this for you. But if you use the
> API directly it is currently a bit of a performance trap...
>
> On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov <msoko...@gmail.com>
> wrote:
>
> > Ah thanks! That's very good to know. As it is I realized we already
> > have an earlier component where we can handle this (we have a custom
> > ICUTokenizer rbbi and can just split on "^"). So much flexibility
> >
> > -Mike
> >
> > On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <rcm...@gmail.com> wrote:
> >
> >> actually, you now can choose to ignore certain characters by using
> >> unicode filtering mechanism.
> >>
> >> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
> >>
> >> So apply a filter such as [^\^] and the filter will ignore ^.
> >>
> >> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcm...@gmail.com> wrote:
> >> > This cannot be "tweaked" at runtime, it is implemented as custom
> >> > normalization.
> >> >
> >> > You can modify the sources / build your own ruleset or use a different
> >> > tokenfilter to normalize characters.
> >> >
> >> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msoko...@gmail.com>
> >> > wrote:
> >> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly
> >> >> what I want. However there are some behaviors I'd like to tweak. For
> >> >> example it maps "aaa^bbb" to "aaabbb". I am trying to understand why
> >> >> it does that, and whether there is any way to prevent it.
> >> >>
> >> >> I spent a little time with
> >> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I
> >> >> guess is the basis for what this filter does (it's referenced in the
> >> >> javadocs), but that didn't answer my questions. As an aside, it seems
> >> >> this tech report was withdrawn by the Unicode Consortium? Not sure
> >> >> what that means if anything, but it seems ominous.
> >> >>
> >> >> Anyway, I would appreciate pointers to more info, and specifically,
> >> >> whether there are any alternatives to the utr30.nrm data file, or any
> >> >> possibility to select among the many transformations this filter
> >> >> applies.
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Mike S
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
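
For anyone reading this thread later, here is a rough, JDK-only sketch of what the filtering idea discussed above amounts to: text is split into spans of characters that are inside the filter set and characters that are outside it, only the inside spans are folded, and the outside characters (here "^") pass through untouched. Everything in it is an assumption made for illustration: the class FilteredFoldingSketch and its methods are hypothetical, the fold() rule (NFKD, strip marks and modifier symbols, lowercase) is a crude stand-in and not the utr30.nrm data, and the IntPredicate stands in for a UnicodeSet built from a pattern like [^\^]. As far as I can tell the real equivalent in ICU4J is wrapping the normalizer in a FilteredNormalizer2 with a frozen UnicodeSet, but check the factory source rather than trusting this.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.IntPredicate;
import java.util.regex.Pattern;

public class FilteredFoldingSketch {

    // Compiled once and reused for every call, in the spirit of the
    // "compiling a regexp" remark. Stand-in folding rule (an assumption,
    // NOT the utr30 data): drop combining marks (\p{M}) and modifier
    // symbols (\p{Sk}, the category that contains '^').
    private static final Pattern FOLDED_AWAY = Pattern.compile("[\\p{M}\\p{Sk}]");

    /** Stand-in for the folding normalizer: decompose, strip, lowercase. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return FOLDED_AWAY.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
    }

    /**
     * Folds only the code points accepted by inFilter. Code points rejected
     * by the filter are copied to the output unchanged, and each run of
     * accepted code points is folded as one span.
     */
    static String foldFiltered(String text, IntPredicate inFilter) {
        StringBuilder out = new StringBuilder();
        StringBuilder span = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (inFilter.test(cp)) {
                span.appendCodePoint(cp);          // still inside a foldable span
            } else {
                out.append(fold(span.toString())); // flush the span collected so far
                span.setLength(0);
                out.appendCodePoint(cp);           // outside the filter: keep as-is
            }
        }
        out.append(fold(span.toString()));         // flush the trailing span
        return out.toString();
    }

    public static void main(String[] args) {
        // Plays the role of the [^\^] filter: everything except '^'.
        IntPredicate allButCaret = cp -> cp != '^';
        System.out.println(fold("aaa^bbb"));                      // prints "aaabbb"
        System.out.println(foldFiltered("aaa^bbb", allButCaret)); // prints "aaa^bbb"
    }
}
```

The point of building the pattern (or, in the real API, freezing the set) once up front is that the per-token path then does no setup work, which matters when it runs for every token in an index.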