[
https://issues.apache.org/jira/browse/LUCENE-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990718#comment-14990718
]
Uwe Schindler commented on LUCENE-6879:
---------------------------------------
Just FYI: I ran a quick microbenchmark like this:
{code:java}
// init & warmup
String text = "Tokenizer(Test)FooBar";
String[] result = new String[] { "tokenizer", "test", "foobar" };
final Tokenizer tokenizer1 =
    CharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);
for (int i = 0; i < 10000; i++) {
  tokenizer1.setReader(new StringReader(text));
  assertTokenStreamContents(tokenizer1, result);
}
final Tokenizer tokenizer2 = new LowerCaseTokenizer();
for (int i = 0; i < 10000; i++) {
  tokenizer2.setReader(new StringReader(text));
  assertTokenStreamContents(tokenizer2, result);
}
// speed test
long[] lens1 = new long[100], lens2 = new long[100];
for (int j = 0; j < 100; j++) {
  System.out.println("Run: " + j);
  long start1 = System.currentTimeMillis();
  for (int i = 0; i < 1000000; i++) {
    tokenizer1.setReader(new StringReader(text));
    assertTokenStreamContents(tokenizer1, result);
  }
  lens1[j] = System.currentTimeMillis() - start1;
  long start2 = System.currentTimeMillis();
  for (int i = 0; i < 1000000; i++) {
    tokenizer2.setReader(new StringReader(text));
    assertTokenStreamContents(tokenizer2, result);
  }
  lens2[j] = System.currentTimeMillis() - start2;
}
System.out.println("Time Lambda: " + Arrays.stream(lens1).summaryStatistics());
System.out.println("Time Old: " + Arrays.stream(lens2).summaryStatistics());
{code}
I was not able to find any speed difference after warmup (times in milliseconds per run):
- Time Lambda: LongSummaryStatistics{count=100, sum=58267, min=562, average=582.670000, max=871}
- Time Old: LongSummaryStatistics{count=100, sum=61489, min=600, average=614.890000, max=721}
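
For anyone who wants to try the same kind of comparison without a Lucene checkout, here is a rough, self-contained sketch of the timing pattern used above. It is not the benchmark I ran: PredicateTimingSketch, Matcher and LetterMatcher are made-up names, the "work" is only counting letters, and it uses System.nanoTime() instead of currentTimeMillis(). A hand-rolled loop like this is only a sanity check; a harness such as JMH would be the more reliable tool.

```java
import java.util.LongSummaryStatistics;
import java.util.function.IntPredicate;

// Hypothetical sketch: compares a method-reference predicate with a subclass override.
final class PredicateTimingSketch {
    // "Old" style: behaviour is fixed by subclassing.
    abstract static class Matcher {
        abstract boolean matches(int c);
    }

    static final class LetterMatcher extends Matcher {
        @Override
        boolean matches(int c) {
            return Character.isLetter(c);
        }
    }

    // Written to a volatile field so the JIT cannot drop the measured loops as dead code.
    static volatile long sink;

    static int countWithPredicate(IntPredicate p, String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (p.test(text.charAt(i))) n++;
        }
        return n;
    }

    static int countWithSubclass(Matcher m, String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (m.matches(text.charAt(i))) n++;
        }
        return n;
    }

    // Returns { statistics for the method reference, statistics for the subclass }, in nanoseconds.
    static LongSummaryStatistics[] run(int runs, int iterations) {
        String text = "Tokenizer(Test)FooBar";
        IntPredicate lambda = Character::isLetter;
        Matcher old = new LetterMatcher();
        long local = 0;

        // warmup, so both code paths get compiled before timing starts
        for (int i = 0; i < 10_000; i++) {
            local += countWithPredicate(lambda, text) + countWithSubclass(old, text);
        }

        LongSummaryStatistics lambdaStats = new LongSummaryStatistics();
        LongSummaryStatistics oldStats = new LongSummaryStatistics();
        for (int j = 0; j < runs; j++) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) local += countWithPredicate(lambda, text);
            lambdaStats.accept(System.nanoTime() - start);

            start = System.nanoTime();
            for (int i = 0; i < iterations; i++) local += countWithSubclass(old, text);
            oldStats.accept(System.nanoTime() - start);
        }
        sink = local;
        return new LongSummaryStatistics[] { lambdaStats, oldStats };
    }

    public static void main(String[] args) {
        LongSummaryStatistics[] stats = run(100, 1_000_000);
        System.out.println("Time Lambda (ns): " + stats[0]);
        System.out.println("Time Old (ns): " + stats[1]);
    }
}
```

Interleaving the two variants inside each run, as in the original loop, keeps both exposed to the same JIT and GC conditions.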
> Allow to define custom CharTokenizer using Java 8 Lambdas/Method references
> ---------------------------------------------------------------------------
>
> Key: LUCENE-6879
> URL: https://issues.apache.org/jira/browse/LUCENE-6879
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: Trunk
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Labels: Java8
> Fix For: Trunk
>
> Attachments: LUCENE-6879.patch, LUCENE-6879.patch
>
>
> As a followup to LUCENE-6874, I thought about how to create custom
> CharTokenizers without subclassing. I needed this quite often and was a bit
> annoyed that you had to create a subclass every time.
> This issue uses the same pattern as ThreadLocal and many collection methods
> in Java 8: you have the (abstract) base class and you define a factory method
> named {{fromXxxPredicate}} (like {{ThreadLocal.withInitial(() -> value)}}).
> {code:java}
> public static CharTokenizer
> fromTokenCharPredicate(java.util.function.IntPredicate predicate)
> {code}
> This would allow defining a new CharTokenizer in a single-line statement
> using any predicate:
> {code:java}
> // long variant with lambda:
> Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c ->
>     !UCharacter.isUWhiteSpace(c));
> // method reference for separator char predicate + normalization by uppercasing:
> Tokenizer tok =
>     CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace,
>         Character::toUpperCase);
> // method reference to custom function:
> private boolean myTestFunction(int c) {
>   return (crazy condition);
> }
> Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
> {code}
> I know this would not help Solr users who want to define the Tokenizer in a
> config file, but for real Lucene users this Java 8-like way would be easy and
> elegant to use. It is also fast, as it is just a reference to a method and
> Java 8 is optimized for that.
> The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow
> quick definition without lambdas, using method references. In lots of cases,
> like WhitespaceTokenizer, predicates are on the separator chars
> ({{isWhitespace(int)}}), so using the second set of factories you can define them
> without the counter-intuitive negation. Internally it just uses
> {{IntPredicate#negate()}}.
> The factories also accept a normalization function; e.g., to
> lowercase, you can just pass {{Character::toLowerCase}} as an
> {{IntUnaryOperator}} reference.
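
To illustrate the factory pattern described in the issue without depending on Lucene, here is a minimal sketch. SketchCharTokenizer and its tokenize() helper are hypothetical stand-ins and not the code from the attached patch; the real CharTokenizer works on a Reader and token attributes. The sketch only shows how the token-char factory, the inverted separator-char factory (via IntPredicate#negate()) and the optional normalizer could fit together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Hypothetical stand-in for an abstract char-based tokenizer; not the Lucene class.
abstract class SketchCharTokenizer {
    /** Returns true if the code point belongs inside a token. */
    protected abstract boolean isTokenChar(int c);

    /** Normalizes a code point before it is appended to a token; identity by default. */
    protected int normalize(int c) {
        return c;
    }

    /** Factory: the predicate selects the characters that make up tokens. */
    static SketchCharTokenizer fromTokenCharPredicate(IntPredicate tokenChar) {
        return fromTokenCharPredicate(tokenChar, IntUnaryOperator.identity());
    }

    /** Factory with a normalization function, e.g. Character::toLowerCase. */
    static SketchCharTokenizer fromTokenCharPredicate(IntPredicate tokenChar,
                                                      IntUnaryOperator normalizer) {
        return new SketchCharTokenizer() {
            @Override
            protected boolean isTokenChar(int c) {
                return tokenChar.test(c);
            }

            @Override
            protected int normalize(int c) {
                return normalizer.applyAsInt(c);
            }
        };
    }

    /** Inverted factory: the predicate selects separators, so it is simply negated. */
    static SketchCharTokenizer fromSeparatorCharPredicate(IntPredicate separatorChar) {
        return fromTokenCharPredicate(separatorChar.negate());
    }

    /** Inverted factory with a normalization function. */
    static SketchCharTokenizer fromSeparatorCharPredicate(IntPredicate separatorChar,
                                                          IntUnaryOperator normalizer) {
        return fromTokenCharPredicate(separatorChar.negate(), normalizer);
    }

    /** Splits the text into runs of token characters, normalizing each code point. */
    final List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, SketchCharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase).tokenize("Tokenizer(Test)FooBar") splits the benchmark string into "tokenizer", "test" and "foobar", and fromSeparatorCharPredicate(Character::isWhitespace) gives a whitespace splitter without writing the negation by hand.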