Uwe Schindler created LUCENE-6879:
-------------------------------------
Summary: Allow to define custom CharTokenizer using Java 8
Lambdas/Method references
Key: LUCENE-6879
URL: https://issues.apache.org/jira/browse/LUCENE-6879
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Affects Versions: Trunk
Reporter: Uwe Schindler
Fix For: Trunk
As a followup to LUCENE-6874, I thought about how to create custom
CharTokenizers without subclassing. I have needed this quite often, and I was a
bit annoyed that you had to create a subclass every time.
This issue uses the same pattern as ThreadLocal or many collection methods in
Java 8: you have the (abstract) base class and define a factory method
named {{fromXxxPredicate}} (like {{ThreadLocal.withInitial(() -> value)}}).
{code:java}
public static CharTokenizer fromPredicate(java.util.function.IntPredicate predicate)
{code}
This would allow defining a new CharTokenizer in a single statement
using any predicate:
{code:java}
// UCharacter is com.ibm.icu.lang.UCharacter (ICU4J)
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));
// method reference for separator char predicate + normalization by uppercasing:
Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);
// method reference to custom function:
private boolean myTestFunction(int c) {
  return (crazy condition); // placeholder for any custom test on c
}
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
{code}
I know this would not help Solr users who want to define the Tokenizer in a
config file, but for real Lucene users the Java 8 way would be the static
factory methods on CharTokenizer shown above, with no subclassing. It is very
fast, as the predicate is just a reference to a method and Java 8 is optimized
for that.
The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow
quick definition without lambdas, using method references. In lots of cases,
like WhitespaceTokenizer, the available predicates are on the separator chars
({{isWhitespace(int)}}), so with the second set of factories you can define them
without the counter-intuitive negation. Internally it just uses
{{IntPredicate#negate()}}.
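To illustrate the negation step, here is a minimal standalone sketch using only the JDK. The class {{SeparatorNegationSketch}} and the helper {{tokenCharsFrom}} are hypothetical names chosen for this example and are not part of Lucene; the sketch only shows that a separator predicate passed as a method reference and negated behaves the same as the hand-written negating lambda.

```java
import java.util.function.IntPredicate;

final class SeparatorNegationSketch {
    // Hypothetical helper (not a Lucene method): derives the token-char
    // predicate from a separator predicate via IntPredicate#negate().
    static IntPredicate tokenCharsFrom(IntPredicate separatorChars) {
        return separatorChars.negate();
    }

    public static void main(String[] args) {
        // Hand-written negation, as in the lambda variant.
        IntPredicate viaLambda = c -> !Character.isWhitespace(c);
        // Same predicate derived from a plain method reference.
        IntPredicate viaNegate = tokenCharsFrom(Character::isWhitespace);

        // Compare both predicates on a few sample characters.
        for (int c : new int[] {'a', ' ', '\t', '7'}) {
            System.out.println(viaLambda.test(c) == viaNegate.test(c));
        }
    }
}
```

The two predicates agree on every sampled character, so the caller can pass the separator test directly and leave the negation to the factory.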
The factories also accept a normalization function. For example, to lowercase,
you can simply pass {{Character::toLowerCase}} as the {{IntUnaryOperator}} reference.
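The combination of a token-char predicate and a normalization function can be sketched without Lucene as follows. This is an illustration under stated assumptions, not the proposed patch: {{SketchCharTokenizer}} and its {{tokenize(String)}} method are hypothetical, and the real CharTokenizer works on a character stream rather than a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Standalone illustration only; this is not Lucene's CharTokenizer. */
abstract class SketchCharTokenizer {
    protected abstract boolean isTokenChar(int c);

    protected int normalize(int c) {
        return c; // identity by default
    }

    // Hypothetical factory: builds a tokenizer from two function references
    // instead of requiring a named subclass.
    static SketchCharTokenizer fromTokenCharPredicate(IntPredicate tokenChars,
                                                      IntUnaryOperator normalizer) {
        return new SketchCharTokenizer() {
            @Override protected boolean isTokenChar(int c) { return tokenChars.test(c); }
            @Override protected int normalize(int c) { return normalizer.applyAsInt(c); }
        };
    }

    // Splits the text into runs of token chars, normalizing each code point.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        SketchCharTokenizer tok = fromTokenCharPredicate(
            c -> !Character.isWhitespace(c), Character::toLowerCase);
        System.out.println(tok.tokenize("Hello  Lucene World"));
    }
}
```

Passing {{Character::toLowerCase}} works here because only the {{toLowerCase(int)}} overload matches the {{IntUnaryOperator}} shape.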