[jira] [Commented] (LUCENE-6879) Allow to define custom CharTokenizer using Java 8 Lambdas/Method references

Uwe Schindler (JIRA) Mon, 02 Nov 2015 15:04:38 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986222#comment-14986222
 ]


Uwe Schindler commented on LUCENE-6879:
---------------------------------------

We can improve the Javadocs by adding the examples. I just wanted to quickly
write the patch to demonstrate what it could look like. We can also discuss
the method names. The pattern follows the method naming convention used for
all functional interfaces in Java 8, but we can make it more readable. I am
open to suggestions.

In Lucene trunk we can also remove all the separate implementations like 
LetterTokenizer and just allow them to be produced by factories. This would be 
a slight break, but we could still provide the Solr/CustomAnalyzer factories as 
usual. The Tokenizer for ICU in LUCENE-6874 could also be a one-liner just 
provided by the Solr factory, but no actual instance :-)

We could also provide a one-for-all Solr/CustomAnalyzer factory using an enum
of predicate/normalizer functions to be chosen by a string parameter.
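
A rough sketch of how such an enum lookup could work, using only the JDK. The
names NamedCharPredicate, NamedNormalizer and forName are made up for
illustration and are not part of the patch or of any existing Lucene/Solr API;
a real factory would read the names from its args map and pass the results to
the CharTokenizer factory methods.

```java
import java.util.Locale;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Hypothetical: token/separator char predicates selectable by name. */
enum NamedCharPredicate implements IntPredicate {
  LETTER(Character::isLetter),
  DIGIT(Character::isDigit),
  WHITESPACE(Character::isWhitespace);

  private final IntPredicate delegate;

  NamedCharPredicate(IntPredicate delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean test(int codePoint) {
    return delegate.test(codePoint);
  }

  /** Case-insensitive lookup; throws IllegalArgumentException for unknown names. */
  static NamedCharPredicate forName(String name) {
    return valueOf(name.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    IntPredicate p = forName("whitespace");
    IntUnaryOperator n = NamedNormalizer.forName("lowercase");
    System.out.println(p.test(' ') + " " + (char) n.applyAsInt('A')); // prints "true a"
  }
}

/** Hypothetical: normalizer functions selectable by name. */
enum NamedNormalizer implements IntUnaryOperator {
  NONE(IntUnaryOperator.identity()),
  LOWERCASE(Character::toLowerCase),
  UPPERCASE(Character::toUpperCase);

  private final IntUnaryOperator delegate;

  NamedNormalizer(IntUnaryOperator delegate) {
    this.delegate = delegate;
  }

  @Override
  public int applyAsInt(int codePoint) {
    return delegate.applyAsInt(codePoint);
  }

  /** Case-insensitive lookup; throws IllegalArgumentException for unknown names. */
  static NamedNormalizer forName(String name) {
    return valueOf(name.trim().toUpperCase(Locale.ROOT));
  }
}
```

Because both enums implement the functional interfaces directly, a looked-up
constant could be handed to any method that takes an IntPredicate or
IntUnaryOperator.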

> Allow to define custom CharTokenizer using Java 8 Lambdas/Method references
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-6879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6879
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: Trunk
>            Reporter: Uwe Schindler
>             Fix For: Trunk
>
>         Attachments: LUCENE-6879.patch
>
>
> As a followup to LUCENE-6874, I thought about how to create custom
> CharTokenizers without subclassing. I have needed this quite often and was a
> bit annoyed that you had to create a subclass every time.
> This issue uses the same pattern as ThreadLocal or many collection methods
> in Java 8: You have the (abstract) base class and you define a factory method
> named {{fromXxxPredicate}} (like {{ThreadLocal.withInitial(() -> value)}}).
> {code:java}
> public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate)
> {code}
> This would allow you to define a new CharTokenizer with a single-line
> statement using any predicate:
> {code:java}
> // long variant with lambda:
> Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));
> // method reference for separator char predicate + normalization by uppercasing:
> Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);
> // method reference to custom function:
> private boolean myTestFunction(int c) {
>   return (crazy condition);
> }
> Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
> {code}
> I know this would not help Solr users who want to define the Tokenizer in a
> config file, but for real Lucene users the Java 8-like way would be a static
> factory method on CharTokenizer, with no subclassing. It is fast as hell, as
> it is just a reference to a method and Java 8 is optimized for that.
> The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow
> quick definition without lambdas using method references. In lots of cases,
> like WhitespaceTokenizer, predicates are on the separator chars
> ({{isWhitespace(int)}}), so using the second set of factories you can define
> them without the counter-intuitive negation. Internally it just uses
> {{IntPredicate#negate()}}.
> The factories also accept a normalization function, e.g. to lowercase, you
> may just pass {{Character::toLowerCase}} as an {{IntUnaryOperator}}
> reference.
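
To illustrate the factory pattern described in the issue, here is a minimal
JDK-only sketch. PredicateSplitter is a hypothetical stand-in that splits a
plain String; it is not Lucene's CharTokenizer and not the code from the
attached patch. It only assumes the shape described above: a token-char
factory, an inverted separator-char factory built on IntPredicate#negate(),
and an optional IntUnaryOperator normalizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Hypothetical stand-in for the proposed factories; not Lucene's CharTokenizer. */
final class PredicateSplitter {
  private final IntPredicate isTokenChar;
  private final IntUnaryOperator normalizer;

  private PredicateSplitter(IntPredicate isTokenChar, IntUnaryOperator normalizer) {
    this.isTokenChar = isTokenChar;
    this.normalizer = normalizer;
  }

  /** Token chars are those matching the predicate; no normalization. */
  static PredicateSplitter fromTokenCharPredicate(IntPredicate tokenChar) {
    return fromTokenCharPredicate(tokenChar, IntUnaryOperator.identity());
  }

  static PredicateSplitter fromTokenCharPredicate(IntPredicate tokenChar,
                                                  IntUnaryOperator normalizer) {
    return new PredicateSplitter(tokenChar, normalizer);
  }

  /** Inverted variant: the predicate describes separators, so it is negated. */
  static PredicateSplitter fromSeparatorCharPredicate(IntPredicate separatorChar,
                                                      IntUnaryOperator normalizer) {
    return fromTokenCharPredicate(separatorChar.negate(), normalizer);
  }

  /** Splits text into runs of token code points, normalizing each code point. */
  List<String> split(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);
      if (isTokenChar.test(cp)) {
        current.appendCodePoint(normalizer.applyAsInt(cp));
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    PredicateSplitter letters =
        fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);
    System.out.println(letters.split("Hello, Lucene World!")); // [hello, lucene, world]

    PredicateSplitter ws =
        fromSeparatorCharPredicate(Character::isWhitespace, Character::toUpperCase);
    System.out.println(ws.split("fast as  lambdas")); // [FAST, AS, LAMBDAS]
  }
}
```

The separator variant adds no logic of its own; it only negates the predicate
and delegates, which is why method references such as Character::isWhitespace
can be used without writing a lambda.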



