[jira] [Updated] (LUCENE-6879) Allow to define custom CharTokenizer using Java 8 Lambdas/Method references

Uwe Schindler (JIRA) Tue, 03 Nov 2015 00:59:46 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Uwe Schindler updated LUCENE-6879:
----------------------------------
    Description: 
As a followup to LUCENE-6874, I thought about how to create custom 
CharTokenizers without subclassing. I have needed this quite often and was a 
bit annoyed that you had to create a subclass every time.

This issue uses the same pattern as ThreadLocal or many collection methods in 
Java 8: you have the (abstract) base class and define a factory method 
named {{fromXxxPredicate}} (like {{ThreadLocal.withInitial(() -> value)}}).

{code:java}
public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate)
{code}

This would allow defining a new CharTokenizer in a single statement using any 
predicate:

{code:java}
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));

// method reference for separator char predicate + normalization by uppercasing:
Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);

// method reference to custom function:
private boolean myTestFunction(int c) {
  return (crazy condition);
}
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
{code}
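
The factory idea can be illustrated without Lucene. What follows is a minimal sketch under stated assumptions, not the attached patch: {{SketchCharTokenizer}}, its {{tokenize(String)}} method and the demo class are hypothetical names invented for illustration, and the real CharTokenizer works on a Reader and token attributes rather than on a String. The sketch only shows how a static factory can wrap an {{IntPredicate}} in an anonymous subclass so that callers never subclass themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for CharTokenizer; this is NOT Lucene's class. It only
// mirrors the "abstract base class + static factory" idea from the description.
abstract class SketchCharTokenizer {

  /** Decides whether the code point c belongs to a token. */
  protected abstract boolean isTokenChar(int c);

  /** Factory: wraps the predicate in an anonymous subclass. */
  static SketchCharTokenizer fromTokenCharPredicate(IntPredicate tokenCharPredicate) {
    return new SketchCharTokenizer() {
      @Override
      protected boolean isTokenChar(int c) {
        return tokenCharPredicate.test(c);
      }
    };
  }

  /** Splits the input into maximal runs of token characters (assumed behavior). */
  List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < input.length(); ) {
      int c = input.codePointAt(i);
      i += Character.charCount(c); // step over surrogate pairs correctly
      if (isTokenChar(c)) {
        current.appendCodePoint(c);
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }
}

class SketchCharTokenizerDemo {
  public static void main(String[] args) {
    // Method reference resolves to Character.isLetter(int).
    SketchCharTokenizer letters =
        SketchCharTokenizer.fromTokenCharPredicate(Character::isLetter);
    System.out.println(letters.tokenize("fast as hell, 2015")); // [fast, as, hell]
  }
}
```

The anonymous subclass keeps the abstract {{isTokenChar(int)}} hook intact, so existing subclasses would continue to work alongside the factory.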

I know this would not help Solr users who want to define the Tokenizer in a 
config file, but for direct Lucene users this Java 8 style would be easy and 
elegant to use. It should also be fast, as the predicate is just a reference 
to a method, which the Java 8 runtime optimizes well.

The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow 
quick definition with method references instead of lambdas. In many cases, 
like WhitespaceTokenizer, the available predicates test for separator chars 
({{isWhitespace(int)}}), so with this second set of factories you can define 
the tokenizer without the counter-intuitive negation. Internally it just uses 
{{IntPredicate#negate()}}.
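
As a rough illustration of the negation step, the sketch below derives a token-char test from a separator test with {{IntPredicate.negate()}}. The class and helper names ({{SeparatorPredicateSketch}}, {{tokenCharsFromSeparators}}) are hypothetical and not part of the patch.

```java
import java.util.function.IntPredicate;

class SeparatorPredicateSketch {

  // Hypothetical helper: turns an "is this a separator?" test into an
  // "is this part of a token?" test by negating it.
  static IntPredicate tokenCharsFromSeparators(IntPredicate separatorCharPredicate) {
    return separatorCharPredicate.negate();
  }

  public static void main(String[] args) {
    // Method reference resolves to Character.isWhitespace(int); no lambda needed.
    IntPredicate isTokenChar = tokenCharsFromSeparators(Character::isWhitespace);
    System.out.println(isTokenChar.test('a')); // true: a letter is a token char
    System.out.println(isTokenChar.test(' ')); // false: a space is a separator
  }
}
```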

The factories also accept a normalization function; e.g., to lowercase, you 
may just pass {{Character::toLowerCase}} as the {{IntUnaryOperator}} reference.
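
To show what passing a method reference as {{IntUnaryOperator}} amounts to, here is a small sketch that applies a normalizer to each code point of a token. {{NormalizerSketch}} and {{normalize}} are hypothetical names; how the patch applies the normalizer internally is not shown in this description.

```java
import java.util.function.IntUnaryOperator;

class NormalizerSketch {

  // Hypothetical helper: applies the normalizer to every code point of a token.
  static String normalize(String token, IntUnaryOperator normalizer) {
    StringBuilder out = new StringBuilder(token.length());
    token.codePoints().map(normalizer).forEach(out::appendCodePoint);
    return out.toString();
  }

  public static void main(String[] args) {
    // Both references resolve to the int overloads of Character.
    System.out.println(normalize("Lucene", Character::toLowerCase)); // lucene
    System.out.println(normalize("Lucene", Character::toUpperCase)); // LUCENE
  }
}
```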

  was:
As a followup from LUCENE-6874, I thought about how to generate custom 
CharTokenizers wthout subclassing. I had this quite often and I was a bit 
annoyed, that you had to create a subclass every time.

This issue is using the pattern like ThreadLocal or many collection methods in 
Java 8: You have the (abstract) base class and you define a factory method 
named {{fromXxxPredicate}} (like {{ThreadLocal.fromInitial(() -> value}}).

{code:java}
public static CharTokenizer fromPredicate(java.util.function.IntPredicate 
predicate)
{code}

This would allow to define a new CharTokenizer with a single line statement 
using any predicate:

{code:java}
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> 
!UCharacter.isUWhiteSpace(c));

// method reference for separator char predicate + normalization by uppercasing:
Tokenizer tok = 
CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, 
Character::toUpperCase);

// method reference to custom function:
private boolean myTestFunction(int c) {
 return (cracy condition);
}
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
{code}

I know this would not help Solr users that want to define the Tokenizer in a 
config file, but for real Lucene users the Java 8-like way would be the 
following static method on CharTokenizer without subclassing. It is fast as 
hell, as it is just a reference to a method and Java 8 is optimized for that.

The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow 
quick definition without lambdas using method references. In lots of cases, 
like WhitespaceTokenizer, predicates are on the separator chars 
({{isWhitespace(int)}}, so using the 2nd set of factories you can define them 
without the counter-intuitive negation. Internally it just uses 
{{Predicate#negate()}}.

The factories also allow to give the normalization function, e.g. to Lowercase, 
you may just give {{Character::toLowerCase}} as {{IntUnaryOperator}} reference.


> Allow to define custom CharTokenizer using Java 8 Lambdas/Method references
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-6879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6879
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: Trunk
>            Reporter: Uwe Schindler
>             Fix For: Trunk
>
>         Attachments: LUCENE-6879.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
