Re: Looking For Tokenizer With Custom Delimeter

Armins Stepanjans Mon, 08 Jan 2018 05:10:05 -0800

Thanks for the solution. However, I am unable to access the CharTokenizer
class when I import it using:


import org.apache.lucene.analysis.util.*;

I am, however, able to access classes directly under analysis (or
analysis.standard) just fine with the import statement:
import org.apache.lucene.analysis.*;

Does this look like a Lucene-specific problem?

P.S. I'm using Maven to manage my dependencies, with the following two
dependency declarations for Lucene:

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>7.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>7.1.0</version>
        </dependency>

Regards,
Armīns

On Mon, Jan 8, 2018 at 12:53 PM, Uwe Schindler <[email protected]> wrote:

> Moin,
>
> Plain easy to customize with lambdas! E.g., an elegant way to create a
> tokenizer which behaves exactly like WhitespaceTokenizer followed by
> LowerCaseFilter is:
>
> Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
>     Character::isWhitespace, Character::toLowerCase);
>
> Adjust the lambdas and you can create any tokenizer based on any
> character check. For example, to split on whitespace or underscore:
>
> Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch ->
>     Character.isWhitespace(ch) || ch == '_');
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> > -----Original Message-----
> > From: Armins Stepanjans [mailto:[email protected]]
> > Sent: Monday, January 8, 2018 11:30 AM
> > To: [email protected]
> > Subject: Looking For Tokenizer With Custom Delimeter
> >
> > Hi,
> >
> > I am looking for a tokenizer, where I could specify a delimiter by which
> > the words are tokenized, for example if I choose the delimiters as ' ' and
> > '_' the following string:
> > "foo__bar doo"
> > would be tokenized into:
> > "foo", "", "bar", "doo"
> > (The analyzer could further filter empty tokens, since having the empty
> > string token is not critical).
> >
> > Is such functionality built into Lucene (I'm working with 7.1.0) and does
> > this seem like the correct approach to the problem?
> >
> > Regards,
> > Armīns

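For reference, the effect of a separator predicate like the one in the quoted
reply can be illustrated without Lucene on the classpath. The following is a
minimal plain-Java sketch, not Lucene code: the class SeparatorSplitDemo and
its split method are hypothetical names made up for this illustration. It
assumes, as CharTokenizer does, that runs of separator characters never
produce empty tokens, so "foo__bar doo" yields three tokens rather than the
four listed in the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-alone demo; it only mimics splitting on a separator predicate.
public class SeparatorSplitDemo {

    // Splits text at every code point accepted by isSeparator and drops empty tokens.
    public static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSeparator.test(cp)) {
                // A separator ends the current token, if one has been started.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        // Flush the trailing token when the text does not end with a separator.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same check as in the quoted reply: whitespace or underscore.
        IntPredicate whitespaceOrUnderscore =
                ch -> Character.isWhitespace(ch) || ch == '_';
        System.out.println(split("foo__bar doo", whitespaceOrUnderscore));
        // prints [foo, bar, doo]
    }
}
```

The predicate is the only part that changes from one delimiter set to another,
which is the same design the lambda-based factory method in the quoted reply
relies on.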