Quite easy to customize with lambdas! E.g., an elegant way to create a
tokenizer that behaves exactly like WhitespaceTokenizer plus LowerCaseFilter is:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
    Character::isWhitespace, Character::toLowerCase);

Adjust the lambdas and you can create a tokenizer based on any character
check; e.g., to split on whitespace or underscore:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch ->
    Character.isWhitespace(ch) || ch == '_');
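To see what such a separator predicate does to the input from the question
below, here is a minimal plain-Java sketch (no Lucene dependency) that splits
a string at every character matching the predicate. Note one difference: this
sketch keeps zero-length tokens, whereas Lucene's CharTokenizer only emits
non-empty tokens, so in Lucene the "" between the two underscores would be
dropped automatically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PredicateSplit {
    // Split text at every char matching the separator predicate.
    // Unlike Lucene's CharTokenizer, zero-length tokens are kept here
    // so the effect of adjacent separators is visible.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isSeparator.test(ch)) {
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        IntPredicate sep = ch -> Character.isWhitespace(ch) || ch == '_';
        // "foo__bar doo" -> [foo, , bar, doo]
        System.out.println(split("foo__bar doo", sep));
    }
}
```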


Uwe Schindler
Achterdiek 19, D-28357 Bremen

> -----Original Message-----
> From: Armins Stepanjans []
> Sent: Monday, January 8, 2018 11:30 AM
> To:
> Subject: Looking For Tokenizer With Custom Delimeter
> Hi,
> I am looking for a tokenizer, where I could specify a delimiter by which
> the words are tokenized, for example if I choose the delimiters as ' ' and
> '_' the following string:
> "foo__bar doo"
> would be tokenized into:
> "foo", "", "bar", "doo"
> (The analyzer could further filter empty tokens, since having the empty
> string token is not critical).
> Is such functionality built into Lucene (I'm working with 7.1.0) and does
> this seem like the correct approach to the problem?
> Regards,
> Armīns
