Thanks, will try.

> On Dec 15, 2014, at 21:02, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> If you can't change the analyzer, you can programmatically build a MultiPhraseQuery (you'd have to fill in the alternatives ... not a great option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites are taken care of for you).
>
> You might also want to look into using the ComplexPhraseQueryParser:
>
> "/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/"
>
> Make sure to "or" that with the regex to capture the "phrase" without spaces/hyphens: "5{1}<1-5>{1}<0-9>{14}"
>
> I can't vouch for performance with the above options...
>
> Whichever path you take, make sure that the MultiTermQuery.RewriteMethod and/or maxBooleanClauses are set appropriately.
>
> -----Original Message-----
> From: Valentin Popov [mailto:valentin...@gmail.com]
> Sent: Monday, December 15, 2014 8:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: multiterm numbers regexp search
>
> Mike, thanks.
>
> The problem is that we can't change the analyzer: the bank needs to search for more than card numbers for compliance, and the existing storage already holds hundreds of millions of emails. My thinking is to build a multiterm regexp search query, or to search for a combination of regexp queries with some distance between them. The main idea is to search for possible combinations of digits, since they follow a rule: a MasterCard number starts with five, the second digit must be between 1 and 5, and the other 14 characters must be digits.
>
> Thanks
>
>> On Dec 15, 2014, at 16:00, Michael Sokolov <msoko...@safaribooksonline.com> wrote:
>>
>> You probably don't want to use StandardAnalyzer: maybe try WhitespaceAnalyzer, but you'll need to enhance your regex a little to deal with punctuation since WA may give you tokens like:
>>
>> 5106-7922-9469-8422.
>>
>> "5106-7922-9469-8422"
>>
>> etc
>>
>> -Mike
>>
>> On 12/15/14 3:45 AM, Valentin Popov wrote:
>>> I need to find MasterCard numbers with a regular expression.
>>>
>>> I'm using Query query = new RegexpQuery(new Term("body", "5{1}<1-5>{1}<0-9>{14}"), RegExp.ALL) to search for numbers in the email body, and StandardAnalyzer is used for indexing the body. So a number like 5106792294698422 will be indexed as it is, and all such MasterCard numbers will be in the search results, but a number like 5106 7922 9469 8422 will be indexed as 4 tokens (5106, 7922, 9469, 8422), and similarly for 5106-7922-9469-8422.
>>>
>>> Any ideas how to find the sequence of numbers with spaces, dashes etc.? Maybe a multiterm regexp search query?
>>>
>>> Regards,
>>> Valentin Popov
>
> Regards,
> Valentin Popov
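
For reference, the per-token matching discussed in the quoted messages can be sketched in plain Java. This is only an illustration of the logic (one token of 16 digits, or four adjacent 4-digit tokens where the first starts with 5 followed by 1-5), not Lucene code: it uses java.util.regex syntax ([1-5]) rather than Lucene's RegExp interval syntax (<1-5>), and the class name CardTokenSketch, the method names, and the crude tokenize() stand-in for an analyzer are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: mimics "four adjacent tokens, in order"
// matching on an already tokenized text.
final class CardTokenSketch {
    // First 4-digit group: 5, then a digit 1-5, then two more digits.
    private static final Pattern FIRST_GROUP = Pattern.compile("5[1-5][0-9]{2}");
    // Each of the remaining three groups: any four digits.
    private static final Pattern NEXT_GROUP = Pattern.compile("[0-9]{4}");
    // The same rule when the number was indexed as a single token.
    private static final Pattern UNBROKEN = Pattern.compile("5[1-5][0-9]{14}");

    // Assumption: a crude stand-in for an analyzer that splits on
    // anything that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("[^0-9A-Za-z]+"));
    }

    // True if one token is an unbroken 16-digit match, or if four
    // consecutive tokens match the grouped form (no gaps, in order).
    static boolean containsCardNumber(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (UNBROKEN.matcher(tokens.get(i)).matches()) {
                return true;
            }
            if (i + 3 < tokens.size()
                    && FIRST_GROUP.matcher(tokens.get(i)).matches()
                    && NEXT_GROUP.matcher(tokens.get(i + 1)).matches()
                    && NEXT_GROUP.matcher(tokens.get(i + 2)).matches()
                    && NEXT_GROUP.matcher(tokens.get(i + 3)).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "5106792294698422",
            "5106 7922 9469 8422",
            "5106-7922-9469-8422",
            "5606 7922 9469 8422"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + containsCardNumber(tokenize(s)));
        }
    }
}
```

A check like this could also serve as a post-filter on the hits returned by whichever query is used, since a digit-pattern match alone does not prove the text is a card number.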

Regards,
Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org