Re: multiterm numbers regexp search

Valentin Popov Mon, 15 Dec 2014 23:37:37 -0800

Thanks, will try. 
> On 15 Dec 2014, at 21:02, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> If you can't change the analyzer, you can programmatically build a 
> MultiPhraseQuery (you'd have to fill in the alternatives ... not a great 
> option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites 
> are taken care of for you).
> 
> You might also want to look into using the ComplexPhraseQueryParser:
> 
> "/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/"
> 
> Make sure to "or" that with the regex to capture the "phrase" without 
> spaces/hyphens: "5{1}<1-5>{1}<0-9>{14}"
> 
> I can't vouch for performance with the above options...
> 
> Whichever path you take, make sure that the MultiTermQuery.RewriteMethod 
> and/or maxBooleanClauses are set appropriately.
> 
> -----Original Message-----
> From: Valentin Popov [mailto:valentin...@gmail.com] 
> Sent: Monday, December 15, 2014 8:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: multiterm numbers regexp search
> 
> Mike, thanks. 
> 
> The problem is that we can't change the analyzer: the bank needs search for 
> compliance, not only for card numbers, and the existing store already holds 
> hundreds of millions of emails. My thinking is to build a multiterm regexp 
> query, or to combine several regexp queries with some distance between them. 
> The main idea is to search for the possible combinations of digits, since 
> they follow a rule: a Mastercard number starts with 5, the second digit must 
> be between 1 and 5, and the other 14 must be digits. 
> 
> Thanks 
> 
> 
>> On 15 Dec 2014, at 16:00, Michael Sokolov 
>> <msoko...@safaribooksonline.com> wrote:
>> 
>> You probably don't want to use StandardAnalyzer: maybe try 
>> WhitespaceAnalyzer, but you'll need to enhance your regex a little to deal 
>> with punctuation, since WhitespaceAnalyzer may give you tokens like:
>> 
>> 5106-7922-9469-8422.
>> 
>> "5106-7922-9469-8422"
>> 
>> etc
>> 
>> -Mike
>> 
>> On 12/15/14 3:45 AM, Valentin Popov wrote:
>>> I need to find Mastercard numbers with a regular expression.
>>> 
>>> I'm using Query query = new RegexpQuery(new Term("body", 
>>> "5{1}<1-5>{1}<0-9>{14}"), RegExp.ALL) to search for numbers in the email 
>>> body, and StandardAnalyzer is used to index the body. A number like 
>>> 5106792294698422 is indexed as is, so every Mastercard number written that 
>>> way appears in the search results, but a number like 5106 7922 9469 8422 
>>> is indexed as 4 tokens (5106, 7922, 9469, 8422), and similarly for 
>>> 5106-7922-9469-8422.
>>> 
>>> Any ideas on how to find a sequence of numbers separated by spaces, 
>>> dashes, etc.? Maybe a multiterm regexp search query?
>>> 
>>> 
>>> Regards,
>>> Valentin Popov
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> Regards,
> Valentin Popov
>


Regards,
Valentin Popov





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
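
The rule described in the thread (a leading 5, a second digit between 1 and 5, then 14 more digits, possibly broken up by spaces or hyphens) can be sketched against raw text as follows. This is only an illustration under stated assumptions: it uses java.util.regex rather than Lucene's RegExp syntax, it scans a string instead of querying an index, and the class name CardNumberSketch and the method findCandidates are hypothetical names chosen for the example. Something like it might serve as a post-filter on email bodies returned by one of the index queries suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
public class CardNumberSketch {
    // A 5, then a digit 1-5, then 14 more digits. At most one space or
    // hyphen may sit between two neighbouring digits. The lookbehind and
    // lookahead reject matches embedded in a longer run of digits.
    private static final Pattern CANDIDATE =
        Pattern.compile("(?<![0-9])5[ -]?[1-5](?:[ -]?[0-9]){14}(?![0-9])");

    /** Returns every candidate found in text, reduced to its 16 digits. */
    public static List<String> findCandidates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Drop the separators so all spellings compare equal.
            hits.add(m.group().replaceAll("[ -]", ""));
        }
        return hits;
    }

    public static void main(String[] args) {
        String body = "Cards: 5106792294698422, 5106 7922 9469 8422 "
                    + "and \"5106-7922-9469-8422\".";
        System.out.println(findCandidates(body));
    }
}
```

Run on the three spellings from the thread, main prints the same 16-digit string three times. The pattern only checks the digit layout; whether a candidate is a real card number is outside the scope of this sketch.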
