[ 
https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174026#comment-13174026
 ] 

Robert Muir commented on LUCENE-3663:
-------------------------------------

I actually think we should not remove tokens that aren't phone numbers.
Sometimes there may simply be other things in the text instead of phone
numbers, or the phone number detection/normalization may be imperfect, so
it's better not to throw tokens away. Instead, no normalization happens in
that case, just as with a stemmer.

In general we can also assume the text is unstructured and might contain
other stuff. (This implies someone has a super-cool tokenizer that doesn't
split up dirty phone numbers, but we should leave that possibility open.)

Then I think the while loop could be removed: if the phone number
normalization succeeds, mark the token type as phone.
Otherwise, in the exception case, output the token unchanged.
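
A rough sketch of that control flow, in plain Java rather than against the
real TokenFilter API. The Token class, the normalize helper, and its
digit-count rule are hypothetical stand-ins, not what the attached
PhoneFilter.java actually does:

```java
final class PhoneNormalizationSketch {

    /** Simplified stand-in for a token: term text plus a type label. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /**
     * Hypothetical, deliberately naive normalizer: keeps ASCII digits, allows
     * common separators, and throws when the input does not look like a phone
     * number. A real implementation would be far more thorough.
     */
    static String normalize(String raw, String defaultCountryCode) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (" ()-.+".indexOf(c) < 0) {
                throw new IllegalArgumentException("not a phone number: " + raw);
            }
        }
        if (digits.length() < 7 || digits.length() > 15) {
            throw new IllegalArgumentException("unexpected digit count: " + raw);
        }
        // Assumption: a leading '+' means the country code is already present.
        boolean hasCountryCode = raw.trim().startsWith("+");
        return "+" + (hasCountryCode ? "" : defaultCountryCode) + digits;
    }

    /** One token in, one token out: no loop, and nothing is dropped. */
    static Token filter(Token in, String defaultCountryCode) {
        try {
            // Success: emit the normalized term and mark the type as phone.
            return new Token(normalize(in.term, defaultCountryCode), "phone");
        } catch (IllegalArgumentException e) {
            // Exception case: output the token unchanged, original type kept.
            return in;
        }
    }

    public static void main(String[] args) {
        for (String term : new String[] {"(555) 123-4567", "hello"}) {
            Token out = filter(new Token(term, "word"), "1");
            System.out.println(out.term + " [" + out.type + "]");
        }
    }
}
```

In an actual filter the same try/catch would presumably sit inside
incrementToken(), updating the term and type attributes in place.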

Non-phone-numbers (or whatever else) can then easily be filtered out
separately with a subclass of FilteringTokenFilter.
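
To illustrate the separate filtering step, here is a minimal plain-Java
model of the accept() idea behind FilteringTokenFilter. The Token class and
the keepAccepted helper are hypothetical; in Lucene the subclass would
override accept() and would likely check the token's TypeAttribute instead:

```java
import java.util.ArrayList;
import java.util.List;

final class PhoneOnlyFilterSketch {

    /** Simplified stand-in for a token: term text plus a type label. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /** Models the accept() decision: true keeps the token, false drops it. */
    static boolean accept(Token token) {
        return "phone".equals(token.type);
    }

    /** Hypothetical driver standing in for the token stream machinery. */
    static List<Token> keepAccepted(List<Token> tokens) {
        List<Token> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (accept(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Keeping this apart from normalization means applications that want the
non-phone tokens simply leave this second filter out of the chain.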
                
> Add a phone number normalization TokenFilter
> --------------------------------------------
>
>                 Key: LUCENE-3663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3663
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Santiago M. Mola
>            Priority: Minor
>         Attachments: PhoneFilter.java
>
>
> Phone numbers can be found in the wild in an infinite variety of formats 
> (e.g. with spaces, parentheses, dashes, with or without country code, with 
> letters substituted for numbers). So some Lucene applications can benefit 
> from phone normalization with a TokenFilter that takes a phone number in any 
> format and outputs it in a standard format, using a default country to guess 
> the country code if it's not present.


        
