[jira] Updated: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

Shyamal Prasad (JIRA) Thu, 17 Dec 2009 17:08:47 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shyamal Prasad updated LUCENE-2167:
-----------------------------------

    Description: 
The Javadoc for StandardTokenizer states:

{quote}
Splits words at punctuation characters, removing punctuation. 
However, a dot that's not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
{quote}

This is not accurate. The actual JFlex implementation treats hyphens interchangeably with other punctuation characters. So, for example, "video,mp4,test" results in a *single* token and not three tokens as the documentation would suggest.

Additionally, the documentation suggests that "video-mp4-test-again" would become a single token, but in reality it results in two tokens: "video-mp4-test" and "again".

IMHO the parser implementation is fine as is, since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string.

The patch included here updates the documentation string and adds a few test cases to confirm the behavior described above.
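
For illustration only, the two examples above can be reproduced with a small stand-alone model. This is a sketch under assumptions, not the JFlex grammar and not the attached patch: the class PunctuationTokenModel, its methods, the JOINERS set (limited to the hyphen and comma mentioned in this issue) and the "every other segment must contain a digit" rule are all hypothetical simplifications, and other StandardTokenizer rules (e-mail addresses, host names, acronyms and so on) are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationTokenModel {
    // Assumption: only the two characters discussed in the issue act as joiners.
    private static final String JOINERS = "-,";

    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) return true;
        }
        return false;
    }

    // Counts the leading segments that may be joined when every segment whose
    // index has the given parity must contain a digit. Returns 0 when fewer
    // than two segments qualify, meaning nothing is joined.
    static int joinable(String text, List<int[]> segs, int digitParity) {
        int k = 0;
        while (k < segs.size()) {
            int[] s = segs.get(k);
            if (k % 2 == digitParity && !hasDigit(text.substring(s[0], s[1]))) break;
            k++;
        }
        return k >= 2 ? k : 0;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            if (!Character.isLetterOrDigit(text.charAt(i))) { i++; continue; }
            // Collect alphanumeric segments separated by exactly one joiner.
            List<int[]> segs = new ArrayList<>();
            int p = i;
            while (true) {
                int start = p;
                while (p < n && Character.isLetterOrDigit(text.charAt(p))) p++;
                segs.add(new int[] {start, p});
                boolean joinerNext = p + 1 < n
                        && JOINERS.indexOf(text.charAt(p)) >= 0
                        && Character.isLetterOrDigit(text.charAt(p + 1));
                if (!joinerNext) break;
                p++; // step over the joiner
            }
            // Take the longer of the two alternating patterns, or a single segment.
            int keep = Math.max(1, Math.max(joinable(text, segs, 1), joinable(text, segs, 0)));
            int end = segs.get(keep - 1)[1];
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("video,mp4,test"));
        System.out.println(tokenize("video-mp4-test-again"));
        System.out.println(tokenize("video-test"));
    }
}
```

In this model, hyphens and commas are handled identically, so "video,mp4,test" stays whole, while "video-mp4-test-again" stops joining at "again" because that segment sits in a position that would need a digit.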

  was:
The Javadoc for StandardTokenization states:

{quote}
Splits words at punctuation characters, removing punctuation. 
However, a dot that's not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
{quote}

This is not accurate. The actual JFlex implementation treats hyphens interchangeably with other punctuation characters. So, for example, "video,mp4,test" results in a *single* token and not three tokens as the documentation would suggest.

Additionally, the documentation suggests that "video-mp4-test-again" would become a single token, but in reality it results in two tokens: "video-mp4-test" and "again".

IMHO the parser implementation is fine as is, since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string.


> StandardTokenizer Javadoc does not correctly describe tokenization around 
> punctuation characters
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Shyamal Prasad
>            Priority: Minor
>         Attachments: LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h

