[ https://issues.apache.org/jira/browse/OPENNLP-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno P. Kinoshita updated OPENNLP-1528:
----------------------------------------
    Description: 
I posted on Twitter about the issue with the word "ós" found in our tokenizer 
tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that 
our regexp for Catalan didn't seem right.

Created this issue so we can test & fix it.
{noformat}
Regexp is not fully correct. Catalan written language uses middle dot / 
interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar, 
cancel·lar,... {noformat}
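
As a starting point for the test, here is a minimal sketch of a word pattern that keeps the middle dot inside a word. It is not the regexp currently used in OpenNLP; the class name CatalanWordSketch, the WORD pattern and the words() helper are hypothetical. It also assumes the middle dot should only be accepted between two letters "l" (the ela geminada), which is narrower than "any inner word character".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CatalanWordSketch {

    // Hypothetical pattern, not the one shipped in OpenNLP.
    // A word is a run of letters, optionally extended by U+00B7 plus more
    // letters, but only when the dot sits directly between two letters "l".
    // A dot in any other position is left out of the match.
    private static final Pattern WORD = Pattern.compile(
        "\\p{L}+(?:(?<=[lL])\\u00B7(?=[lL])\\p{L}+)*");

    // Returns every match of WORD in the given text, in order.
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only:
        // \u00B7 is the middle dot, \u00F3 is "o" with acute accent.
        for (String w : words("Cal instal\u00B7lar la cel\u00B7la de l'\u00F3s.")) {
            System.out.println(w);
        }
    }
}
```

With this pattern "instal·lar" and "cel·la" come out as single words, while a middle dot that is not between two "l" letters still splits the text. Whether the real fix should be this strict is something to confirm with a Catalan speaker.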
 

!image-2023-12-11-15-20-31-518.png|width=365,height=429!

  was:
I posted on Twitter about the issue with the word "ós" found in our tokenizer 
tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that 
our regexp for Catalan didn't seem right.

Created this issue so we can test & fix it.

>Regexp is not fully correct. Catalan written language uses middle dot / 
>interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar, 
>cancel·lar,...

!image-2023-12-11-15-20-31-518.png|width=365,height=429!


> Review Catalan regexp for the ela germinada
> -------------------------------------------
>
>                 Key: OPENNLP-1528
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1528
>             Project: OpenNLP
>          Issue Type: Bug
>            Reporter: Bruno P. Kinoshita
>            Assignee: Bruno P. Kinoshita
>            Priority: Minor
>         Attachments: image-2023-12-11-15-20-31-518.png
>
>
> I posted on Twitter about the issue with the word "ós" found in our tokenizer 
> tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out 
> that our regexp for Catalan didn't seem right.
> Created this issue so we can test & fix it.
> {noformat}
> Regexp is not fully correct. Catalan written language uses middle dot / 
> interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar, 
> cancel·lar,... {noformat}
>  
> !image-2023-12-11-15-20-31-518.png|width=365,height=429!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
