[
https://issues.apache.org/jira/browse/OPENNLP-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bruno P. Kinoshita updated OPENNLP-1528:
----------------------------------------
Description:
I shared on Twitter about the issue with the word "ós" found in our tokenizer
tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that
our regexp for Catalan didn't seem right.
Created this issue so we can test & fix it.
{noformat}
Regexp is not fully correct. Catalan written language uses middle dot /
interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
cancel·lar,... {noformat}
!image-2023-12-11-15-20-31-518.png|width=365,height=429!
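To make the point above concrete, here is a minimal sketch of a word pattern that treats the middle dot as an inner-word character only when it sits between two "l" letters. The class name CatalanWordSketch, the WORD pattern, and the words helper are hypothetical illustrations, not the regexp currently used in OpenNLP, and the sketch assumes the input is in precomposed (NFC) form so that letters such as "ó" are single code points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not OpenNLP code.
final class CatalanWordSketch {

    // A word is a run of letters; U+00B7 (middle dot) is also accepted,
    // but only when the previous and the next characters are both "l" or "L".
    private static final Pattern WORD =
            Pattern.compile("(?:\\p{L}|(?<=[lL])\\u00B7(?=[lL]))+");

    // Returns every substring of the text that matches WORD, in order.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "Cal instal·lar la cel·la." with the middle dot written as an escape
        System.out.println(words("Cal instal\u00B7lar la cel\u00B7la."));
        // "ós", the word from the tokenizer tests
        System.out.println(words("\u00F3s"));
        // A middle dot that is not between two "l" letters splits the word
        System.out.println(words("a\u00B7b"));
    }
}
```

The lookbehind and lookahead keep "cel·la" together as one word while still splitting on a middle dot used anywhere else. Whether the real fix should be this strict, or should simply accept U+00B7 between any two letters, is a design choice for the test cases to settle.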
was:
I shared on Twitter about the issue with the word "ós" found in our tokenizer
tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that
our regexp for Catalan didn't seem right.
Created this issue so we can test & fix it.
>Regexp is not fully correct. Catalan written language uses middle dot /
>interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
>cancel·lar,...
!image-2023-12-11-15-20-31-518.png|width=365,height=429!
> Review Catalan regexp for the ela geminada
> ------------------------------------------
>
> Key: OPENNLP-1528
> URL: https://issues.apache.org/jira/browse/OPENNLP-1528
> Project: OpenNLP
> Issue Type: Bug
> Reporter: Bruno P. Kinoshita
> Assignee: Bruno P. Kinoshita
> Priority: Minor
> Attachments: image-2023-12-11-15-20-31-518.png
>
>
> I shared on Twitter about the issue with the word "ós" found in our tokenizer
> tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out
> that our regexp for Catalan didn't seem right.
> Created this issue so we can test & fix it.
> {noformat}
> Regexp is not fully correct. Catalan written language uses middle dot /
> interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
> cancel·lar,... {noformat}
>
> !image-2023-12-11-15-20-31-518.png|width=365,height=429!