[
https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jasper Huzen updated UIMA-5723:
-------------------------------
Comment: was deleted
(was: The change/fix in UIMA-4556 causes some problems when using a CSV file
with whitespace.
When the parameter PARAM_DICT_REMOVE_WS is set to TRUE and whitespace (WS) is
not visible in the token stream:
- all items in the dictionary will be recognized
- all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.
If whitespace is visible, entries containing spaces won't be recognized.
The cause of this is that the hard-coded default value for ignoring
whitespace is always true:
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct: if whitespace is significant and you want it to be taken
into account, matching won't work. The matcher should use the value set in
the PARAM_DICT_REMOVE_WS parameter or the value set via the setIgnoreWS
method.
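To illustrate the intended behaviour, here is a minimal sketch. It is not the
Ruta matcher itself: the class DictionaryMatchSketch and its methods are
hypothetical stand-ins, and the lookup is case-sensitive for simplicity. The
only point is that the whitespace flag comes from configuration instead of
being fixed to true:

```java
import java.util.Set;

// Illustrative stand-in, not Ruta's actual matcher: all names are hypothetical.
final class DictionaryMatchSketch {

  // Supplied by the caller (e.g. from a configuration parameter)
  // instead of being hard-coded to true.
  private final boolean ignoreWS;
  private final Set<String> entries;

  DictionaryMatchSketch(Set<String> entries, boolean ignoreWS) {
    this.entries = entries;
    this.ignoreWS = ignoreWS;
  }

  // Strips all whitespace when ignoreWS is set; otherwise returns the text unchanged.
  private String normalize(String text) {
    return ignoreWS ? text.replaceAll("\\s+", "") : text;
  }

  // Case-sensitive comparison of the candidate against every dictionary entry.
  boolean matches(String candidate) {
    String wanted = normalize(candidate);
    for (String entry : entries) {
      if (normalize(entry).equals(wanted)) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch, with ignoreWS = true the text "IlikeRuta" matches the entry
"I like Ruta"; with ignoreWS = false only the exact text "I like Ruta" matches.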
I attached a patch to fix this issue. [^UIMA-5723.patch])
> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---------------------------------------------------------------------------
>
> Key: UIMA-5723
> URL: https://issues.apache.org/jira/browse/UIMA-5723
> Project: UIMA
> Issue Type: Bug
> Components: Ruta
> Affects Versions: 2.6.1ruta
> Reporter: Andreas Thiel
> Assignee: Peter Klügl
> Priority: Major
>
> When using Ruta's MARKTABLE action with a CSV file {{nl_law_names.csv}} like
> this
> {code:xml}
> WAZ;WAZELF
> Wet arbeidsongeschiktheidsverzekering zelfstandigen;WAZELF
> {code}
> and corresponding Ruta script containing these lines
> {code:java}
> WORDTABLE LawNameTable = 'nl_law_names.csv';
> Document{->MARKTABLE(WetNaam, 1, LawNameTable, "WetIdentifier" = 2)};
> {code}
> it seems that the text {{WAZ}} is detected, but the {{WetIdentifier}} feature
> of the resulting annotation is not filled by the string following the
> semicolon. Instead, it remains empty.
> (Note: _WetNaam_ annotation is defined elsewhere via type system description)
> In contrast, the fully written name {{Wet arbeidsongeschiktheidsverzekering
> zelfstandigen}} is detected and processed as expected with feature
> WetIdentifier = WAZELF after annotating.
> Could it be that problems arise when only a single word (i.e. no spaces or
> uppercase letters following lowercase chars) is present in the first column
> in the CSV file? Or is it a matter of configuration?
> We also experimented with the optional arguments of MARKTABLE regarding
> uppercase/lowercase distinction, but to no avail.