Jasper Huzen created UIMA-5752:
----------------------------------

             Summary: Problem with matching items in MarkTable with 
whitespacers visible
                 Key: UIMA-5752
                 URL: https://issues.apache.org/jira/browse/UIMA-5752
             Project: UIMA
          Issue Type: Bug
          Components: Ruta
    Affects Versions: 2.6.1ruta
            Reporter: Jasper Huzen


The change / fix in UIMA-4556 causes some problems when using a CSV file that 
contains whitespace.

When we have a dictionary with whitespace between words, the behavior depends 
on the parameter setting:

 * Param PARAM_DICT_REMOVE_WS is TRUE:

When WS are visible in the token stream:
- words with spaces are not recognized (as expected).

When WS are NOT visible in the token stream:
- all items in the dictionary will be recognized
- all items will also be recognized if you add whitespace between words. For 
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.

 * Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
- not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
- also not all entries in the dictionary will be recognized
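
The cases above can be illustrated with a small, simplified model. This is only a sketch and not Ruta's actual implementation: the class WhitespaceMatchSketch and its methods stripWs and matches are hypothetical names, and case differences are left out on purpose. It shows why removing whitespace makes differently spaced inputs collapse into the same match, while an exact comparison keeps them apart.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WhitespaceMatchSketch {

    // Hypothetical helper: removes all whitespace from a string, modelling
    // what a "remove whitespace" dictionary option could do.
    static String stripWs(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Hypothetical lookup: with ignoreWs == true both the dictionary entry
    // and the candidate are compared without whitespace; otherwise the
    // candidate has to equal an entry exactly.
    static boolean matches(Set<String> dictionary, String candidate, boolean ignoreWs) {
        if (!ignoreWs) {
            return dictionary.contains(candidate);
        }
        String key = stripWs(candidate);
        for (String entry : dictionary) {
            if (stripWs(entry).equals(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("I like Ruta"));

        // Whitespace ignored: all spellings collapse to the same key.
        System.out.println(matches(dict, "IlikeRuta", true));    // true
        System.out.println(matches(dict, "Ilike Ruta", true));   // true
        System.out.println(matches(dict, "I like Ruta", true));  // true

        // Whitespace significant: only the exact entry matches.
        System.out.println(matches(dict, "IlikeRuta", false));   // false
        System.out.println(matches(dict, "I like Ruta", false)); // true
    }
}
```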


The cause of this problem is that the default value for ignoring whitespace is 
always true (hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct: if whitespace is significant in the dictionary entries, 
matching cannot take it into account. The matcher should use the value set in 
the PARAM_DICT_REMOVE_WS parameter or the value set via the setIgnoreWS 
method.
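
A minimal sketch of the intended behavior, assuming a simplified matcher. This is not the attached patch and does not use Ruta's real classes: ConfigurableMatcherSketch, its constructor argument dictRemoveWs, and its matches method are hypothetical, and a plain boolean stands in for the IBooleanExpression field. The point is only that the flag is initialized from configuration and can be overridden by a setter instead of being fixed to true.

```java
public class ConfigurableMatcherSketch {

    // Assumption: a plain boolean replaces the expression object used in
    // the real code. It is no longer initialized to a fixed "true".
    private boolean ignoreWS;

    // Hypothetical constructor: takes the value that would come from the
    // PARAM_DICT_REMOVE_WS configuration parameter.
    public ConfigurableMatcherSketch(boolean dictRemoveWs) {
        this.ignoreWS = dictRemoveWs;
    }

    // An explicit setter overrides the configured default.
    public void setIgnoreWS(boolean ignoreWS) {
        this.ignoreWS = ignoreWS;
    }

    // Simplified comparison of one dictionary entry against a text span.
    public boolean matches(String entry, String text) {
        if (ignoreWS) {
            return entry.replaceAll("\\s+", "").equals(text.replaceAll("\\s+", ""));
        }
        return entry.equals(text);
    }

    public static void main(String[] args) {
        ConfigurableMatcherSketch matcher = new ConfigurableMatcherSketch(false);
        System.out.println(matcher.matches("I like Ruta", "IlikeRuta")); // false

        matcher.setIgnoreWS(true);
        System.out.println(matcher.matches("I like Ruta", "IlikeRuta")); // true
    }
}
```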

I attached a patch to fix this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
