[ 
https://issues.apache.org/jira/browse/UIMA-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193209#comment-14193209
 ] 

Peter Klügl commented on UIMA-4079:
-----------------------------------

Unfortunately, that's not that easy.

I assume that the problem can be observed when entries consisting of two 
tokens are not assigned to feature values. I'll explain the problem for word 
lists and dictionary lookup in general. The same applies to word tables.

Ruta provides a coverage-based concept of visibility for rules. Text covered by 
an annotation of a filtered type is not visible to rules. One strength of the 
dictionary lookup in Ruta is that it can use this functionality as well. You 
can configure text spans that the dictionary lookup should ignore with 
FILTERTYPE and friends. This means that the dictionary lookup never sees 
whitespace when the default filtering settings are used. The actual string 
provided to the dictionary is not "Bill Clinton" but "BillClinton". Therefore, 
it does not matter whether there is one space or several spaces (or any kind 
of invisible text) between "Bill" and "Clinton". If we only used 
"getCoveredText()", the lookup would fail in many scenarios. 
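
To illustrate the idea, here is a minimal Python sketch of how a visible text 
can be derived from a covered text. It is not Ruta's implementation: the 
function names and the regex-based whitespace spans are assumptions that stand 
in for annotations of a filtered type.

```python
import re

def visible_covered_text(text, filtered_spans):
    """Return text without the characters that lie inside a filtered span.

    filtered_spans holds (begin, end) offsets, end exclusive. They stand in
    for annotations of a filtered type (hypothetical representation).
    """
    hidden = set()
    for begin, end in filtered_spans:
        hidden.update(range(begin, end))
    return "".join(ch for i, ch in enumerate(text) if i not in hidden)

def whitespace_spans(text):
    """Assumed stand-in for the default filtering: every whitespace run."""
    return [m.span() for m in re.finditer(r"\s+", text)]

one_space = "Bill Clinton"
many_spaces = "Bill   \n Clinton"
```

Both inputs collapse to the same visible string, which is why the number of 
spaces between the two tokens does not matter for the lookup.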

Dictionary entries like "Bill Clinton" are only found with the default 
filtering settings because of a convenience method that skips whitespace in 
the trie (dictionary char nodes). This is also the reason why entries are 
sometimes not found in the documents if the dictionary contains entries that 
create ambiguous paths in the trie. 
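
The effect can be sketched with a small trie in Python. This is only an 
illustration under the assumption that the lookup follows a single path and 
steps over whitespace nodes when no direct child matches. Whether Ruta's trie 
behaves exactly like this is not stated here, and all names are hypothetical.

```python
END = "$end"  # marker key for a complete entry; cannot clash with 1-char keys

def build_trie(entries):
    """Build a character trie from nested dicts, whitespace included."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = entry
    return root

def greedy_lookup(trie, visible_text):
    """Follow one path only; skip whitespace nodes if no direct child matches."""
    node = trie
    for ch in visible_text:
        while ch not in node and " " in node:
            node = node[" "]  # step over a whitespace node stored in the trie
        if ch not in node:
            return None
        node = node[ch]
    return node.get(END)

names = build_trie(["Bill Clinton"])
ambiguous = build_trie(["a b", "ab c"])
```

With `names`, the visible text "BillClinton" reaches the entry "Bill Clinton". 
With `ambiguous`, the visible text "ab" follows the direct child "b" that 
belongs to "ab c" and never tries the whitespace branch of "a b", so the entry 
is missed.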

I do not really want to change this strategy because it allows the user to 
specify whitespace-sensitive dictionaries, which contain entries with different 
combinations of whitespace.

After all, the increased expressiveness comes at the price that users have 
problems applying the dictionaries. We should do something about that. I 
normally suggest removing all unimportant characters from the dictionary 
entries, but that is not really a convenient approach for users.

There are several things that we can do in order to improve it:
- I could introduce a parameter (in the engine) that, when activated, removes 
all whitespace when the dictionaries are loaded. (However, we would need to 
consider multi-tree word lists.) This would lead to whitespace-insensitive 
dictionaries for all script files applied in the engine. 
- I could introduce a fallback method that uses "getCoveredText" if 
"getVisibleCoveredText" has not found any entries, or a method that checks 
their existence ignoring spaces. This would suffice in most scenarios, but it 
cannot provide the complete functionality because the current visibility 
setting is not known within the dictionaries and cannot be reproduced there. 
The annotations are simply not present.
- I could refactor the complete lookup process in order to remember the row of 
the table in which the entry was matched. Then, the problematic code mentioned 
in the question would not be necessary. However, this refactoring should not be 
done before the refactoring of the complete dictionary stuff.
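
The first two options can be sketched in Python as follows. This is a rough 
illustration with hypothetical function names and a plain set instead of the 
trie, not a proposal for the actual engine code.

```python
def normalize_entries(entries):
    """Option 1 (sketch): remove all whitespace when the dictionary is loaded."""
    return {"".join(entry.split()) for entry in entries}

def lookup_with_fallback(entries, visible_text, covered_text):
    """Option 2 (sketch): try the visible text, then fall back to the covered text."""
    if visible_text in entries:
        return visible_text
    if covered_text in entries:
        return covered_text
    return None

raw_entries = {"Bill Clinton"}
normalized = normalize_entries(["Bill Clinton", "Bill  Clinton"])
```

Option 1 maps both spellings to "BillClinton", so the dictionary can no longer 
distinguish them. Option 2 finds "Bill Clinton" when the document contains 
exactly one space, but returns nothing for "Bill  Clinton" with two spaces, 
which is the kind of gap mentioned for the fallback.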

Any opinions?


> MarkTable action not able to recognize entities with two or more words
> ----------------------------------------------------------------------
>
>                 Key: UIMA-4079
>                 URL: https://issues.apache.org/jira/browse/UIMA-4079
>             Project: UIMA
>          Issue Type: Bug
>          Components: ruta
>    Affects Versions: 2.2.2ruta
>            Reporter: Silvestre Losada
>             Fix For: 2.2.2ruta
>
>
> I think this error was introduced while solving UIMA-4071. The problem is 
> that the RutaStream.getVisibleCoveredText method removes whitespace from the 
> covered text. For example, for the covered text "Bill Clinton" it returns 
> "BillClinton".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
