[
https://issues.apache.org/jira/browse/UIMA-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193209#comment-14193209
]
Peter Klügl commented on UIMA-4079:
-----------------------------------
Unfortunately, it's not that easy.
I assume that the problem can be observed when entries of two tokens are not
assigned to feature values. I am going to explain the problem for word lists
and dictionary lookup in general. It's the same thing for word tables.
Ruta provides a coverage-based concept of visibility for rules. Text covered by
an annotation of a type that is filtered is not visible to rules. One strength
of the dictionary lookup in Ruta is that it can also use this functionality.
You can configure text spans that should be ignored by the dictionary lookup
with FILTERTYPE and friends. This means that the dictionary lookup never sees
whitespace when the default filtering settings are used. The actual string
provided to the dictionary is not "Bill Clinton" but "BillClinton". Therefore,
it does not matter if there is one space or several spaces (or any kind of
invisible text) between "Bill" and "Clinton". If we only used
"getCoveredText()", the lookup would fail in many scenarios.
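To illustrate the idea, here is a minimal Python sketch (not the actual Ruta
code). The function name visible_covered_text and the representation of
filtered annotations as (begin, end) offset pairs are assumptions made only for
this example.

```python
def visible_covered_text(text, filtered_spans):
    """Return the text with all characters inside filtered spans removed.

    filtered_spans is a list of (begin, end) offsets, end exclusive. It stands
    in for annotations of filtered types (hypothetical representation).
    """
    hidden = set()
    for begin, end in filtered_spans:
        hidden.update(range(begin, end))
    # Keep only the characters that are not covered by a filtered span.
    return "".join(ch for i, ch in enumerate(text) if i not in hidden)


# One space or two spaces between the tokens give the same visible string.
one_space = visible_covered_text("Bill Clinton", [(4, 5)])
two_spaces = visible_covered_text("Bill  Clinton", [(4, 6)])
print(one_space, two_spaces)
```

Both calls yield the same string, which is why the number of spaces in the
document does not matter for the lookup.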
Dictionary entries like "Bill Clinton" are only found using the default
filtering settings due to a convenience method that skips whitespace in the
trie (dictionary char nodes). This is also the reason why entries are
sometimes not found in the documents if the dictionary contains entries that
provide ambiguous paths in the trie.
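The following Python sketch shows how such misses can arise. It is a
simplified illustration under my own assumptions (a character trie built from
nested dicts and a greedy lookup without backtracking), not the actual Ruta
implementation. The names build_trie, greedy_lookup and the "$end" marker are
hypothetical.

```python
END = "$end"  # marker key for a complete entry; never collides with a single char


def build_trie(entries):
    """Build a character trie; whitespace in entries is stored as nodes too."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = entry
    return root


def greedy_lookup(trie, visible_text):
    """Look up whitespace-free text, skipping whitespace nodes in the trie.

    The lookup always prefers a direct child and never backtracks.
    """
    node = trie
    for ch in visible_text:
        # Step over whitespace nodes only if there is no direct child for ch.
        while ch not in node and " " in node:
            node = node[" "]
        if ch not in node:
            return None
        node = node[ch]
    return node.get(END)


# Unambiguous dictionary: the entry is found although the text has no space.
print(greedy_lookup(build_trie(["Bill Clinton"]), "BillClinton"))

# Ambiguous paths: after "a" both " " and "b" are possible. The greedy lookup
# follows "b" towards "ab x" and therefore misses the entry "a b".
print(greedy_lookup(build_trie(["a b", "ab x"]), "ab"))
```

The first lookup returns the entry, the second returns None even though "a b"
would match once the whitespace is skipped.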
I do not really want to change this strategy because it allows the user to
specify whitespace-sensitive dictionaries, which contain entries with different
combinations of whitespace.
After all, the increased expressiveness comes at the price that users have
problems applying the dictionaries. We should do something about that. I
normally suggest removing all unimportant chars in the dictionary entries, but
that is not really a convenient approach for users.
There are several things that we can do in order to improve it:
- I could introduce a parameter (in the engine) that, when activated, removes
all whitespace when the dictionaries are loaded. (However, we would need to
consider multi tree word lists.) This would lead to whitespace-insensitive
dictionaries for all applied script files in the engine.
- I could introduce a fall-back method that uses "getCoveredText" if
"getVisibleCoveredText" has not found any entries, or a method that checks
their existence ignoring spaces. This would suffice in most scenarios, but it
cannot provide the complete functionality because the visibility settings in
effect at matching time cannot be known or reproduced within the
dictionaries. The annotations are simply not present.
- I could refactor the complete lookup process in order to remember the row of
the table in which the entry was matched. Then, the problematic code mentioned
in the question would not be necessary. However, this refactoring should not be
done before the refactoring of the complete dictionary stuff.
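As a rough Python sketch of the first two options (an illustration under my own
assumptions, not existing Ruta code): load_whitespace_insensitive and
lookup_with_fallback are hypothetical names, and the dictionary is modelled as
a plain set or dict instead of a trie.

```python
import re


def load_whitespace_insensitive(entries):
    """Option 1: strip all whitespace from the entries at load time.

    Returns a mapping from the normalized key to the original entries, so
    entries that differ only in whitespace end up under the same key.
    """
    index = {}
    for entry in entries:
        key = re.sub(r"\s+", "", entry)
        index.setdefault(key, []).append(entry)
    return index


def lookup_with_fallback(entries, visible_text, covered_text):
    """Option 2: try the visible text first, then fall back to the covered text."""
    if visible_text in entries:
        return visible_text
    if covered_text in entries:
        return covered_text
    return None


index = load_whitespace_insensitive(["Bill Clinton", "Bill  Clinton"])
print(index["BillClinton"])

print(lookup_with_fallback({"Bill Clinton"}, "BillClinton", "Bill Clinton"))
```

The fall-back only helps when the covered text equals the dictionary entry
exactly. Two spaces in the document would still miss a one-space entry, which
is the limitation described for the second option.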
Any opinions?
> MarkTable action not able to recognize entities with two or more words
> ----------------------------------------------------------------------
>
> Key: UIMA-4079
> URL: https://issues.apache.org/jira/browse/UIMA-4079
> Project: UIMA
> Issue Type: Bug
> Components: ruta
> Affects Versions: 2.2.2ruta
> Reporter: Silvestre Losada
> Fix For: 2.2.2ruta
>
>
> I think this error was introduced solving UIMA-4071. The problem is that
> RutaStream.getVisibleCoveredText method removes whitespaces in covered text.
> For example Bill Clinton covered text returns BillClinton.