[jira] [Commented] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-07-03 Thread Jasper Huzen (JIRA)


[ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531717#comment-16531717
 ] 

Jasper Huzen commented on UIMA-5752:


Added the patch again. My company has approved the contribution.

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Assignee: Peter Klügl
>Priority: Major
> Attachments: UIMA-5752 Fix_issue_with_ignoreWS_in_MarkTableAction.patch
>
>
> The change/fix in UIMA-4556 causes some problems when using a CSV file with
> whitespace.
> When we have a dictionary with whitespace between words and
> >> Param PARAM_DICT_REMOVE_WS is TRUE:
> When WS are visible in the token stream:
>  - words with spaces are not recognized (as expected).
> When WS are NOT visible in the token stream:
>  - all items in the dictionary will be recognized
>  - all items will also be recognized if you add whitespace between words.
> For example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.
> >> Param PARAM_DICT_REMOVE_WS is FALSE:
> When WS are visible in the token stream:
>  - not all entries in the dictionary will be recognized
> When WS are NOT visible in the token stream:
>  - also not all entries in the dictionary will be recognized
> The cause of this problem is that the default value for ignoring whitespace
> is always true (hardcoded).
> {code:java}
> private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
> {code}
> This is not correct, because matching cannot take whitespace into account
> when it is significant. The matcher should use the value set in the
> PARAM_DICT_REMOVE_WS parameter or the value set via the setIgnoreWS
> method.
> -I attached a patch to fix this issue.-
> I'm working on a patch.
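
The requested behaviour can be sketched in plain Java. This is only an illustration under stated assumptions, not the Ruta implementation: the class DictionaryMatcher and its methods are hypothetical, and the single point it makes is that the whitespace flag is supplied by the caller instead of being fixed to true.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a dictionary lookup; not the Ruta API.
final class DictionaryMatcher {
  private final Set<String> entries = new HashSet<>();
  private final boolean ignoreWS;

  // The flag comes from the caller (for example from a configuration
  // parameter) instead of being hardcoded to true.
  DictionaryMatcher(Iterable<String> dictionary, boolean ignoreWS) {
    this.ignoreWS = ignoreWS;
    for (String entry : dictionary) {
      entries.add(normalize(entry));
    }
  }

  private String normalize(String text) {
    // Strip all whitespace only when whitespace is to be ignored.
    return ignoreWS ? text.replaceAll("\\s+", "") : text;
  }

  boolean matches(String candidate) {
    return entries.contains(normalize(candidate));
  }
}
```

With the flag set to true, "IlikeRuta", "Ilike Ruta" and "I like Ruta" are treated as the same entry; with the flag set to false, only the exact spelling from the dictionary matches.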



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-07-03 Thread Jasper Huzen (JIRA)


 [ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5752:
---
Attachment: UIMA-5752 Fix_issue_with_ignoreWS_in_MarkTableAction.patch






[jira] [Issue Comment Deleted] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-07-03 Thread Jasper Huzen (JIRA)


 [ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5752:
---
Comment: was deleted

(was: Patch removed because it was not complete. )






[jira] [Updated] (UIMA-5777) Incorrect feature assignment on MARKTABLE because incorrect record can be used

2018-05-28 Thread Jasper Huzen (JIRA)


 [ 
https://issues.apache.org/jira/browse/UIMA-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5777:
---
Flags:   (was: Patch)






[jira] [Updated] (UIMA-5777) Incorrect feature assignment on MARKTABLE because incorrect record can be used

2018-05-16 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5777:
---
Attachment: UIMA-5777.patch






[jira] [Created] (UIMA-5777) Incorrect feature assignment on MARKTABLE because incorrect record can be used

2018-05-16 Thread Jasper Huzen (JIRA)
Jasper Huzen created UIMA-5777:
--

 Summary: Incorrect feature assignment on MARKTABLE because 
incorrect record can be used
 Key: UIMA-5777
 URL: https://issues.apache.org/jira/browse/UIMA-5777
 Project: UIMA
  Issue Type: Bug
  Components: Ruta
Affects Versions: 2.6.1ruta
Reporter: Jasper Huzen


Feature assignment with MARKTABLE can go wrong.

Let us assume that we have a CSV table with the following entries:
||Matching||ID||
|First Item|1|
|Second Item|2|
|SECOND ITEM|3|

and we use the MARKTABLE action so that we match on the first column and assign
the second column to a feature of the annotation.

If we match case-sensitively and use the following input as the document:
"SECOND ITEM", the annotation will get "2" as the feature value.

The MARKTABLE action matches correctly on "SECOND ITEM". In the next step it
tries to set the features. To do so, it calls the getRowWhere method on the
(CSV) table to get the related row in the table.

The code in the getRowWhere method always compares the tableValue and the lookup
value in lowercase, which results in the record containing "Second Item".
That is incorrect, because we need the one with "SECOND ITEM".

I modified the code so that it first tries an exact match. If no matching
item is found and case is ignored, it also does a case-insensitive check.
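
The lookup order described above can be sketched as follows. This is a simplified illustration and not the code from the attached patch: the class SimpleTable and the signature of its getRowWhere method are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified table; not the Ruta CSV table class.
final class SimpleTable {
  private final List<List<String>> rows = new ArrayList<>();

  void addRow(String... cells) {
    rows.add(Arrays.asList(cells));
  }

  // Returns the first row whose cell in the given column equals the lookup
  // value. Exact matches win; the case-insensitive pass only runs when
  // ignoreCase is set and no exact match was found.
  List<String> getRowWhere(int column, String value, boolean ignoreCase) {
    for (List<String> row : rows) {
      if (row.get(column).equals(value)) {
        return row;
      }
    }
    if (ignoreCase) {
      for (List<String> row : rows) {
        if (row.get(column).equalsIgnoreCase(value)) {
          return row;
        }
      }
    }
    return Collections.emptyList();
  }
}
```

With the three rows from the table above, looking up "SECOND ITEM" returns the row with ID 3 regardless of the ignoreCase flag, while "second item" falls back to the first case-insensitive hit (ID 2) only when ignoreCase is true.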





[jira] [Commented] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-05-16 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477320#comment-16477320
 ] 

Jasper Huzen commented on UIMA-5752:


I sent my personal ICLA to Apache. I'm still waiting for our company to accept
the CCLA.






[jira] [Updated] (UIMA-5775) Performance problem MARKTABLE when matching case insensitive

2018-05-14 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5775:
---
Description: 
Hi,

We encounter a performance issue (or maybe an infinite loop) when we use the
MARKTABLE action with case-insensitive value lists.

The call in our script is:
{code:java}
ADDRETAINTYPE(WS);
MARKTABLE(LawName, 1, 'nl_law_names.ignorecase.csv', true, 0, "", 0, 
"lawIdentifier" = 2);{code}
Using the following input fragment will result in a timeout exception after 1 
minute.
{code:java}
Groenboek COM(2006) 105 definitief een Europese strategie voor duurzame, 
concurrerende en continu geleverde energie voor Europa {SEC(2006)317}{code}
That complete name is a Dutch law name and is also an entry in the
_nl_law_names.csv_ file.

When we try to match it with the ignoreCase flag set to false, there is no
problem and matching is fast. If we toggle that flag to true (case is ignored),
matching is really slow or even hangs in an infinite loop.

I debugged the code, which pointed me to the _TreeWordList_ class. The recursive
method _recursiveContains_ has a potential bug.

I think the problem occurs when the item contains a special character that is
the same in uppercase and lowercase. The recursive method will then
look/fork twice on the same tree item.

I made a fix that checks whether the uppercase character is the same as the
lowercase character, and in that case it does the recursive call only once. That
solved the (performance) issue, but I'm not sure whether this is really the main
problem and whether the current fix is the best fix for it.



[jira] [Updated] (UIMA-5775) Performance problem MARKTABLE when matching case insensitive

2018-05-14 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5775:
---
Attachment: UIMA-5775.patch






[jira] [Updated] (UIMA-5775) Performance problem MARKTABLE when matching case insensitive

2018-05-14 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5775:
---
Description: 
Hi,

We encounter a performance issue (or maybe an infinite loop) when we use the
MARKTABLE action with case-insensitive value lists.

The call in our script is:
{code:java}
ADDRETAINTYPE(WS);
MARKTABLE(LawName, 1, 'nl_law_names.ignorecase.csv', true, 0, "", 0, 
"lawIdentifier" = 2);{code}
Using the following input fragment will result in a timeout exception after 1 
minute.
{code:java}
Groenboek COM(2006) 105 definitief een Europese strategie voor duurzame, 
concurrerende en continu geleverde energie voor Europa {SEC(2006)317}{code}
That complete name is a Dutch law name and is also an entry in the
_nl_law_names.csv_ file.

When we try to match it with the ignoreCase flag set to false, there is no
problem and matching is fast. If we toggle that flag to true (case is ignored),
matching is really slow or even hangs in an infinite loop.

I debugged the code, which pointed me to the _TreeWordList_ class. The recursive
method _recursiveContains_ has a potential bug.

I think the problem occurs when the item contains a special character that is
the same in uppercase and lowercase. The recursive method will then
look/fork twice on the same tree item.

I made a fix that checks whether the uppercase character is the same as the
lowercase character, and in that case it does the recursive call only once. That
solved the (performance) issue, but I'm not sure whether this is really the main
problem and whether the current fix is the best fix for it.



[jira] [Created] (UIMA-5775) Performance problem MARKTABLE when matching case insensitive

2018-05-14 Thread Jasper Huzen (JIRA)
Jasper Huzen created UIMA-5775:
--

 Summary: Performance problem MARKTABLE when matching case 
insensitive
 Key: UIMA-5775
 URL: https://issues.apache.org/jira/browse/UIMA-5775
 Project: UIMA
  Issue Type: Bug
  Components: Ruta
Affects Versions: 2.6.1ruta
Reporter: Jasper Huzen


Hi,

We encounter a performance issue (or maybe an infinite loop) when we use the
MARKTABLE action with case-insensitive value lists.

The call in our script is:
{code:java}
ADDRETAINTYPE(WS);
MARKTABLE(LawName, 1, 'nl_law_names.ignorecase.csv', true, 0, "", 0, 
"lawIdentifier" = 2);{code}

Using the following input fragment will result in a timeout exception after 1 
minute.
{code:java}
Groenboek COM(2006) 105 definitief een Europese strategie voor duurzame, 
concurrerende en continu geleverde energie voor Europa {SEC(2006)317}{code}
That complete name is a Dutch law name and is also an entry in the
_nl_law_names.csv_ file.

When we try to match it with the ignoreCase flag set to false, there is no
problem and matching is fast. If we toggle that flag to true (case is ignored),
matching is really slow or even hangs in an infinite loop.

I debugged the code, which pointed me to the _TreeWordList_ class. The recursive
method _recursiveContains_ has a potential bug.

I think the problem occurs when the item contains a special character that is
the same in uppercase and lowercase. The recursive method will then
look/fork twice on the same tree item.

I made a fix that checks whether the uppercase character is the same as the
lowercase character, and in that case it does the recursive call only once. That
solved the performance issue, but I'm not sure whether this is really the main
problem and whether the current fix is the best fix for it.
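
The guard described above can be sketched on a small character trie. This is an illustration only and not the code of the TreeWordList class or of the attached patch: the class CharTrie, its Node type and the calls counter are hypothetical and exist just to make the idea concrete.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified character trie; not the Ruta TreeWordList.
final class CharTrie {
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean wordEnd;
  }

  private final Node root = new Node();
  // Counts recursive calls so the effect of the guard can be observed.
  int calls;

  void add(String word) {
    Node node = root;
    for (char c : word.toCharArray()) {
      node = node.children.computeIfAbsent(c, k -> new Node());
    }
    node.wordEnd = true;
  }

  boolean contains(String text, boolean ignoreCase) {
    calls = 0;
    return recursiveContains(root, text, 0, ignoreCase);
  }

  private boolean recursiveContains(Node node, String text, int index, boolean ignoreCase) {
    calls++;
    if (index == text.length()) {
      return node.wordEnd;
    }
    char c = text.charAt(index);
    if (!ignoreCase) {
      Node next = node.children.get(c);
      return next != null && recursiveContains(next, text, index + 1, ignoreCase);
    }
    char lower = Character.toLowerCase(c);
    char upper = Character.toUpperCase(c);
    Node lowerNode = node.children.get(lower);
    if (lowerNode != null && recursiveContains(lowerNode, text, index + 1, ignoreCase)) {
      return true;
    }
    // Guard: characters such as '(' or '2' have no case, so the upper and
    // lower variants are identical and the same branch must not be walked twice.
    if (upper != lower) {
      Node upperNode = node.children.get(upper);
      return upperNode != null && recursiveContains(upperNode, text, index + 1, ignoreCase);
    }
    return false;
  }
}
```

Without the guard, a lookup that fails late in the input would search the same subtree again for every caseless character on the path, which is one plausible way the slowdown reported here could arise on entries with many digits, brackets and spaces.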





[jira] [Updated] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5758:
---
Description: 
The basicengine.xml (template in the core package) is missing the csvSeparator
parameter. Without this parameter in the engine file, the option doesn't work
when configuring an engine with the CONFIGURE action.

Because the basicengine.xml file is used as a template when generating XML
engine descriptors, the parameter should be available in this file.

This issue is related to change UIMA-5736, which was added to the
[2.6.2ruta|https://issues.apache.org/jira/issues/?jql=project+%3D+UIMA+AND+fixVersion+%3D+2.6.2ruta]
 version.

The attached patch file adds the missing part to the file.







[jira] [Updated] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5758:
---
Labels: patch  (was: )






[jira] [Updated] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5758:
---
Affects Version/s: 2.6.2ruta






[jira] [Updated] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5758:
---
Description: 
The basicengine.xml file (the template in the core package) is missing the
csvSeparator parameter. Without this parameter in the engine file, the option
doesn't work when configuring an engine via the CONFIGURE action.

Because the basicengine.xml file is used as the template when generating XML
engine descriptors, the parameter should be available in this file.

This issue is related to the change in UIMA-5736, which was added to the
[2.6.2ruta|https://issues.apache.org/jira/issues/?jql=project+%3D+UIMA+AND+fixVersion+%3D+2.6.2ruta]
 version.

The attached patch file makes this change.

  was:
The basicengine.xml file (the template in the core package) is missing the
csvSeparator parameter. Without this parameter in the engine file, the option
doesn't work when configuring an engine via the CONFIGURE action.

Because the basicengine.xml file is used as the template when generating XML
engine descriptors, the parameter should be available in this file.

This issue is related to the change in UIMA-5736, which was added to the
[2.6.2ruta|https://issues.apache.org/jira/issues/?jql=project+%3D+UIMA+AND+fixVersion+%3D+2.6.2ruta]
 version.


> csvSeparator parameter is missing in basicengine.xml template
> -
>
> Key: UIMA-5758
> URL: https://issues.apache.org/jira/browse/UIMA-5758
> Project: UIMA
>  Issue Type: Bug
>Reporter: Jasper Huzen
>Priority: Minor
> Attachments: UIMA-5758.diff
>
>





[jira] [Updated] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5758:
---
Attachment: UIMA-5758.diff

> csvSeparator parameter is missing in basicengine.xml template
> -
>
> Key: UIMA-5758
> URL: https://issues.apache.org/jira/browse/UIMA-5758
> Project: UIMA
>  Issue Type: Bug
>Reporter: Jasper Huzen
>Priority: Minor
> Attachments: UIMA-5758.diff
>
>





[jira] [Created] (UIMA-5758) csvSeparator parameter is missing in basicengine.xml template

2018-04-06 Thread Jasper Huzen (JIRA)
Jasper Huzen created UIMA-5758:
--

 Summary: csvSeparator parameter is missing in basicengine.xml 
template
 Key: UIMA-5758
 URL: https://issues.apache.org/jira/browse/UIMA-5758
 Project: UIMA
  Issue Type: Bug
Reporter: Jasper Huzen


The basicengine.xml file (the template in the core package) is missing the
csvSeparator parameter. Without this parameter in the engine file, the option
doesn't work when configuring an engine via the CONFIGURE action.

Because the basicengine.xml file is used as the template when generating XML
engine descriptors, the parameter should be available in this file.

This issue is related to the change in UIMA-5736, which was added to the
[2.6.2ruta|https://issues.apache.org/jira/issues/?jql=project+%3D+UIMA+AND+fixVersion+%3D+2.6.2ruta]
 version.





[jira] [Updated] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5752:
---
Description: 
The change / fix in UIMA-4556 causes some problems when using a CSV file with
whitespace.

When we have a dictionary with whitespace between words and

>> Param PARAM_DICT_REMOVE_WS is TRUE:

When WS are visible in the token stream:
 - words with spaces are not recognized (as expected).

When WS are NOT visible in the token stream:
 - all items in the dictionary will be recognized
 - all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.

>> Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
 - not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
 - again, not all entries in the dictionary will be recognized

The cause of this problem is that the default value for ignoring whitespace is
always true (hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct: if whitespace is significant and you want it to be
respected, that won't work. The matcher should use the same value as set in
the PARAM_DICT_REMOVE_WS parameter or the value that is set via the setIgnoreWS
method.

-I attached a patch to fix this issue.-

I'm working on a patch.

  was:
The change / fix in UIMA-4556 causes some problems when using a CSV file with
whitespace.

When we have a dictionary with whitespace between words and

>> Param PARAM_DICT_REMOVE_WS is TRUE:

When WS are visible in the token stream:
 - words with spaces are not recognized (as expected).

When WS are NOT visible in the token stream:
 - all items in the dictionary will be recognized
 - all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.

>> Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
 - not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
 - again, not all entries in the dictionary will be recognized

The cause of this problem is that the default value for ignoring whitespace is
always true (hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct: if whitespace is significant and you want it to be
respected, that won't work. The matcher should use the same value as set in
the PARAM_DICT_REMOVE_WS parameter or the value that is set via the setIgnoreWS
method.

I attached a patch to fix this issue.


> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
>





[jira] [Comment Edited] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405055#comment-16405055
 ] 

Jasper Huzen edited comment on UIMA-5752 at 3/19/18 8:16 PM:
-

Patch removed because it was not complete. 


was (Author: feaster83):
The patch is not complete: MarkFastAction and others should also be fixed. (I
will look into that.)

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
>





[jira] [Updated] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5752:
---
Attachment: (was: UIMA-5752.patch)

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
>





[jira] [Comment Edited] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405055#comment-16405055
 ] 

Jasper Huzen edited comment on UIMA-5752 at 3/19/18 4:21 PM:
-

The patch is not complete: MarkFastAction and others should also be fixed. (I
will look into that.)


was (Author: feaster83):
The patch is not complete: MarkFastAction should also be fixed.

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
> Attachments: UIMA-5752.patch
>
>





[jira] [Commented] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405055#comment-16405055
 ] 

Jasper Huzen commented on UIMA-5752:


The patch is not complete: MarkFastAction should also be fixed.

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
> Attachments: UIMA-5752.patch
>
>





[jira] [Updated] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5752:
---
Attachment: UIMA-5752.patch

> Problem with matching items in MarkTable with whitespacers visible
> --
>
> Key: UIMA-5752
> URL: https://issues.apache.org/jira/browse/UIMA-5752
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Major
> Attachments: UIMA-5752.patch
>
>





[jira] [Created] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible

2018-03-19 Thread Jasper Huzen (JIRA)
Jasper Huzen created UIMA-5752:
--

 Summary: Problem with matching items in MarkTable with 
whitespacers visible
 Key: UIMA-5752
 URL: https://issues.apache.org/jira/browse/UIMA-5752
 Project: UIMA
  Issue Type: Bug
  Components: Ruta
Affects Versions: 2.6.1ruta
Reporter: Jasper Huzen


The change / fix in UIMA-4556 causes some problems when using a CSV file with
whitespace.

When we have a dictionary with whitespace between words and
 * Param PARAM_DICT_REMOVE_WS is TRUE:

When WS are visible in the token stream:
- words with spaces are not recognized (as expected).

When WS are NOT visible in the token stream:
- all items in the dictionary will be recognized
- all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.
 * Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
- not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
- again, not all entries in the dictionary will be recognized


The cause of this problem is that the default value for ignoring whitespace is
always true (hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct: if whitespace is significant and you want it to be
respected, that won't work. The matcher should use the same value as set in
the PARAM_DICT_REMOVE_WS parameter or the value that is set via the setIgnoreWS
method.

I attached a patch to fix this issue.
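
The behavior described above can be reproduced outside Ruta with a small sketch. This is not Ruta's matcher: the class IgnoreWsSketch and its methods are hypothetical names chosen for illustration, and the example uses one consistent casing ("Ruta") so that only the whitespace handling varies. It shows that once whitespace is stripped from both sides, all spacing variants collapse into the same match, while a flag passed in from configuration (rather than a hardcoded true) allows exact matching again.

```java
// Illustrative sketch only; not Ruta's actual matching code.
// All names in this class are hypothetical.
final class IgnoreWsSketch {

    /** Removes every whitespace character from the given text. */
    static String stripWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    /**
     * Compares a dictionary entry with a candidate from the text.
     * When ignoreWs is true, whitespace is removed from both sides first;
     * otherwise the two strings must be identical.
     */
    static boolean matches(String entry, String candidate, boolean ignoreWs) {
        if (ignoreWs) {
            return stripWhitespace(entry).equals(stripWhitespace(candidate));
        }
        return entry.equals(candidate);
    }

    public static void main(String[] args) {
        String entry = "I like Ruta";
        String[] candidates = {"IlikeRuta", "Ilike Ruta", "I like Ruta"};
        for (String candidate : candidates) {
            System.out.println("'" + candidate + "'"
                + " ignoreWs=true -> " + matches(entry, candidate, true)
                + ", ignoreWs=false -> " + matches(entry, candidate, false));
        }
    }
}
```

With ignoreWs fixed to true, callers have no way to request the exact comparison in the second branch, which is the limitation the report describes.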





[jira] [Issue Comment Deleted] (UIMA-5723) MARKTABLE fails to assign feature for single word entry in first CSV column

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5723:
---
Comment: was deleted

(was: The change / fix in UIMA-4556 causes some problems when using a CSV file
with whitespace.

When param PARAM_DICT_REMOVE_WS is set to TRUE and WS are not visible in
the token stream:
- all items in the dictionary will be recognized
- all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.

If whitespace is visible, words with spaces won't be recognized.

The cause of this problem is that the default hardcoded value for ignoring
whitespace is always true:
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}

This is not correct: if whitespace is significant and you want it to be
respected, that won't work. This matcher should use the same value as set in
the PARAM_DICT_REMOVE_WS parameter or the value that is set via the setIgnoreWS
method.

I attached a patch to fix this issue. [^UIMA-5723.patch])

> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---
>
> Key: UIMA-5723
> URL: https://issues.apache.org/jira/browse/UIMA-5723
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Andreas Thiel
>Assignee: Peter Klügl
>Priority: Major
>
> When using Ruta's MARKTABLE action with a CSV file {{nl_law_names.csv}} like 
> this
> {code:xml}
> WAZ;WAZELF
> Wet arbeidsongeschiktheidsverzekering zelfstandigen;WAZELF
> {code}
> and corresponding Ruta script containing these lines
> {code:java}
> WORDTABLE LawNameTable = 'nl_law_names.csv';
> Document{->MARKTABLE(WetNaam, 1, LawNameTable, "WetIdentifier" = 2)};
> {code}
> it seems that the text {{WAZ}} is detected, but the {{WetIdentifier}} feature 
> of the resulting annotation is not filled by the string following the 
> semicolon. Instead, it remains empty.
> (Note: _WetNaam_ annotation is defined elsewhere via type system description)
> In contrast, the fully written name {{Wet arbeidsongeschiktheidsverzekering 
> zelfstandigen}} is detected and processed as expected with feature 
> WetIdentifier = WAZELF after annotating.
> Could it be that problems arise when only a single word (i.e. no spaces or 
> uppercase letters following lowercase chars) is present in the first column 
> in the CSV file? Or is it a matter of configuration?
> We experimented also with the optional arguments of MARKTABLE regarding 
> uppercase/lowercase distinction, but to no avail.
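
As a reference for the result the report expects, the lookup can be sketched in plain Java. This is not how MARKTABLE is implemented; MarkTableExpectationSketch and buildLookup are hypothetical names, and the sketch only assumes what the report states: column 1 holds the text to match, column 2 holds the value for the WetIdentifier feature, and the separator is a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not Ruta's MARKTABLE implementation.
// All names in this class are hypothetical.
final class MarkTableExpectationSketch {

    /**
     * Builds a lookup from the first CSV column (text to match)
     * to the second CSV column (expected feature value).
     */
    static Map<String, String> buildLookup(String csv) {
        Map<String, String> lookup = new LinkedHashMap<>();
        for (String line : csv.split("\n")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(";", -1);
            if (cells.length >= 2) {
                lookup.put(cells[0], cells[1]);
            }
        }
        return lookup;
    }

    public static void main(String[] args) {
        // The two rows of nl_law_names.csv quoted in the report.
        String csv = "WAZ;WAZELF\n"
            + "Wet arbeidsongeschiktheidsverzekering zelfstandigen;WAZELF\n";
        Map<String, String> lookup = buildLookup(csv);
        // Expected: the single-word entry and the full name yield the same value.
        System.out.println(lookup.get("WAZ"));
        System.out.println(lookup.get("Wet arbeidsongeschiktheidsverzekering zelfstandigen"));
    }
}
```

Under this reading, both rows should produce WetIdentifier = WAZELF, so an empty feature for the single-word entry WAZ is the deviation being reported.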





[jira] [Updated] (UIMA-5723) MARKTABLE fails to assign feature for single word entry in first CSV column

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5723:
---
Attachment: (was: UIMA-5723.patch)

> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---
>
> Key: UIMA-5723
> URL: https://issues.apache.org/jira/browse/UIMA-5723
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Andreas Thiel
>Assignee: Peter Klügl
>Priority: Major
>





[jira] [Commented] (UIMA-5723) MARKTABLE fails to assign feature for single word entry in first CSV column

2018-03-19 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404876#comment-16404876
 ] 

Jasper Huzen commented on UIMA-5723:


The change / fix in UIMA-4556 causes some problems when using a CSV file with
whitespace.

When param PARAM_DICT_REMOVE_WS is set to TRUE and WS are not visible in
the token stream:
- all items in the dictionary will be recognized
- all items will also be recognized if you add whitespace between words. For
example: IlikeRUTA, Ilike Ruta, I like Ruta all result in the same match.

If whitespace is visible, words with spaces won't be recognized.

The cause of this problem is that the default hardcoded value for ignoring
whitespace is always true:
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}

This is not correct: if whitespace is significant and you want it to be
respected, that won't work. This matcher should use the same value as set in
the PARAM_DICT_REMOVE_WS parameter or the value that is set via the setIgnoreWS
method.

I attached a patch to fix this issue. [^UIMA-5723.patch]

> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---
>
> Key: UIMA-5723
> URL: https://issues.apache.org/jira/browse/UIMA-5723
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Andreas Thiel
>Assignee: Peter Klügl
>Priority: Major
> Attachments: UIMA-5723.patch
>
>





[jira] [Updated] (UIMA-5723) MARKTABLE fails to assign feature for single word entry in first CSV column

2018-03-19 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5723:
---
Attachment: UIMA-5723.patch

> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---
>
> Key: UIMA-5723
> URL: https://issues.apache.org/jira/browse/UIMA-5723
> Project: UIMA
>  Issue Type: Bug
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Andreas Thiel
>Assignee: Peter Klügl
>Priority: Major
> Attachments: UIMA-5723.patch
>
>





[jira] [Commented] (UIMA-5736) Add option to CSVTable to use custom column separator

2018-03-15 Thread Jasper Huzen (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400272#comment-16400272
 ] 

Jasper Huzen commented on UIMA-5736:


Thanks (y)

> Add option to CSVTable to use custom column separator
> -
>
> Key: UIMA-5736
> URL: https://issues.apache.org/jira/browse/UIMA-5736
> Project: UIMA
>  Issue Type: Improvement
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Assignee: Peter Klügl
>Priority: Minor
>  Labels: patch
> Fix For: 2.6.1ruta
>
> Attachments: UIMA-5736.patch
>
>
> We need to use a custom separator in the CSV files because the default
> separator (semicolon) is also used in sentences.
> Other people are asking for this feature as well:
> [https://stackoverflow.com/questions/45647512/how-can-i-change-the-seperator-option-in-wordtable-uima-ruta]
> I have already changed the Ruta implementation to make this feature available.
> See the attachment for the changes and test cases.





[jira] [Updated] (UIMA-5736) Add option to CSVTable to use custom column separator

2018-03-02 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5736:
---
Attachment: UIMA-5736.patch

> Add option to CSVTable to use custom column separator
> -
>
> Key: UIMA-5736
> URL: https://issues.apache.org/jira/browse/UIMA-5736
> Project: UIMA
>  Issue Type: Improvement
>  Components: Ruta
>Affects Versions: 2.6.1ruta
>Reporter: Jasper Huzen
>Priority: Minor
>  Labels: patch
> Attachments: UIMA-5736.patch
>
>
> We need to use a custom separator in the CSV files because the default 
> separator (semicolon) also occurs within sentences.
> Other people are asking for this feature as well: 
> [https://stackoverflow.com/questions/45647512/how-can-i-change-the-seperator-option-in-wordtable-uima-ruta]
> I have already changed the Ruta implementation to make this feature 
> available. See the attachment for the changes and test cases.





[jira] [Updated] (UIMA-5736) Add option to CSVTable to use custom column separator

2018-03-02 Thread Jasper Huzen (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Huzen updated UIMA-5736:
---
Attachment: (was: Add option to CSVTable to use custom column separator.patch)

> Add option to CSVTable to use custom column separator
> -
>
> Key: UIMA-5736
> URL: https://issues.apache.org/jira/browse/UIMA-5736
> Project: UIMA
> Issue Type: Improvement
> Components: Ruta
> Affects Versions: 2.6.1ruta
> Reporter: Jasper Huzen
> Priority: Minor
> Labels: patch
> Attachments: UIMA-5736.patch
>
>
> We need to use a custom separator in the CSV files because the default 
> separator (semicolon) also occurs within sentences.
> Other people are asking for this feature as well: 
> [https://stackoverflow.com/questions/45647512/how-can-i-change-the-seperator-option-in-wordtable-uima-ruta]
> I have already changed the Ruta implementation to make this feature 
> available. See the attachment for the changes and test cases.





[jira] [Created] (UIMA-5736) Add option to CSVTable to use custom column separator

2018-03-02 Thread Jasper Huzen (JIRA)
Jasper Huzen created UIMA-5736:
--

Summary: Add option to CSVTable to use custom column separator
Key: UIMA-5736
URL: https://issues.apache.org/jira/browse/UIMA-5736
Project: UIMA
Issue Type: Improvement
Components: Ruta
Affects Versions: 2.6.1ruta
Reporter: Jasper Huzen
Attachments: Add option to CSVTable to use custom column separator.patch

We need to use a custom separator in the CSV files because the default 
separator (semicolon) also occurs within sentences.

Other people are asking for this feature as well: 
[https://stackoverflow.com/questions/45647512/how-can-i-change-the-seperator-option-in-wordtable-uima-ruta]

I have already changed the Ruta implementation to make this feature 
available. See the attachment for the changes and test cases.
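
The idea behind the request can be sketched as follows. This is a minimal illustration in plain Java and not the code from the attached patch. The class name `SeparatorSplit`, the method `parseTable`, the `|` separator and the sample row are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SeparatorSplit {

    // Splits every non-empty line of a table on a caller-supplied separator.
    // Pattern.quote makes separators such as "|" or "." match literally.
    static List<String[]> parseTable(String content, String separator) {
        String regex = Pattern.quote(separator);
        List<String[]> rows = new ArrayList<>();
        for (String line : content.split("\\R")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(line.split(regex, -1)); // -1 keeps trailing empty cells
        }
        return rows;
    }

    public static void main(String[] args) {
        // A cell that contains a semicolon stays intact when "|" is the separator.
        String content = "first clause; second clause|ID1\n";
        for (String[] row : parseTable(content, "|")) {
            System.out.println(String.join(" <-> ", row));
        }
    }
}
```

With the default semicolon the same row would be cut in the middle of the sentence, which is the problem described above.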


