[jira] [Updated] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Rinat Gareyev (JIRA) Tue, 14 Aug 2012 09:54:39 -0700

     [ 
https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rinat Gareyev updated UIMA-2455:
--------------------------------

    Description: 
Example rule:
A B C{NOT(PARTOF(D))->MARK(D,3)};

Example text:
aText bText cText cMoreText

where the annotations cover the following tokens:
A = aText
B = bText
C = cText
C = cText cMoreText

The rule produces the following:
D = cText

However, I expected:
D = cText cMoreText

The cause of the actual behaviour is the 
org.apache.uima.textmarker.rule.AnnotationComparator#compare implementation, 
which orders a shorter annotation before a longer one. As a result, the 
sequence 'aText bText cText' is matched first and D is created over 'cText'. 
The sequence 'aText bText cText cMoreText' is considered later and no longer 
passes the NOT(PARTOF(D)) condition, so it is not matched.
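
To illustrate the effect of the ordering, here is a minimal Java sketch. It is 
not the TextMarker AnnotationComparator: Span is a hypothetical stand-in for an 
annotation, and the offsets assume the example text with single spaces between 
tokens. It only shows how the two orderings rank two C annotations that begin 
at the same offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class AnchorOrderingSketch {

    /** Hypothetical stand-in for an annotation: a type name plus begin/end offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + ")";
        }
    }

    /** Shorter annotation first when both begin at the same offset (the behaviour reported here). */
    static final Comparator<Span> SHORTER_FIRST =
            Comparator.<Span>comparingInt(s -> s.begin).thenComparingInt(s -> s.end);

    /** Longer annotation first when both begin at the same offset (the behaviour I expected). */
    static final Comparator<Span> LONGER_FIRST =
            Comparator.<Span>comparingInt(s -> s.begin)
                    .thenComparing(Comparator.<Span>comparingInt(s -> s.end).reversed());

    /** Returns a sorted copy so the input list is left untouched. */
    static List<Span> sorted(List<Span> spans, Comparator<Span> order) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        // "aText bText cText cMoreText": both C annotations begin at offset 12.
        List<Span> cAnchors = Arrays.asList(new Span("C", 12, 27), new Span("C", 12, 17));
        System.out.println(sorted(cAnchors, SHORTER_FIRST)); // [C[12,17), C[12,27)]
        System.out.println(sorted(cAnchors, LONGER_FIRST));  // [C[12,27), C[12,17)]
    }
}
```

With the shorter-first ordering the rule element tries 'cText' first, which 
matches the result described above.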

I discovered this after migrating to the latest TextMarker sources (from the 
ASF repo). Before that we used the version from Sourceforge.net. In the old 
(Sourceforge) version this problem did not arise because TextMarkerBasic could 
keep only one annotation per Type as a 'begin anchor'. In the example, this 
means that the TextMarkerBasic for 'cText' held only one C annotation as its 
begin anchor.

In the current version (rev. 1371274) TextMarkerBasic keeps a set of begin and 
end anchors per Type, which is a good improvement. However, I suggest making 
the ordering of the anchored annotations returned by the 
TextMarkerRuleElement#getNextAnnotations(boolean, AnnotationFS, 
TextMarkerStream) method configurable, e.g. by adding a parameter to 
TextMarkerEngine or to the script that selects the 
AnnotationComparator#compare implementation.
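
One possible shape for such a parameter, as a rough sketch only: the AnchorOrder 
enum, its value names and the fromParameter helper are all hypothetical and do 
not exist in TextMarker. Spans are plain {begin, end} offset pairs to keep the 
example self-contained.

```java
import java.util.Comparator;
import java.util.Locale;

public class ConfigurableOrderSketch {

    /** Hypothetical values for an engine or script parameter; not part of TextMarker. */
    enum AnchorOrder {
        SHORTER_FIRST,
        LONGER_FIRST;

        /**
         * Parses a parameter value. A missing or blank value falls back to
         * SHORTER_FIRST, i.e. the behaviour reported in this issue.
         * An unknown value makes valueOf throw IllegalArgumentException.
         */
        static AnchorOrder fromParameter(String value) {
            if (value == null || value.trim().isEmpty()) {
                return SHORTER_FIRST;
            }
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }

        /** Comparator over {begin, end} offset pairs for the selected ordering. */
        Comparator<int[]> comparator() {
            Comparator<int[]> byBegin = Comparator.comparingInt(span -> span[0]);
            Comparator<int[]> byEnd = Comparator.comparingInt(span -> span[1]);
            return this == LONGER_FIRST
                    ? byBegin.thenComparing(byEnd.reversed())
                    : byBegin.thenComparing(byEnd);
        }
    }

    public static void main(String[] args) {
        int[] shortC = {12, 17}; // C = cText
        int[] longC = {12, 27};  // C = cText cMoreText
        Comparator<int[]> order = AnchorOrder.fromParameter("longer_first").comparator();
        // A negative result means longC is ordered before shortC.
        System.out.println(order.compare(longC, shortC) < 0);
    }
}
```

Keeping the current ordering as the default would leave existing scripts 
unaffected unless they set the parameter.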

Also, returning longer annotations before shorter ones seems to be more 
consistent with the default UIMA indexing. See 
http://uima.apache.org/d/uimaj-2.4.0/references.html#ugr.ref.cas.index.built_in_indexes

  was:
Example rule:
A B C{-PARTOF(D)->MARK(D,3)};

Example text:
aText bText cText cMoreText

where following correspondence between annotations and tokens are held:
A = aText
B = bText
C = cText
C = cText cMoreText

Rule results in the following:
D = cText

However I expect that:
D = cText cMoreText

The reason of actual behaviour is 
org.apache.uima.textmarker.rule.AnnotationComparator#compare implementation. It 
returns a shorter annotation before longer. That is why the sequence 'aText 
bText cText' will be matched and sequence 'aText bText cText cMoreText' will 
not because it will be considered later and will not pass NOT PARTOF condition.

I've revealed this after migration to the latest TextMarker sources (from ASF 
repo). Before we used the one from Sourceforge.net. In the old (sourceforge) 
version this problem did not arise because TextMarkerBasic could keep only one 
annotation per Type as 'begin anchor'. Returning to the example this means that 
'cText' TextMarkerBasic held only one C annotation as begin anchor.

In current (rev. 1371274) version TextMarkerBasic keeps a set of begin and end 
anchors per Type. This is actually a good improvement.
But I suggest to make ordering of anchored annotations returned by 
TextMarkerRuleElement#getNextAnnotations(boolean, AnnotationFS, 
TextMarkerStream) method more controllable.
E.g., by adding some parameter for TextMarkerEngine or script which will define 
AnnotationComparator#compare implementation.

Also returning longer annotations before shorter ones seems to be more 
compliant to the UIMA default indexing. See 
http://uima.apache.org/d/uimaj-2.4.0/references.html#ugr.ref.cas.index.built_in_indexes

    
> Make ordering of getNextAnnotations result configurable
> -------------------------------------------------------
>
>                 Key: UIMA-2455
>                 URL: https://issues.apache.org/jira/browse/UIMA-2455
>             Project: UIMA
>          Issue Type: New Feature
>          Components: TextMarker
>            Reporter: Rinat Gareyev

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
