[ 
https://issues.apache.org/jira/browse/UIMA-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13683297#comment-13683297
 ] 

Jens Grivolla commented on UIMA-2670:
-------------------------------------

Yes, I know it's in the "example" section, but I'm still very surprised by the 
reader's behavior. I understood lastSegment to mean that the CAS is the last 
segment of a series of CASes with the same document id and different source 
offsets (as obtained when segmenting a document into different CASes): "For a 
CAS that represents a segment of a larger source document, this flag indicates 
whether this CAS is the final segment of the source document."

I don't think that using the flag as an end-of-collection marker is compatible 
with this definition. Is there actually an example using it to mean "end of 
collection"? I have a hard time seeing a use case for this, but if it is 
actually being used this way we could look into making the behavior 
configurable.

Our own code does use the lastSegment feature of SourceDocumentInformation when 
working with segments, and is therefore incompatible with the 
FileSystemCollectionReader, which is annoying when people want to use it with 
our pipelines.
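
To make the intended semantics concrete, here is a minimal sketch of how a 
segmenter would fill in these features. It does not use the UIMA API: the 
Segment class and the split method are hypothetical stand-ins that only mirror 
the uri, offsetInSource and lastSegment features of SourceDocumentInformation. 
Every segment keeps the document's uri, carries its own offset, and only the 
final one has lastSegment set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of UIMA.
final class SegmentSketch {

    // Plain holder mirroring three SourceDocumentInformation features.
    static final class Segment {
        final String uri;
        final int offsetInSource;
        final String text;
        final boolean lastSegment;

        Segment(String uri, int offsetInSource, String text, boolean lastSegment) {
            this.uri = uri;
            this.offsetInSource = offsetInSource;
            this.text = text;
            this.lastSegment = lastSegment;
        }
    }

    // Splits a document into pieces of at most maxLen characters.
    static List<Segment> split(String uri, String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<Segment> out = new ArrayList<>();
        if (text.isEmpty()) {
            // An empty document is a single, complete segment.
            out.add(new Segment(uri, 0, "", true));
            return out;
        }
        for (int start = 0; start < text.length(); start += maxLen) {
            int end = Math.min(start + maxLen, text.length());
            // lastSegment is true only for the piece that reaches the end.
            out.add(new Segment(uri, start, text.substring(start, end),
                    end == text.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Segment s : split("file:/docs/a.txt", "abcdefg", 3)) {
            System.out.println(s.uri + " offset=" + s.offsetInSource
                    + " lastSegment=" + s.lastSegment);
        }
    }
}
```

Under this reading, a document that fits into a single segment comes out with 
offsetInSource 0 and lastSegment true, which is what downstream code relies on.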
                
> FileSystemCollectionReader doesn't set lastSegment correctly
> ------------------------------------------------------------
>
>                 Key: UIMA-2670
>                 URL: https://issues.apache.org/jira/browse/UIMA-2670
>             Project: UIMA
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 2.4.0SDK
>            Reporter: Jens Grivolla
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> FileSystemCollectionReader only sets lastSegment=true (in the 
> SourceDocumentInformation) on the last document. Given that it loads 
> individual documents, not segments of a document, this should be "true" for 
> each CAS that it generates.
> This is a problem when later using a CAS multiplier to segment the CAS. It 
> should be possible to check whether the CAS is a complete document or a 
> segment by testing for "offsetInSource==0 && lastSegment==true".
> In org.apache.uima.examples.cpe.FileSystemCollectionReader:166,
>     srcDocInfo.setLastSegment(mCurrentIndex == mFiles.size());
> should be:
>     srcDocInfo.setLastSegment(true);
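
The check described in the issue can be sketched as follows. This is an 
illustration under stated assumptions, not UIMA code: DocInfo is a hypothetical 
stand-in for the two SourceDocumentInformation features involved, and 
readerCurrent / readerProposed only model the two setLastSegment expressions 
quoted above, assuming the index has already been advanced past the file being 
read.

```java
// Hypothetical illustration only; not part of UIMA.
final class SegmentCheck {

    // Plain holder for the two features the check looks at.
    static final class DocInfo {
        final int offsetInSource;
        final boolean lastSegment;

        DocInfo(int offsetInSource, boolean lastSegment) {
            this.offsetInSource = offsetInSource;
            this.lastSegment = lastSegment;
        }
    }

    // A CAS holds a whole document if it starts at offset 0 and is also
    // the final segment of that document.
    static boolean isCompleteDocument(DocInfo info) {
        return info.offsetInSource == 0 && info.lastSegment;
    }

    // Models the current expression: only the last file in the collection
    // gets lastSegment=true. currentIndex counts files read so far (1-based).
    static DocInfo readerCurrent(int currentIndex, int fileCount) {
        return new DocInfo(0, currentIndex == fileCount);
    }

    // Models the proposed expression: every file is a complete document.
    static DocInfo readerProposed() {
        return new DocInfo(0, true);
    }

    public static void main(String[] args) {
        int fileCount = 3;
        for (int i = 1; i <= fileCount; i++) {
            System.out.println("file " + i
                    + " current=" + isCompleteDocument(readerCurrent(i, fileCount))
                    + " proposed=" + isCompleteDocument(readerProposed()));
        }
    }
}
```

With the current expression, every file except the last one fails the check 
even though each was read whole, so a downstream CAS multiplier cannot tell an 
unsegmented document from the first segment of a longer one.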

