On 10/3/11 3:37 PM, Eddie Epstein wrote:
> As Marshall pointed out above, a CAS can have many CAS Views, each
> with its own artifact. An analysis pipeline knows where these
> artifacts come from and can set metadata appropriately, but a unique
> ID for a stored copy of the CAS might best be determined by the
> persistent CAS storage system where the CAS is to be stored.

To summarize what has been said:

A unique ID per CAS seems to be useful for logging and debugging in
user code, because the IDs logged by the framework can be correlated
with the IDs logged by user code.

A single CAS ID might not work in complex type systems which use
multiple views, because each sofa in a multi-view CAS might have a
different source ID.

Besides that, there are UIMA pipelines which always store a complete
CAS object in some kind of storage. There the CAS ID can simply be the
unique storage ID. This could, for example, be a file in a file system
or an HBase row key. As pointed out, this might not work for complex
cases, but it could be helpful for simpler UIMA pipelines.

Our Solrcas AE could also use the CAS ID by default if the user does
not specify a Document ID Feature Structure. In my applications this
would actually work quite well.

More complex applications could also decide to use the MIME type or
features in a view as additional information to complement the CAS ID
in a newly created view, in order to compute a storage ID. For
example, a UIMA pipeline might translate the input document text to
English and then store the new text in a new English view. The code
can then compute an ID for that view which is based on the unique
CAS ID.
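
As a minimal sketch of that last idea, assuming the CAS-wide ID is
already available as a plain String: the helper below combines it with
the name of the new view to form a per-view storage ID. The class name
ViewStorageIds, the method forView and the '#' separator are
hypothetical and only serve as illustration; nothing here is existing
UIMA API.

```java
// Hypothetical helper, not part of UIMA: derives a per-view storage ID
// from a CAS-wide ID and the name of a view.
final class ViewStorageIds {

    private ViewStorageIds() {
        // static helper, no instances
    }

    // Joins the CAS ID and the view name with '#', e.g. "doc-42#english".
    // The separator is an arbitrary choice for this sketch.
    static String forView(String casId, String viewName) {
        if (casId == null || casId.isEmpty()) {
            throw new IllegalArgumentException("CAS ID must be set");
        }
        if (viewName == null || viewName.isEmpty()) {
            throw new IllegalArgumentException("view name must be set");
        }
        return casId + "#" + viewName;
    }
}
```

In the translation example, the view name would be whatever name the
pipeline gave the English view when creating it, so the derived ID
stays stable as long as the CAS ID and the view naming do.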

In the end I believe a simple CAS ID field could be quite useful: for
debugging and logging, as a document ID in simple UIMA pipelines, and
for applications which deal with whole CASes (e.g. the Cas Editor based
annotation tooling, or an AE which extracts "problematic" CASes from an
analysis pipeline for inspection).

To implement this I suggest that we extend the CAS interface with
CAS.setId(String) and CAS.getId() methods.
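
To make the proposal concrete, here is a rough sketch of the two
methods and of how user code might use the ID in a log line. It does
not touch the real CAS interface: IdentifiableCas,
SimpleIdentifiableCas and CasIdLogging are hypothetical names that
model only the proposed setId/getId pair, and returning null for an
unset ID is an assumption of this sketch.

```java
// Hypothetical: models only the two proposed methods,
// not the real UIMA CAS interface.
interface IdentifiableCas {
    void setId(String id);

    // Assumed to return null while no ID has been set.
    String getId();
}

// Trivial in-memory implementation, for illustration only.
final class SimpleIdentifiableCas implements IdentifiableCas {
    private String id;

    @Override
    public void setId(String id) {
        this.id = id;
    }

    @Override
    public String getId() {
        return id;
    }
}

// Shows how user code could prefix its log messages with the CAS ID,
// so they can be matched against framework log messages for the same CAS.
final class CasIdLogging {

    private CasIdLogging() {
        // static helper, no instances
    }

    static String logLine(IdentifiableCas cas, String message) {
        String id = cas.getId();
        return "[cas=" + (id == null ? "unset" : id) + "] " + message;
    }
}
```

A storage-backed pipeline could call setId with its file name or HBase
row key right after loading the CAS, and every later log line built
this way would carry that ID.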
Jörn