Re: Collection processing

Drenski Mon, 15 Nov 2010 02:36:52 -0800

Eddie Epstein <eaepst...@...> writes:

> 
> Is the analysis of each document to be done independently of
> the others? For example, annotation offsets are relative to the
> beginning of each document. If not, the documents can be
> concatenated together and analyzed at the same time.
> 
> If the documents are to be considered independently, the
> annotator has to process each separately. One could
> create a view for each document and let the annotator
> iterate over all views. Of course since the CAS is memory
> resident there is a natural limit to the total size of all
> documents to be processed in this way.
> 
> On Sun, Nov 14, 2010 at 10:10 AM, Drenski <milen_dren...@...> wrote:
> > Hi,
> > I am new to UIMA and I have been struggling for some time
> > with the following problem.
> > I have some documents which I need to process simultaneously.
> > So I implemented a collection reader that reads all the files
> > from a directory and annotates them as Documents. But how can
> > I put all these files in an array, for example, so that I can
> > iterate over them and do my further processing? Basically I
> > just want to fetch the files from the directory and put
> > them in an array so that I can process them.
> > Is a CAS consumer what I need? I saw in the docs that
> > it is now deprecated. Or should I use an index like Lucene?
> > But I guess that would be too complex for my simple task.
> > I would appreciate any suggestions.
> > Regards,
> > Drenski


Thank you for your reply!
My goal is to cluster these documents. As input,
the clustering needs a list of feature vectors,
where each feature vector represents a single
document. I implemented the clustering as
an annotator. My first idea was to use
a collection reader to read the documents
and put each one in a list that I can
use for the clustering. But I can't figure out
where and how to store the documents so that
I can use them after all of them have been read,
because the collection reader reads one document
and then sends it to the annotator.
Regards,
Drenski
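
To make the question concrete, here is a rough sketch of the pattern I have in mind: keep one feature vector per document as each document arrives, and only use the whole list once the last document has been seen. It deliberately uses no UIMA types so that it compiles on its own. The class name ClusteringCollector and the word-count "feature vector" are made up for illustration. As far as I understand, in a real UIMA annotator the per-document hook would be process() and the end-of-collection hook would be collectionProcessComplete(); treat that mapping as an assumption to check against the docs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for an annotator that sees one document per call.
 * Plain Java only; the method names merely mirror the UIMA hooks.
 */
final class ClusteringCollector {

    // One term-frequency vector per document, in arrival order.
    private final List<Map<String, Integer>> featureVectors = new ArrayList<>();

    /** Called once per document: build its feature vector and keep it. */
    public void process(String documentText) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String token : documentText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum); // count occurrences
            }
        }
        featureVectors.add(vector);
    }

    /** Called once after the last document: all vectors are available now. */
    public List<Map<String, Integer>> collectionProcessComplete() {
        List<Map<String, Integer>> snapshot = new ArrayList<>(featureVectors);
        featureVectors.clear(); // reset so the next collection starts empty
        return snapshot;        // the actual clustering would run on this list
    }

    public static void main(String[] args) {
        ClusteringCollector collector = new ClusteringCollector();
        collector.process("uima cas uima");
        collector.process("cas consumer");
        List<Map<String, Integer>> vectors = collector.collectionProcessComplete();
        System.out.println(vectors.size() + " vectors: " + vectors);
    }
}
```

The state lives in an instance field, so this only works if a single instance of the component sees every document; with several parallel instances each one would hold only part of the collection.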
