Re: files vs strings in collection reader

Tim Miller Wed, 29 May 2013 10:27:35 -0700

This collection reader latency issue was harder to test than expected -- the first run took ~20 minutes to load and the second took a negligible amount of time, presumably due to caching effects. But given our other conversation on a "big data" direction using UIMA-AS, there is a potential solution out there.

UIMA-AS doesn't require Collection Readers -- you just deploy some number of pipelines, and then you can write a bit of code that creates CASes and adds them to a queue, asynchronously if desired. So once we get something like that up and running, we can give users/devs a rule of thumb: if you're regularly processing more than ~10k documents, it's probably better to use UIMA-AS anyway, and then you'll get the benefits of the asynchronous methods.
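
To make the queue idea concrete, here is a rough sketch of the general pattern only. It uses plain java.util.concurrent rather than the UIMA-AS client API: Strings stand in for CASes, worker threads stand in for deployed pipelines, and the class and method names (QueueFedPipelines, processAll) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical analogy, NOT UIMA-AS code: documents are pushed onto a queue
// and a fixed number of worker "pipelines" pull from it concurrently.
class QueueFedPipelines {

    // Returns the total number of documents processed across all workers.
    static int processAll(List<String> documents, int pipelineCount)
            throws InterruptedException, ExecutionException {
        // An empty Optional is the end-of-input marker for a worker.
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(pipelineCount);
        List<Future<Integer>> results = new ArrayList<>();

        for (int i = 0; i < pipelineCount; i++) {
            Callable<Integer> worker = () -> {
                int processed = 0;
                while (true) {
                    Optional<String> doc = queue.take();
                    if (!doc.isPresent()) {
                        return processed; // end marker seen, stop this worker
                    }
                    // Placeholder: a real pipeline would analyze doc.get() here.
                    processed++;
                }
            };
            results.add(pool.submit(worker));
        }

        // Producer side: no collection reader, just code adding work to the queue.
        for (String document : documents) {
            queue.put(Optional.of(document));
        }
        // One end marker per worker so every worker terminates.
        for (int i = 0; i < pipelineCount; i++) {
            queue.put(Optional.empty());
        }

        int total = 0;
        for (Future<Integer> result : results) {
            total += result.get();
        }
        pool.shutdown();
        return total;
    }
}
```

In this sketch the producer loop could just as well run on its own thread while the workers are already consuming, which is the asynchronous part of the idea.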


Tim

On 05/07/2013 03:49 PM, Tim Miller wrote:

This sounds like a job for... science! I'll try some experiments and see if it makes a difference.
Tim

On 05/07/2013 03:42 PM, Masanz, James J. wrote:
Do you have any numbers on what sort of impact this will actually have? It's not clear to me where the savings would come from, since we are instantiating objects either way. Should we just be initializing the ArrayList to something other than the default size?
-- James
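
For reference, a minimal sketch of the pre-sizing alternative raised above, assuming the file names can be listed up front. PresizedFileList and listFiles are hypothetical names, not code from the existing reader, and whether this saves anything measurable is untested here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the File list with a known initial capacity
// so the ArrayList never has to grow and copy its backing array.
class PresizedFileList {

    static List<File> listFiles(File directory) {
        String[] names = directory.list(); // null if not a directory or on I/O error
        if (names == null) {
            throw new IllegalArgumentException("Not a readable directory: " + directory);
        }
        List<File> files = new ArrayList<>(names.length); // capacity known up front
        for (String name : names) {
            files.add(new File(directory, name));
        }
        return files;
    }
}
```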
-----Original Message-----
From: [email protected][mailto:dev-
[email protected]] On Behalf Of Tim
Miller
Sent: Tuesday, May 07, 2013 2:18 PM
To: [email protected]
Subject: files vs strings in collection reader
The FilesInDirectoryCollectionReader creates an ArrayList of java.io.File
objects when it is initialized. For large datasets (~50k files) this is
substantial time overhead and probably memory as well. Seems like it would
be more efficient to use Strings instead of Files there and just open the
File object when getNext() is called. It is pretty easy to implement, any
downside to making this switch?
Tim
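
A minimal sketch of the proposed switch, for illustration only. LazyPathReader is a hypothetical stand-in, not the actual FilesInDirectoryCollectionReader, and it assumes a flat directory of UTF-8 text files; whether it is actually faster is the open question in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader: keeps only path Strings at initialization and
// creates the File object when a document is actually requested.
class LazyPathReader {
    private final List<String> paths;
    private int index = 0;

    LazyPathReader(File directory) {
        String[] names = directory.list(); // null if not a directory or on I/O error
        if (names == null) {
            throw new IllegalArgumentException("Not a readable directory: " + directory);
        }
        Arrays.sort(names); // deterministic order for this sketch
        paths = new ArrayList<>(names.length);
        for (String name : names) {
            // Store the path as a String; no File is kept per document.
            paths.add(directory.getPath() + File.separator + name);
        }
    }

    boolean hasNext() {
        return index < paths.size();
    }

    // Opens the next file only now. Assumes every entry is a regular file;
    // a subdirectory would cause an IOException here.
    String getNext() throws IOException {
        File file = new File(paths.get(index++));
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }
}
```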
