This collection reader latency issue was harder to test than expected -- the first run took ~20 minutes to load and the second took a negligible amount of time, presumably due to caching effects. But given our other conversation on a "big data" direction using UIMA-AS, there may be a solution there.

UIMA-AS doesn't require Collection Readers -- you just deploy some number of pipelines and then write a bit of code that creates CASes and adds them to a queue, asynchronously if desired. Once we get something like that up and running, we can give users/devs a rule of thumb: if you're regularly processing more than ~10k documents, it's probably better to use UIMA-AS anyway, and you'll get the benefits of the asynchronous methods.
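
To make the queue idea concrete, here is a rough sketch in plain Java. It is only a stand-in for the pattern and does not use the UIMA-AS API: the class name QueueSketch, the processAll method, and the use of Strings in place of CASes are all made up for illustration, and the "analysis" step is just a counter.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for "deploy N pipelines, then feed a queue". Not UIMA-AS code. */
class QueueSketch {
    // Unique sentinel object that tells a worker to stop; compared by identity below.
    private static final String STOP = new String("STOP");

    static int processAll(List<String> documents, int workers) throws InterruptedException {
        // Bounded queue, so the producer blocks instead of holding every document in memory.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        AtomicInteger processed = new AtomicInteger();

        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == STOP) {
                            return;
                        }
                        // Placeholder for running a real pipeline on the document.
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }

        // Producer side: no collection reader, just add work to the queue.
        for (String doc : documents) {
            queue.put(doc);
        }
        // One stop marker per worker so every thread exits.
        for (int i = 0; i < workers; i++) {
            queue.put(STOP);
        }
        for (Thread t : pool) {
            t.join();
        }
        return processed.get();
    }
}
```

In a real deployment the queue and the workers would be provided by UIMA-AS itself, so only the producer loop would be ours to write.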

Tim

On 05/07/2013 03:49 PM, Tim Miller wrote:
This sounds like a job for... science! I'll try some experiments and see if it makes a difference.
Tim

On 05/07/2013 03:42 PM, Masanz, James J. wrote:
Do you have any numbers on what sort of impact this will actually have? It's not clear to me where the savings would come from, since we're instantiating objects either way. Should we just be initializing the ArrayList to something other than the default size?

-- James


-----Original Message-----
From: [email protected] [mailto:dev-[email protected]] On Behalf Of Tim Miller
Sent: Tuesday, May 07, 2013 2:18 PM
To: [email protected]
Subject: files vs strings in collection reader

The FilesInDirectoryCollectionReader creates an ArrayList of java.io.File
objects when it is initialized. For large datasets (~50k files) this is
substantial time overhead, and probably memory overhead as well. It seems
like it would be more efficient to store Strings instead of Files there and
just create the File object when getNext() is called. It is pretty easy to
implement; is there any downside to making this switch?
Tim
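
A minimal sketch of the proposed switch, assuming a flat directory: only the entry names are kept as Strings, and a File is built when the next document is requested. The class name LazyPathReader is hypothetical, and it is a plain Iterator rather than an implementation of the UIMA collection reader interface, so it shows the idea and is not a patch to FilesInDirectoryCollectionReader.

```java
import java.io.File;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: stores entry names as Strings and creates each File on demand. */
class LazyPathReader implements Iterator<File> {
    private final File directory;
    private final String[] names;
    private int next = 0;

    LazyPathReader(File directory) {
        this.directory = directory;
        // File.list() returns the entry names as Strings, so no per-entry File is created here.
        String[] listed = directory.list();
        if (listed == null) {
            throw new IllegalArgumentException("Not a readable directory: " + directory);
        }
        this.names = listed;
    }

    @Override
    public boolean hasNext() {
        return next < names.length;
    }

    @Override
    public File next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // The File object is created only now, when the caller asks for the next document.
        return new File(directory, names[next++]);
    }
}
```

One caveat with this sketch: it does not recurse into subdirectories or filter them out, and checking whether an entry is a regular file would itself require a File object, which would then have to happen lazily inside next() as well.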

