This collection reader latency issue was harder to test than expected --
the first run took ~20 minutes to load and the second took a negligible
amount of time, presumably due to caching effects. But given our other
conversation on a "big data" direction using UIMA-AS, there is a
potential solution out there.
UIMA-AS doesn't require Collection Readers -- you just deploy some
number of pipelines and then write a bit of code that creates CASes
and adds them to a queue, asynchronously if desired. So when we get
something like that up and running, we can give users/devs a rule
of thumb: if you're regularly processing more than ~10k documents,
it's probably better to use UIMA-AS anyway, and then you'll get the
benefits of the asynchronous methods.
Tim
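To make the "deploy some pipelines, then add work to a queue" idea above concrete, here is a rough sketch of the pattern using only the JDK. It does not use the UIMA-AS client API: the worker threads stand in for deployed pipelines, plain Strings stand in for CASes, and the names AsyncSubmitSketch, process and processAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// JDK-only illustration of the pattern; NOT the UIMA-AS API.
class AsyncSubmitSketch {

    // Stand-in for a deployed pipeline: it only measures the text length.
    static int process(String documentText) {
        return documentText.length();
    }

    // Queues every document without waiting for earlier ones to finish,
    // then collects the results in submission order.
    static List<Integer> processAll(List<String> documents, int pipelines)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(pipelines);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> process(doc);
                pending.add(pool.submit(task)); // returns immediately
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The point is only that the caller decides when documents are created and queued, so nothing has to enumerate the whole collection up front the way a Collection Reader does.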
On 05/07/2013 03:49 PM, Tim Miller wrote:
This sounds like a job for... science! I'll try some experiments and
see if it makes a difference.
Tim
On 05/07/2013 03:42 PM, Masanz, James J. wrote:
Do you have any numbers on what sort of impact this will actually
have? It's not clear to me where the savings would come from, since
we'd be instantiating objects either way. Should we just be
initializing the ArrayList to something other than the default
capacity?
-- James
-----Original Message-----
From: [email protected] [mailto:dev-[email protected]] On Behalf Of Tim Miller
Sent: Tuesday, May 07, 2013 2:18 PM
To: [email protected]
Subject: files vs strings in collection reader
The FilesInDirectoryCollectionReader creates an ArrayList of
java.io.File objects when it is initialized. For large datasets (~50k
files) this is a substantial time overhead, and probably a memory
overhead as well. It seems like it would be more efficient to store
Strings there instead of Files and only create the File object when
getNext() is called. It is pretty easy to implement; is there any
downside to making this switch?
Tim
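A minimal sketch of the proposed switch, assuming a flat directory with no subdirectories and UTF-8 text files. LazyFileList is a hypothetical stand-in, not the actual FilesInDirectoryCollectionReader code, and whether this really saves time or memory is exactly what still needs measuring.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical stand-in for the reader's file list (not the cTAKES class).
class LazyFileList {
    private final File directory;
    private final String[] names; // only file names are held, not File objects
    private int index = 0;

    LazyFileList(File directory) {
        this.directory = directory;
        // File.list() returns plain names; it returns null when the path
        // is not a readable directory.
        String[] listed = directory.list();
        if (listed == null) {
            throw new IllegalArgumentException(
                    "Not a readable directory: " + directory);
        }
        Arrays.sort(listed); // deterministic order, for this sketch only
        this.names = listed;
    }

    boolean hasNext() {
        return index < names.length;
    }

    // The File object is created only when a document is actually requested.
    String getNext() throws IOException {
        File file = new File(directory, names[index++]);
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

One tradeoff: anything that needs a File up front, such as filtering out subdirectories or filtering by file attributes, would bring the per-file object creation back at initialization time.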