Jukka Zitting
Thu, 21 Jan 2010 16:48:04 -0800
Hi,

Sorry for the late response...
On Tue, Jan 5, 2010 at 5:47 PM, Baldwin, David <david_bald...@bmc.com> wrote:
> I need to get a handle on how much memory Tika needs to token-ize different
> file types. In other words, I need to find information on required overhead
> (including copies of buffers made if applicable) so that I can produce some
> kind of guidelines for memory possibly needed by users of the product I am
> working on which uses Lucene/Tika.

Assuming you use Tika in streaming mode, memory use is moderate and typically does not depend on the size of the document being processed. Parsing complex documents like MS Office or PDF files can require up to a few megabytes of memory, while simple formats like plain text only need a few kilobytes.

In addition to the above estimates, you also need to take into account the memory that the JVM needs to load the Tika classes and the relevant parser libraries.

In total I'd estimate that you should get pretty far with some 20 megs of memory for Tika, unless you have dozens or more Tika parsing tasks running concurrently.

BR,

Jukka Zitting
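To make the sizing reasoning above concrete, here is a minimal back-of-the-envelope sketch in plain Java. It is not part of Tika. The class name TikaHeapEstimate, the additive model, and the 4 MB per-task figure (one reading of "a few megabytes") are assumptions made for illustration; only the roughly 20 MB baseline comes from the estimate in this message. Measure your own documents before relying on any of these numbers.

```java
// Rough heap budget for concurrent Tika parsing tasks.
// All constants are illustrative assumptions, not measured values.
final class TikaHeapEstimate {
    static final long MB = 1024L * 1024L;

    // Baseline taken from the "some 20 megs" estimate in this thread.
    static final long BASELINE_BYTES = 20 * MB;

    // Hypothetical peak per concurrent parse of a complex document
    // (MS Office, PDF); "a few megabytes" is read here as 4 MB.
    static final long PER_TASK_PEAK_BYTES = 4 * MB;

    // Simple additive model: baseline plus one peak per concurrent task.
    static long estimateBytes(int concurrentTasks) {
        if (concurrentTasks < 0) {
            throw new IllegalArgumentException("concurrentTasks must be >= 0");
        }
        return BASELINE_BYTES + (long) concurrentTasks * PER_TASK_PEAK_BYTES;
    }

    public static void main(String[] args) {
        for (int tasks : new int[] {1, 12, 48}) {
            System.out.printf("%d concurrent parses: about %d MB%n",
                    tasks, estimateBytes(tasks) / MB);
        }
    }
}
```

The model is deliberately pessimistic for plain text, which needs only kilobytes per parse, so it is best read as an upper bound for workloads dominated by complex formats.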