My assumption in making that recommendation was that a given document wouldn't split a word across an "element" boundary. I can, of course, think of exceptions (a word broken at the end of a PDF page, for example), but in general I'd expect that to be rare. However, if it does happen often with your documents, or if a single element is too large to hold in memory, then that recommendation won't work, and you'll probably have to write to disk.
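
Something along these lines is roughly what I mean (an untested sketch only; ChunkingHandler, chunkSize and the body of flush() are placeholders I've made up for illustration, not an existing Tika class): it extends DefaultHandler, buffers the character events, and only flushes a chunk when an element ends, so a break never lands inside a word as long as the assumption above holds.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Sketch: buffers text from SAX events and emits a chunk whenever an
 * element ends and the buffer has grown past a threshold, so chunk
 * boundaries fall between elements rather than inside a word.
 */
public class ChunkingHandler extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;   // rough target chunk size in characters
    private int chunkCount = 0;

    public ChunkingHandler(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only flush at element boundaries; assumes a word is never split
        // across two elements (see the PDF page-break caveat above).
        if (buffer.length() >= chunkSize) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {
            flush();
        }
    }

    private void flush() {
        // Placeholder: do whatever you need with the chunk here --
        // index it, write it to disk, hand it to another thread, etc.
        System.out.printf("chunk %d: %d chars%n", ++chunkCount, buffer.length());
        buffer.setLength(0);
    }

    public static void main(String[] args)
            throws IOException, SAXException, TikaException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new ChunkingHandler(100_000), new Metadata());
        }
    }
}

In flush() you'd plug in whatever you actually do with each chunk; if even a single element is too big to hold in memory, that's the point at which you'd spill to disk instead.
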
________________________________________
From: ruby [[email protected]]
Sent: Thursday, August 28, 2014 3:26 PM
To: [email protected]
Subject: Re: TIKA - how to read chunks at a time from a very large file?

If I extend the ContentHandler, then is there a way to make sure that I don't split on words?
