My assumption in making that recommendation was that a given document wouldn't
split a word across an "element".  I can, of course, think of exceptions (a word
break at the end of a PDF page, for example), but generally this shouldn't
happen very often.  However, if it does happen often with your documents, or if
a single element is too large to hold in memory, then that recommendation won't
work, and you'll probably have to write to disk.
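As a rough sketch of the chunking idea, here is a plain SAX handler that buffers text and only flushes at whitespace, so no word is split across chunks. The class name, the maxChunk parameter, and collecting chunks into a list are all my own illustration (in practice you'd write each chunk to disk, and in Tika you'd pass the handler to parser.parse(...)):

```java
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffers character events and emits chunks of at most
// maxChunk characters, cutting only at whitespace so words stay whole.
class ChunkingHandler extends DefaultHandler {
    private final int maxChunk;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> chunks = new ArrayList<>(); // stand-in for writing to disk

    ChunkingHandler(int maxChunk) {
        this.maxChunk = maxChunk;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        while (buffer.length() >= maxChunk) {
            // Find the last whitespace within the first maxChunk characters.
            int cut = -1;
            for (int i = maxChunk - 1; i >= 0; i--) {
                if (Character.isWhitespace(buffer.charAt(i))) { cut = i; break; }
            }
            // A single "word" longer than maxChunk: wait for more input
            // (or handle it however your application prefers).
            if (cut <= 0) break;
            chunks.add(buffer.substring(0, cut));
            buffer.delete(0, cut + 1); // also drop the separating whitespace
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {       // flush whatever remains
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    List<String> getChunks() {
        return chunks;
    }
}
```

For example, feeding "hello world foo bar" through this handler with maxChunk = 10 yields the chunks "hello", "world foo", and "bar" rather than breaking mid-word.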

________________________________________
From: ruby [[email protected]]
Sent: Thursday, August 28, 2014 3:26 PM
To: [email protected]
Subject: Re: TIKA - how to read chunks at a time from a very large file?

If I extend the ContentHandler, then is there a way to make sure that I don't
split words?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TIKA-how-to-read-chunks-at-a-time-from-a-very-large-file-tp4155644p4155673.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
