This is probably a better question for the user list. Extending a ContentHandler and using it in a ContentHandlerDecorator is pretty straightforward.
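As a rough sketch of what such a handler could look like: the class below buffers SAX character events and hands the text to a callback in bounded pieces, so the full content never has to sit in one String. The class name ChunkingContentHandler and its chunkSize and sink parameters are made up for illustration and are not part of Tika; it uses only the JDK's org.xml.sax classes.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not a Tika class): collects character events and
 * passes the text to a callback in chunks of at most chunkSize chars.
 */
final class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize < 2) {
            throw new IllegalArgumentException("chunkSize must be at least 2");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit full chunks as soon as enough text has accumulated.
        while (buffer.length() >= chunkSize) {
            int end = chunkSize;
            // Do not cut a surrogate pair in half (matters for some CJK
            // characters); leave the high surrogate for the next chunk.
            if (Character.isHighSurrogate(buffer.charAt(end - 1))) {
                end--;
            }
            flush(end);
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left once parsing has finished.
        if (buffer.length() > 0) {
            flush(buffer.length());
        }
    }

    private void flush(int end) {
        sink.accept(buffer.substring(0, end));
        buffer.delete(0, end);
    }
}
```

The sink could append each chunk to a file or feed it to whatever processes the text. With Tika on the classpath, something like new BodyContentHandler(new ChunkingContentHandler(8192, sink)) passed to parse() should work, but I have not tried that here. Note that the chunks are cut purely by length, so a chunk can still end in the middle of a word.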
Would it be easy enough to write to a file by passing in an OutputStream to WriteOutContentHandler?

-----Original Message-----
From: ruby [mailto:[email protected]]
Sent: Thursday, August 28, 2014 2:07 PM
To: [email protected]
Subject: TIKA - how to read chunks at a time from a very large file?

Using ContentHandler, is there a way to read chunks at a time from a very large file (over 5GB)? Right now I'm doing the following to read the entire content at once:

    InputStream stream = new FileInputStream(file);
    Parser p = new AutoDetectParser();
    Metadata meta = new Metadata();
    WriteOutContentHandler handler = new WriteOutContentHandler(-1);
    ParseContext context = new ParseContext();
    p.parse(stream, handler, meta, context);
    String content = handler.toString();

Since the files contain over 5GB of data, the content string will end up holding too much data in memory. I want to avoid this and read a chunk at a time. I tried ParsingReader, and I can read chunks with it, but we are splitting on words. Some of the files contain Chinese/Japanese words, so we can't split on whitespace either.

--
View this message in context: http://lucene.472066.n3.nabble.com/TIKA-how-to-read-chunks-at-a-time-from-a-very-large-file-tp4155644.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
