Ken Krugler
Fri, 15 Jan 2010 11:20:06 -0800
On Jan 15, 2010, at 11:07am, Doug Carter wrote:
Hi all,

This may be off-topic for this list, but I need to start somewhere. I need a command line utility to do document format conversion in a batch mode environment. The batch process is a combination of steps, one of which is the actual format conversion, which is currently being done by a collection of Linux binary converters like wvWare, pdftohtml, etc.

I've put a shell script wrapper around the tika jar:

java -jar tika-app.jar [infile] > [outfile]

This works OK, but as you would imagine, it is much slower compared to a Linux binary. Does anyone know of a way to improve the performance in a setup like this? I know it goes against the whole philosophy of Java, but is there a way to compile the Tika jar byte code into a native Linux binary? I've taken a look at gcj, but it doesn't look like a simple re-compile.

Any ideas would be greatly appreciated.
If you have a set of documents, the easiest approach would be to extend tika-app a bit so that it accepts a directory. A single invocation of the JVM then processes many documents, and you pay the JVM startup cost once per batch rather than once per file.
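
A minimal sketch of that idea, assuming a small driver of your own rather than an existing tika-app option. The names `BatchConvert`, `Converter` and `convertAll` are hypothetical, and the converter used in `main` is only a stand-in that copies the file contents as UTF-8 text so the sketch runs without Tika on the classpath. With the Tika jar available, that stand-in could presumably be replaced by something like `p -> new Tika().parseToString(p.toFile())`; check that against the Tika version you are using, and note that it yields plain text rather than the XHTML that tika-app prints by default.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical batch driver: one JVM start, many documents.
final class BatchConvert {

    /** Hypothetical hook that turns one input file into extracted text. */
    interface Converter {
        String convert(Path input) throws Exception;
    }

    /**
     * Converts every regular file directly inside inDir and writes the result
     * to outDir/<original name>.txt. Returns the number of files converted.
     * Assumes inDir and outDir are different directories.
     */
    static int convertAll(Path inDir, Path outDir, Converter converter) throws IOException {
        Files.createDirectories(outDir);
        int converted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir)) {
            for (Path input : files) {
                if (!Files.isRegularFile(input)) {
                    continue; // skip subdirectories and special files
                }
                Path output = outDir.resolve(input.getFileName() + ".txt");
                try {
                    String text = converter.convert(input);
                    Files.write(output, text.getBytes(StandardCharsets.UTF_8));
                    converted++;
                } catch (Exception e) {
                    // One bad document should not abort the whole batch.
                    System.err.println("Failed: " + input + ": " + e);
                }
            }
        }
        return converted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BatchConvert <inDir> <outDir>");
            System.exit(2);
        }
        // Stand-in converter (assumption): reads the file as UTF-8 text.
        // This is where a real Tika call would go.
        Converter standIn = p -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), standIn);
        System.out.println("Converted " + n + " file(s)");
    }
}
```

The shell wrapper would then call this once per batch, for example `java BatchConvert indir outdir`, instead of starting a new JVM for each input file.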
-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g