Hello, I have sent this to the Tika Linked before and got an answer from Jukka Zitting,
"It may be that the PDFBox library Tika uses for handling PDF documents is having a problem with parsing your files. Do you have an example file that you can share? BR,"

So here is the original mail and the attachments.

PDF file 1: https://docs.google.com/fileview?id=0B2X-v8a_ekanYmMyMzg1NTktMmFlMi00YjU2LTk2OWQtMTg2NTI1YWI4NTZh&hl=en
PDF file 2: https://docs.google.com/fileview?id=0B2X-v8a_ekanMTUyNjExMjUtMTI5Yy00NDc4LTg0YmYtODg4NmNkMGIxMmZk&hl=en

I'm trying to parse a PDF file. I first tried this code:

    InputStream input = new FileInputStream(new File(resourceLocation)); // the document to be parsed
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    ParseContext context = new ParseContext();
    parser.parse(input, textHandler, metadata, context);
    input.close();

Then I tried the Tika class:

    Tika tika = new Tika();
    InputStream input = new FileInputStream(new File(resourceLocation));
    Metadata metadata = new Metadata();
    String content = tika.parseToString(input, metadata);

Both snippets do exactly the same thing: they read some of the text in the PDF file but leave the rest of the file out. I tested with a 1 MB file and a 100 KB file.

I looked around and found a message in the Tika mails, "Tika maxStringLength limit reached", which suggested raising maxStringLength like this:

    tika.setMaxStringLength(10 * 1024 * 1024);

That made no difference.

Am I doing something wrong? How can I parse the entire file?

Cheers,
Ehsan
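One way to tell a length limit apart from a PDFBox parsing problem is to look at how many characters came back. If the extracted text stops exactly at the configured limit, the limit is the likely cause; if it stops well short of it, the parser is the more likely suspect. The sketch below uses only the standard library. `TruncationCheck`, `hitLimit` and `ASSUMED_DEFAULT_LIMIT` are hypothetical names, and the 100,000-character default is an assumption that should be checked against the Tika version in use.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: reports whether extracted text
// is at least as long as a given character limit.
final class TruncationCheck {

    // Assumed default limit (100 * 1000 characters); verify for your Tika version.
    static final int ASSUMED_DEFAULT_LIMIT = 100 * 1000;

    // Returns true when the content length has reached the limit.
    // A negative limit is treated as "no limit", so it can never be hit.
    static boolean hitLimit(String content, int limit) {
        return limit >= 0 && content.length() >= limit;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the parser.
        char[] filler = new char[ASSUMED_DEFAULT_LIMIT];
        Arrays.fill(filler, 'x');
        String extracted = new String(filler);

        System.out.println("characters extracted: " + extracted.length());
        System.out.println("limit reached: " + hitLimit(extracted, ASSUMED_DEFAULT_LIMIT));
    }
}
```

In practice the stand-in string would be replaced by the `content` returned from `tika.parseToString(input, metadata)`, and the check repeated after calling `setMaxStringLength` to see whether the number of extracted characters changes at all.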
