Hi there,

My name is Iain Fraser and I'm a software developer from Perth in Western Australia. Recently, I was tasked with improving the parsing performance of an ASP.NET application that was using tika-app via the command line. Without getting too far into it, the solution looks like running tika-server and bumping up the maxMainMemoryBytes parameter of the PDFParser. We needed to raise that value because parsing of large files was unacceptably slow, and this was a sticking point preventing the team from using tika-server (given that tika-app exhibited no such issue for us).

I'm getting in touch today because I think I can help with documentation. I was only able to arrive at the solution I did through intense googling, reading message boards, experimentation and, eventually, just reading the source code. Perhaps I can help cut that process short for someone else in the future?

Specifically, I had no idea that the config XML could accept <params> elements under the <parser> element. Furthermore, I couldn't find any documentation showing the parameters available and what they do. Through my work, I have extracted a list of possible parameters for PDFParser, as well as comments from the implementing developer, which I'd really like to document somewhere outside of the source. I might also add something to your "Troubleshooting Tika" section about bumping up main memory when you get slow performance with large or complex PDF files (with some sample XML), and perhaps even a note to Windows users about why their config.xml files might not work when they extract them from the app via the command line (check that the encoding is actually UTF-8; PowerShell outputs UTF-16).

In order to do this, I would need create/edit access to the Tika space of your Confluence app. Would it be possible to arrange this, please? I already have an account there under my name and this email address ([email protected]).

Kind regards,
Iain
