Hi,
Thanks for your question. Yes: you would set the byte size property for Tika-Server the same way you set it in Tika-App (through parser configuration, I think). Start the Tika Server yourself with a custom config file that sets this property, on the default port (making sure any other instances are killed first). Tika-Python will then use your own Tika Server with the custom config.

As for catching errors, Tika-Python tries its best, but it does not catch all of them. If you find something it doesn’t catch, let us know and we will work to fix it.

Thanks,
Chris

From: "[email protected]" <[email protected]>
Organization: Avident-IT
Date: Tuesday, October 8, 2019 at 6:06 AM
To: "Mattmann, Chris A (US 1761)" <[email protected]>
Subject: [EXTERNAL] Tika Python questions

Hi

I have had the pleasure of testing the Tika-python library. I am testing it out in a new application that is being developed for customers. It has very good performance, especially for parsing XLSX and XLS files. However, I have two questions:

1. The Tika-Server handles only files up to a maximum byte size. I get this error:

   org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type. increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

   I have tried the Tika-App (jar file), and it does handle files larger than 1000000. The Tika documentation says to set MaxBytes to -1 to override the limit and handle larger files. Is there any way to do this via Tika-Python, i.e. to set the maximum file size to unlimited the way the “Tika-App” handles it?

2. How is it possible to catch errors via the Tika-python library, for example when files are encrypted, corrupt, etc.?

Kind regards
HANS MEIJER
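
For illustration only, here is a minimal sketch of the two points in the reply above: pointing Tika-Python at a Tika Server you started yourself, and checking the result for failures that were not raised as exceptions. It assumes the server is listening on http://localhost:9998 and that the parser returns a dict with "status", "content" and "metadata" keys. The helper names classify_result and parse_with_own_server are hypothetical and not part of Tika-Python, and the "status is not 200 means failure" rule is an assumption about how the server reports encrypted or corrupt files, so check it against your own documents.

```python
import os

# Assumption: a Tika Server started by hand with a custom config listens here.
TIKA_ENDPOINT = os.environ.get("TIKA_SERVER_ENDPOINT", "http://localhost:9998")


def classify_result(parsed):
    """Classify a parser-style result dict as 'ok', 'empty' or 'failed'.

    Hypothetical helper: treats any status other than HTTP 200 (or a
    missing status) as a failure, and blank content as 'empty'.
    """
    if parsed.get("status") != 200:
        return "failed"
    content = parsed.get("content")
    if not content or not content.strip():
        return "empty"
    return "ok"


def parse_with_own_server(path, endpoint=TIKA_ENDPOINT):
    """Parse `path` through the given Tika Server; return (verdict, result)."""
    # Imported lazily so classify_result can be used without tika installed.
    from tika import parser

    try:
        parsed = parser.from_file(path, serverEndpoint=endpoint)
    except Exception as exc:  # connection problems, unreadable file, ...
        return "failed", {"error": str(exc)}
    return classify_result(parsed), parsed
```

A call such as parse_with_own_server("report.xlsx") would then return a verdict alongside the raw result, so the application can log or skip files that come back as "failed" or "empty" instead of relying on exceptions alone.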
