Hello Tim Allison,

I am currently working on improving Tika's OCR capabilities. Following a suggestion from Thamme Gowda (@thammegowda <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=thammegowda>), I have started comparing Tesseract 4.0's neural network <https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00> subsystem with the Visual Geometry Group's (VGG) models <http://www.robots.ox.ac.uk/~vgg/research/text/>.
It would be great if you could provide the dataset for testing the OCR, as you mentioned in one of the issues. I plan to compare the two on running time for evaluation, accuracy, memory consumption, and invariance to lighting, orientation, etc., and then integrate the appropriate models into Tika's OCR.

Thank you,
Kranthi Kiran GV
CS 3/4 Undergrad, NIT Warangal
