To confirm, Lucene does not perform OCR. (If you are looking for open source Java OCR packages, you might take a look here for some ideas: https://issues.apache.org/jira/i#browse/TIKA-93).

Are you trying to find a corpus of noisy OCR'd text to use as input to Lucene? If so, http://chroniclingamerica.loc.gov/ocr/ looks potentially useful, though I don't know how well its error rates match yours.

-----Original Message-----
From: Deniz Atak [mailto:deniza...@gmail.com]
Sent: Thursday, January 16, 2014 2:43 PM
To: java-user@lucene.apache.org
Subject: Sample Data to Test Lucene

Hi,

We are new to Lucene and would like to use it for our archive project. In this project we need to take images of documents, extract their text via OCR, and index that text with Lucene.

To see whether Lucene is suitable for our project, we need to test it with sample data, which means a large data set made up of images of documents. I searched the net but couldn't find anything. Could anyone suggest a source?

Thanks in advance,

--
Deniz