To confirm, Lucene does not perform OCR. (If you are looking for open source
Java OCR packages, you might take a look here for some ideas:
https://issues.apache.org/jira/i#browse/TIKA-93). Are you trying to find a
corpus of noisy OCR'd text to use as input to Lucene? If so, this looks
potentially useful: http://chroniclingamerica.loc.gov/ocr/. I don't know how
well its error rates match yours, though.
 
-----Original Message-----
From: Deniz Atak [mailto:deniza...@gmail.com] 
Sent: Thursday, January 16, 2014 2:43 PM
To: java-user@lucene.apache.org
Subject: Sample Data to Test Lucene

Hi,

We are new to Lucene and would like to use it for our archive project.
In this project we have to take images of documents, extract text from
them via OCR, and index that text using Lucene. To see whether Lucene is
suitable for our project, we need to test it with sample data, which means
we need a huge data set composed of images of documents. I searched the
net but couldn't find anything. Could anyone suggest a source for such
data?

Thanks in advance,

-- 
Deniz

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
