Re: [MCN-L] OCR Software

Jim Salmons Thu, 19 Nov 2015 12:34:03 -0800

Mark,

Mia is "spot on" about ABBYY being the 900-lb player in the room on the 
commercial software side of things. And the Transkribus project is an 
interesting choice when moving your requirements more into the handwriting 
recognition/transcription area.

The first helpful thing to understand is that, despite their similarity, OCR of 
print documents and of handwritten documents (and the systems that handle them) 
are very different. So to better answer your question, it is important to know 
how "massive" is massive. And given that total, what is the breakdown between 
typewritten and handwritten source material?

You may find that thinking about and processing your collections will be best 
served by keeping these workflows separate in both the technical and human 
processes that you develop.

For OCR of typewritten source material, even if your project will be scanning 
thousands of documents, a good question to ask is how many people will be doing 
the scanning at any one time. If that number is low and you just want to "get 
it done" and move on, you can't beat the "prosumer" ABBYY product, FineReader. 
It is available in both Mac and Windows versions.

The biggest gotcha using ABBYY's excellent FineReader product is that it does 
not generate the abbyy.xml file that is provided by the ABBYY OCR Engine (a 
separate "enterprise"/developer product). This file is essential if you need 
to derive any metadata from the information generated by the process of OCRing. 
But if the lack of this "hardcore" OCR metadata is no problem, you may find 
that FineReader is a good choice.
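
To give a feel for the kind of metadata that file carries, here is a minimal 
Python sketch that pulls block types, bounding boxes, and per-character 
confidence out of an ABBYY-style XML document. Treat it as an illustration 
only: the inline SAMPLE is a hypothetical, heavily trimmed stand-in for a real 
abbyy.xml file, the helper names are my own, and I am assuming the usual 
element and attribute names (block/blockType, l/t/r/b, 
charParams/charConfidence). Check them against the files your own OCR engine 
produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a real abbyy.xml file.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page width="2480" height="3508" resolution="300">
    <block blockType="Text" l="200" t="150" r="2200" b="400">
      <text><par><line baseline="390" l="210" t="160" r="340" b="395">
        <formatting lang="EnglishUnitedStates">
          <charParams l="210" t="160" r="250" b="390" charConfidence="96">O</charParams>
          <charParams l="255" t="160" r="295" b="390" charConfidence="88">C</charParams>
          <charParams l="300" t="160" r="340" b="390" charConfidence="41">R</charParams>
        </formatting>
      </line></par></text>
    </block>
    <block blockType="Picture" l="200" t="500" r="1200" b="1500"/>
  </page>
</document>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def summarize_blocks(xml_text):
    """Return one dict per <block>: its type, bounding box, text, and mean confidence."""
    root = ET.fromstring(xml_text)
    blocks = []
    for element in root.iter():
        if local_name(element.tag) != "block":
            continue
        chars = [c for c in element.iter() if local_name(c.tag) == "charParams"]
        confidences = [
            int(c.get("charConfidence"))
            for c in chars
            if c.get("charConfidence") is not None
        ]
        blocks.append({
            "type": element.get("blockType"),
            "box": tuple(int(element.get(k)) for k in ("l", "t", "r", "b")),
            "text": "".join(c.text or "" for c in chars),
            # None when the block has no recognized characters (e.g. a picture).
            "mean_confidence": (
                sum(confidences) / len(confidences) if confidences else None
            ),
        })
    return blocks


if __name__ == "__main__":
    for block in summarize_blocks(SAMPLE):
        print(block)
```

Layout zones and confidence values like these are what you lose when the OCR 
tool only hands you a searchable PDF or plain text.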

While automated handwriting recognition is possible and getting better all the 
time, digitizing handwritten source material is still almost exclusively the 
domain of human transcription workflows, like Transkribus.

The closest thing I have found to a "sweet spot" of a tool to handle both print 
and handwritten workflows is PRImA's desktop Aletheia (and its WebAletheia 
version).

As you research your options, you may find some of my recent writing useful:

    * a 5-part series on Transkribus culminates with a piece that looks at the 
Transkribus Recognition Platform as a Social Machine and envisions "raising the 
bar" on crowdsourcing Citizen Science (https://goo.gl/ALTJYY - each article has 
a set of links to the other parts)

    * these three "Ground Truth &..." articles are about PRImA and Aletheia: 
"Ground Truth & Softalk Magazine" (https://goo.gl/JwAKwr) and "Ground Truth & 
the Internet Archive" (https://goo.gl/ZSM6n2) and "Ground Truth & the Knight 
Prototype Fund" (https://goo.gl/YR70Yf)

The third of the Ground Truth articles is about our current FactMiners 
collaboration with PRImA and eMOP (the Early Modern OCR Project out of Texas 
A&M's IDHMC). For "going to the source" of what is doable with regard to the 
intersection of OCR and document transcription, you will want to scour both 
their websites at:

        http://www.primaresearch.org/ and http://emop.tamu.edu/

Hope this helps.

    Happy-Healthy Vibes,
    -: Jim :-

    Jim Salmons
    Twitter: @Jim_Salmons, @FactMiners, @Softalk_Apple
    www.FactMiners.org 
    www.SoftalkApple.com 

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf
> Of Mia
> Sent: Thursday, November 19, 2015 9:51 AM
> To: Museum Computer Network Listserv <[email protected]>
> Subject: Re: [MCN-L] OCR Software
> 
> Abbyy (http://www.abbyy.com/) seems to be the market leader for OCR. I'm
> sure others will have examples of training OCR packages on their material and
> building OCR into their digitisation workflow. Platforms like the Internet
> Archive also produce OCR texts from uploaded files, and other tools are listed
> at http://www.digitisation.eu/tools-resources/demonstrator-platform/
> 
> We've been experimenting with Transkribus (
> https://transkribus.eu/Transkribus/) for handwritten text recognition.
> 
> Best regards,
> 
> Mia
> 
> 
> Dr Mia Ridge
> Digital Curator, British Library
> 
> --------------------------------------------
> http://openobjects.org.uk/
> http://twitter.com/mia_out
> Check out my book! http://bit.ly/CrowdsourcingCulturalHeritage
> I mostly use this address for list mail; my open.ac.uk address is checked 
> daily
> 
> On 18 November 2015 at 17:53, Locker, Mark <[email protected]> wrote:
> 
> > Hi,
> > We are beginning a massive digitization project in which we will be
> > scanning thousands of documents. Most will probably be typewritten but
> > definitely will have handwritten documents as well. Anyone out there
> > using OCR software, especially anything that manages handwriting
> > successfully?
> > Thanks,
> > Mark
> >
> > Mark T. Locker
> > Data Manager, DNA
> > Office: 503.532.3280
> > Cell: 503.810.2461
> > dna.nike.com
> >
> >
> > _______________________________________________
> > You are currently subscribed to mcn-l, the listserv of the Museum
> > Computer Network (http://www.mcn.edu)
> >
> > To post to this list, send messages to: [email protected]
> >
> > To unsubscribe or change mcn-l delivery options visit:
> > http://mcn.edu/mailman/listinfo/mcn-l
> >
> > The MCN-L archives can be found at:
> > http://www.mail-archive.com/[email protected]/
> >
> >
