Mark,
Mia is "spot on" about ABBYY being the 900-lb player in the room on the
commercial software side of things. And the Transkribus project is an
interesting choice when moving your requirements more into the handwriting
recognition/transcription area.
The first helpful thing is to understand that, despite their similarity, OCR of
print documents and OCR of handwritten documents (and the systems that handle
them) are very different. So to better answer your question, it is important to
know how "massive" is massive. And given that total, what is the breakdown
between typewritten and handwritten source material?
You may find that thinking about and processing your collections will be best
served by keeping these workflows separate in both the technical and human
processes that you develop.
For OCR of typewritten source material, even if your project will be scanning
thousands of documents, a good question to ask is how many human scanners will
you have working at any one time? If the actual number of scanners is low and
you just want to "get it done" and move on, you can't beat the "prosumer" ABBYY
product, FineReader. It is available in both Mac and Windows versions.
The biggest gotcha using ABBYY's excellent FineReader product is that it does
not generate the abbyy.xml file that is provided by the ABBYY OCR Engine (a
separate "enterprise"/developer product). This file is essential if you need
to derive any metadata from the information generated by the process of OCRing.
But if the lack of this "hardcore" OCR metadata is no problem, you may find
that FineReader is a good choice.
While automated handwriting recognition is possible and getting better all the
time, digitizing handwritten source material is still almost exclusively the
domain of human transcription workflows, like those Transkribus supports.
The closest thing I have found to a "sweet spot" of a tool to handle both print
and handwritten workflows is PRImA's desktop Aletheia (and its WebAletheia
version).
As you research your options, you may find some of my recent writing useful:
* a 5-part series on Transkribus culminates with a piece that looks at the
Transkribus Recognition Platform as a Social Machine and envisions "raising the
bar" on crowdsourcing Citizen Science (https://goo.gl/ALTJYY - each article has
a set of links to the other parts)
* these three "Ground Truth &..." articles are about PRImA and Aletheia:
"Ground Truth & Softalk Magazine" (https://goo.gl/JwAKwr) and "Ground Truth &
the Internet Archive" (https://goo.gl/ZSM6n2) and "Ground Truth & the Knight
Prototype Fund" (https://goo.gl/YR70Yf)
The third of the Ground Truth articles is about our current FactMiners
collaboration with PRImA and eMOP (the Early Modern OCR Project out of Texas
A&M's IDHMC). For "going to the source" of what is doable with regard to the
intersection of OCR and document transcription, you will want to scour both
their websites at:
http://www.primaresearch.org/ and http://emop.tamu.edu/
Hope this helps.
Happy-Healthy Vibes,
-: Jim :-
Jim Salmons
Twitter: @Jim_Salmons, @FactMiners, @Softalk_Apple
www.FactMiners.org
www.SoftalkApple.com
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf
> Of Mia
> Sent: Thursday, November 19, 2015 9:51 AM
> To: Museum Computer Network Listserv <[email protected]>
> Subject: Re: [MCN-L] OCR Software
>
> Abbyy (http://www.abbyy.com/) seems to be the market leader for OCR. I'm
> sure others will have examples of training OCR packages on their material and
> building OCR into their digitisation workflow. Platforms like the Internet
> Archive also produce OCR texts from uploaded files, and other tools are listed
> at http://www.digitisation.eu/tools-resources/demonstrator-platform/
>
> We've been experimenting with Transkribus (
> https://transkribus.eu/Transkribus/) for handwritten text recognition.
>
> Best regards,
>
> Mia
>
>
> Dr Mia Ridge
> Digital Curator, British Library
>
> --------------------------------------------
> http://openobjects.org.uk/
> http://twitter.com/mia_out
> Check out my book! http://bit.ly/CrowdsourcingCulturalHeritage
> I mostly use this address for list mail; my open.ac.uk address is checked
> daily
>
> On 18 November 2015 at 17:53, Locker, Mark <[email protected]> wrote:
>
> > Hi,
> > We are beginning a massive digitization project in which we will be
> > scanning thousands of documents. Most will probably be typewritten but
> > definitely will have handwritten documents as well. Anyone out there
> > using OCR software, especially anything that manages handwriting
> > successfully?
> > Thanks,
> > Mark
> >
> > Mark T. Locker
> > Data Manager, DNA
> > Office: 503.532.3280
> > Cell: 503.810.2461
> > dna.nike.com
> >
> >
> > _______________________________________________
> > You are currently subscribed to mcn-l, the listserv of the Museum
> > Computer Network (http://www.mcn.edu)
> >
> > To post to this list, send messages to: [email protected]
> >
> > To unsubscribe or change mcn-l delivery options visit:
> > http://mcn.edu/mailman/listinfo/mcn-l
> >
> > The MCN-L archives can be found at:
> > http://www.mail-archive.com/[email protected]/
> >
> >