Re: [CODE4LIB] Extracting Text From .tiff Files

Stuart Yeates Mon, 12 May 2014 15:27:07 -0700

Your first step is to pin down the format. TIFF is a container format (like 
zip) and can contain pretty much anything. Likely candidates for your format 
include https://en.wikipedia.org/wiki/IPTC_Information_Interchange_Model and 
https://en.wikipedia.org/wiki/Extensible_Metadata_Platform


Your second step is to find a library / tool for your platform that supports 
your format. 
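
As a rough illustration of both steps, here is a minimal Python sketch that 
walks the first image file directory (IFD) of a classic TIFF and returns the 
raw bytes stored under each tag, so you can see which tags hold the XML and 
the text. The names read_tiff_tags, load_tags and KNOWN_TAGS are made up for 
this example. Tag 270 (ImageDescription), 700 (XMP) and 33723 (IPTC) are the 
standard ids, but whether ResCarta uses any of them is an assumption you 
would need to check against your own files. The sketch only reads the first 
IFD and does not handle BigTIFF.

```python
import struct

# Standard tag ids; the labels are only for display.
KNOWN_TAGS = {270: "ImageDescription", 700: "XMP", 33723: "IPTC"}

# Size in bytes of a single value for each TIFF field type.
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def read_tiff_tags(data):
    """Return {tag_id: raw_bytes} for the first IFD of a classic TIFF."""
    # The first two bytes give the byte order: "II" little, "MM" big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, type, value count, value/offset.
        start = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        size = TYPE_SIZES.get(ftype, 1) * n
        if size <= 4:
            # Small values are stored inline in the entry itself.
            raw = data[start + 8:start + 8 + size]
        else:
            # Larger values live elsewhere in the file, at this offset.
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + size]
        tags[tag] = raw
    return tags


def load_tags(path):
    """Read a file in binary mode and return its first-IFD tags."""
    # Binary mode matters: text mode may mangle bytes that are not valid
    # in the platform's default encoding.
    with open(path, "rb") as f:
        return read_tiff_tags(f.read())
```

Calling load_tags on one of your page files and printing the tag ids and 
value lengths should show where the large text blocks sit. Once you know the 
tag, decode only that value, for example with 
raw.decode("utf-8", errors="replace"), rather than decoding the whole file 
as text. UTF-8 is a guess here; the XML declaration inside the block may name 
the real encoding.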

Cheers
stuart

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Gavin 
Spomer
Sent: Tuesday, 13 May 2014 10:01 a.m.
To: [email protected]
Subject: [CODE4LIB] Extracting Text From .tiff Files

Hello folks, 

I'm in the process of migrating a student newspaper collection, currently 
implemented with ResCarta, into our new bepress institutional repository. 
ResCarta has each page of a newspaper stored as a tiff file. Not only does the 
tiff file contain the graphics data, but it has some metadata in xml format and 
the fulltext of the page. I know this because I opened up some of the tiffs 
with a plain-text editor (Vim). 

Although I can see the text in the file, I've only been about 90% accurate in 
extracting it with a script. Some of those "weird" characters seem to do some 
wonky things when doing file IO for some reason. Is there a more reliable way 
to extract text stored in a tiff file? I've Googled and Googled and have pulled 
up almost nothing. But there's got to be a way, since ResCarta stores it there 
and can extract it. 

Any ideas? 
Gavin Spomer
Systems Programmer
Brooks Library
Central Washington University
