Re: [ol-tech] understanding inside.py

Karen Coyle Sat, 09 Nov 2013 15:03:06 -0800

Sara, have you looked at the abbyy file? I don't know if you've parsed 
the file or if there are functions you are using that would make sense 
of it (I can't find an inside.py that does this), but as it was explained 
to me (and from what I can see in the file), the abbyy file only has 
individual letters, not words. So this is an entry in the file:


<charParams l="1929" t="1695" r="1936" b="1721" wordStart="true" 
wordFromDictionary="false" wordNormal="true" wordNumeric="false" 
wordIdentifier="false" charConfidence="32" serifProbability="255" 
wordPenalty="2" meanStrokeWidth="24">i</charParams>

The "i" there is the character by itself. If I'm not totally all wet, 
this block of code contains the word "Digitized":

<charParams l="662" t="1790" r="728" b="1874" wordStart="true" 
wordFromDictionary="true" wordNormal="true" wordNumeric="false" 
wordIdentifier="false" charConfidence="100" serifProbability="0" 
wordPenalty="0" meanStrokeWidth="105">D</charParams><charParams l="742" 
t="1790" r="752" b="1874" wordStart="false" wordFromDictionary="true" 
wordNormal="true" wordNumeric="false" wordIdentifier="false" 
charConfidence="100" serifProbability="255" wordPenalty="0" 
meanStrokeWidth="105">i</charParams><charParams l="762" t="1812" r="816" 
b="1898" wordStart="false" wordFromDictionary="true" wordNormal="true" 
wordNumeric="false" wordIdentifier="false" charConfidence="100" 
serifProbability="19" wordPenalty="0" 
meanStrokeWidth="105">g</charParams><charParams l="830" t="1790" r="840" 
b="1874" wordStart="false" wordFromDictionary="true" wordNormal="true" 
wordNumeric="false" wordIdentifier="false" charConfidence="100" 
serifProbability="255" wordPenalty="0" 
meanStrokeWidth="105">i</charParams><charParams l="850" t="1798" r="878" 
b="1876" wordStart="false" wordFromDictionary="true" wordNormal="true" 
wordNumeric="false" wordIdentifier="false" charConfidence="100" 
serifProbability="28" wordPenalty="0" 
meanStrokeWidth="105">t</charParams><charParams l="886" t="1790" r="896" 
b="1874" wordStart="false" wordFromDictionary="true" wordNormal="true" 
wordNumeric="false" wordIdentifier="false" charConfidence="100" 
serifProbability="255" wordPenalty="0" 
meanStrokeWidth="105">i</charParams><charParams l="908" t="1814" r="956" 
b="1874" wordStart="false" wordFromDictionary="true" wordNormal="true" 
wordNumeric="false" wordIdentifier="false" charConfidence="100" 
serifProbability="255" wordPenalty="0" 
meanStrokeWidth="105">z</charParams><charParams l="964" t="1812" 
r="1018" b="1876" wordStart="false" wordFromDictionary="true" 
wordNormal="true" wordNumeric="false" wordIdentifier="false" 
charConfidence="100" serifProbability="40" wordPenalty="0" 
meanStrokeWidth="105">e</charParams><charParams l="1026" t="1790" 
r="1080" b="1876" wordStart="false" wordFromDictionary="true" 
wordNormal="true" wordNumeric="false" wordIdentifier="false" 
charConfidence="100" serifProbability="0" wordPenalty="0" 
meanStrokeWidth="105">d</charParams><charParams l="1080" t="1790" 
r="1124" b="1876" wordStart="false" wordFromDictionary="false" 
wordNormal="false" wordNumeric="false" wordIdentifier="false" 
charConfidence="255" serifProbability="255" wordPenalty="0" 
meanStrokeWidth="0"> </charParams>

Now, how to go from this to a page position for the *word* is TOTALLY 
beyond me. If anyone knows how to do this it is Mike McCabe of the 
Archive, who either wrote or understands the code that transforms the 
abbyy files into djvu, epub, mobi, etc. He doesn't follow this list, 
though. You should be able to find him through the IA staff pages.
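
For what it's worth, here is a rough sketch of one way the per-character 
boxes might be combined into a word box. It is only a guess at an approach, 
not what inside.py or the Archive's conversion code actually does. It 
assumes a word begins at wordStart="true" and ends at a whitespace 
character, and that the word's box is the union of its letters' boxes. The 
function name words_with_boxes and the trimmed-down sample are made up for 
illustration, and the full abbyy file may declare an XML namespace, in 
which case the tag name would need to be qualified.

```python
import xml.etree.ElementTree as ET


def words_with_boxes(xml_fragment):
    """Group <charParams> elements into words and union their boxes.

    Assumption: a word starts at wordStart="true" and ends at whitespace.
    Returns a list of (text, (left, top, right, bottom)) tuples.
    """
    # Wrap the fragment so it parses as one well-formed document.
    root = ET.fromstring("<line>" + xml_fragment + "</line>")
    words = []
    chars, box = [], None

    def flush():
        # Emit the word collected so far, if any, and reset the state.
        nonlocal chars, box
        if chars:
            words.append(("".join(chars), tuple(box)))
        chars, box = [], None

    for cp in root.iter("charParams"):
        ch = cp.text or ""
        if cp.get("wordStart") == "true" or not ch.strip():
            flush()
        if not ch.strip():
            continue  # whitespace only separates words
        l, t, r, b = (int(cp.get(k)) for k in ("l", "t", "r", "b"))
        if box is None:
            box = [l, t, r, b]
        else:
            box = [min(box[0], l), min(box[1], t),
                   max(box[2], r), max(box[3], b)]
        chars.append(ch)
    flush()
    return words


# The "Digitized" entry above, reduced to the attributes used here.
sample = (
    '<charParams l="662" t="1790" r="728" b="1874" wordStart="true">D</charParams>'
    '<charParams l="742" t="1790" r="752" b="1874" wordStart="false">i</charParams>'
    '<charParams l="762" t="1812" r="816" b="1898" wordStart="false">g</charParams>'
    '<charParams l="830" t="1790" r="840" b="1874" wordStart="false">i</charParams>'
    '<charParams l="850" t="1798" r="878" b="1876" wordStart="false">t</charParams>'
    '<charParams l="886" t="1790" r="896" b="1874" wordStart="false">i</charParams>'
    '<charParams l="908" t="1814" r="956" b="1874" wordStart="false">z</charParams>'
    '<charParams l="964" t="1812" r="1018" b="1876" wordStart="false">e</charParams>'
    '<charParams l="1026" t="1790" r="1080" b="1876" wordStart="false">d</charParams>'
    '<charParams l="1080" t="1790" r="1124" b="1876" wordStart="false"> </charParams>'
)
print(words_with_boxes(sample))
# → [('Digitized', (662, 1790, 1080, 1898))]
```

The bottom edge comes out as 1898 rather than 1874 because the "g" 
descends below the other letters.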

Good luck!

kc

On 11/7/13 12:04 PM, Sara Amato wrote:
> I'm playing around with the 'search inside' feature of the BookReader,
> and am trying to understand the inside.py code.
>
> I've grabbed a sample set of data from the Internet Archive
> (https://ia600206.us.archive.org/19/items/yearbook1981east/) and have
> loaded the .djvu.txt file into a single field in solr, and am using the
> .abbyy.gz file as the file to parse for word coordinates. The querying
> of solr and finding a match works great, but I never find a match in the
> abbyy.gz file.
>
> What appears to be happening, to my novice eye that is still trying to
> grasp python generators, is that the line:
>
>          if re_braces.sub('', cur['text']) != abbyy_text:
>
> always returns false if the abbyy_text has a unicode character in it,
> e.g. an em-dash: "DRAFTING AND DESIGN — SOPHOMORES"
>
> Though that doesn't completely make sense to me since type(cur['text'])
>   is 'unicode' and type('abbyy_text') is 'str',  and I never get a
> UnicodeWarning of unequal comparison (which I get if I start fiddling
> with the encoding of those strings) and both the data sources do really
> seem to be utf-8.
>
> Does anyone have any ideas why I might be seeing this, or care to
> enlighten me on what I may very well be completely misunderstanding?
>
> Thanks.
>
>
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> Archives: http://www.mail-archive.com/[email protected]/
> To unsubscribe from this mailing list, send email to 
> [email protected]
>
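
On the comparison in the quoted message: one possible explanation, which 
I can't confirm from here, is that one side is a byte string and the other 
is decoded text, or that the two sides use different Unicode normal forms. 
A minimal sketch of a comparison that rules both out is below. It is 
written for Python 3 (in Python 2 the byte-string type would be str), and 
the helper names as_text and same_text are hypothetical, not part of 
inside.py.

```python
import unicodedata


def as_text(value, encoding="utf-8"):
    """Return value as NFC-normalized text, decoding bytes if needed."""
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return unicodedata.normalize("NFC", value)


def same_text(a, b):
    """Compare two strings after decoding and normalizing both sides."""
    return as_text(a) == as_text(b)


decoded = "DRAFTING AND DESIGN \u2014 SOPHOMORES"
raw = decoded.encode("utf-8")  # the same text as UTF-8 bytes
print(same_text(decoded, raw))
# → True
```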

-- 
Karen Coyle
[email protected] http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
