I'm playing around with the 'search inside' feature of the BookReader, and am 
trying to understand the inside.py code.   

I've grabbed a sample set of data from the Internet Archive 
(https://ia600206.us.archive.org/19/items/yearbook1981east/) and have loaded 
the .djvu.txt file into a single field in Solr, and am using the .abbyy.gz file 
as the file to parse for word coordinates. Querying Solr and finding a match 
works great, but I never find a match in the abbyy.gz file.

What appears to be happening, to my novice eye that is still trying to grasp 
Python generators, is that the line:

        if re_braces.sub('', cur['text']) != abbyy_text:

always returns false if the abbyy_text has a non-ASCII (Unicode) character in 
it, e.g. an em-dash: "DRAFTING AND DESIGN — SOPHOMORES"

Though that doesn't completely make sense to me, since type(cur['text']) is 
'unicode' and type(abbyy_text) is 'str', and I never get a UnicodeWarning about 
an unequal comparison (which I do get if I start fiddling with the encoding of 
those strings). Both data sources really do seem to be UTF-8.
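
To make the comparison I have in mind concrete, here is a minimal sketch of 
comparing the two values only after both are text rather than bytes. It is not 
the code from inside.py: the re_braces pattern is a hypothetical stand-in 
(I'm assuming it strips curly braces), and to_text and same_text are helper 
names made up for illustration. It also assumes both sources are UTF-8.

```python
import re
import unicodedata

# Hypothetical stand-in for re_braces in inside.py; the real pattern may differ.
re_braces = re.compile(u'[{}]')


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings (assumed UTF-8) first."""
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # Normalize so composed and decomposed forms of a character compare equal.
    return unicodedata.normalize('NFC', value)


def same_text(solr_text, abbyy_text):
    """Compare the Solr text (braces stripped) with the ABBYY text as text."""
    return re_braces.sub(u'', to_text(solr_text)) == to_text(abbyy_text)


solr_value = u'DRAFTING AND DESIGN \u2014 {{{SOPHOMORES}}}'
abbyy_value = u'DRAFTING AND DESIGN \u2014 SOPHOMORES'.encode('utf-8')
print(same_text(solr_value, abbyy_value))
```

If a check like this matches where the original line does not, that would 
point at the str/unicode mix rather than at the data itself.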

Does anyone have any ideas why I might be seeing this, or care to enlighten me 
on what I may very well be completely misunderstanding?

Thanks. 
