I'm playing around with the 'search inside' feature of the BookReader, and am trying to understand the inside.py code.
I've grabbed a sample set of data from the Internet Archive (https://ia600206.us.archive.org/19/items/yearbook1981east/) and loaded the .djvu.txt file into a single field in Solr. I'm using the .abbyy.gz file as the file to parse for word coordinates. Querying Solr and finding a match works great, but I never find a match in the abbyy.gz file.

What appears to be happening, to my novice eye that is still trying to grasp Python generators, is that the check

    if re_braces.sub('', cur['text']) != abbyy_text:

never sees the two strings as equal when abbyy_text has a non-ASCII character in it, e.g. the em-dash in "DRAFTING AND DESIGN — SOPHOMORES".

That doesn't completely make sense to me, though. type(cur['text']) is 'unicode' and type(abbyy_text) is 'str', yet I never get a UnicodeWarning about an unequal comparison (which I do get if I start fiddling with the encoding of those strings), and both data sources really do seem to be UTF-8.

Does anyone have any ideas why I might be seeing this, or care to enlighten me on what I may very well be completely misunderstanding? Thanks.
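For reference, here is a minimal sketch of the comparison with both values brought to Unicode before they are compared. It assumes the ABBYY text arrives as UTF-8 encoded bytes (a Python 2 'str') and the Solr text as 'unicode'. The helper names to_text and same_text are hypothetical and are not part of inside.py. The normalization step is also an assumption, added in case the two sources encode accented characters differently.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_text(value, encoding='utf-8'):
    """Return value as a normalized Unicode string (hypothetical helper)."""
    # In Python 2, `bytes` is an alias for `str`, so this branch
    # catches the byte string parsed from the abbyy file.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # NFC makes composed and decomposed forms of a character compare equal.
    return unicodedata.normalize('NFC', value)


def same_text(a, b):
    """Compare two values as Unicode text, whatever type they arrived in."""
    return to_text(a) == to_text(b)


# Stand-ins for the two values in the question (assumed, not real data).
solr_text = u'DRAFTING AND DESIGN \u2014 SOPHOMORES'   # unicode
abbyy_text = solr_text.encode('utf-8')                 # UTF-8 bytes

print(repr(solr_text))
print(repr(abbyy_text))
print(same_text(solr_text, abbyy_text))  # True
```

Printing repr() of both sides, as above, is also a quick way to see whether the two strings really contain the same code points or only look the same on screen.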
