Well, to cut to the chase, you just have to silently drop the bad output from the pdf. You could spend a lot of time and resources tracking down why things are broken, but it would be a never-ending task. For example, some typesetters have a tendency to replace 'fi' with a single ligature character, just so that the kerning of those letters looks good on the pdf. Of course, that is fairly meaningless to a pdftotext-type application, which is why pdf is the devil ;)
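As a rough sketch of what "silently drop the bad output" could look like, assuming you can get at the extracted text before it is wrapped in XML and handed to the indexer: the helper below first folds standard Unicode ligatures back into plain letters, then removes every character that XML 1.0 does not allow. The name clean_extracted_text is hypothetical and not part of GSearch, Muradora or Solr, and the ligature step only helps where the PDF used a standard code point such as U+FB01. Glyphs from custom fonts with no standard mapping will still come out as garbage.

```python
import re
import unicodedata

# Characters permitted by XML 1.0: tab, LF, CR, and the ranges listed below.
# Anything else (NUL, other control codes, lone surrogates, U+FFFE/U+FFFF)
# matches this pattern and gets removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: fold ligatures, then drop XML-invalid characters."""
    # NFKC maps standard ligature code points (e.g. U+FB01) to plain letters.
    # Note it also folds other compatibility characters, such as full-width
    # forms and superscript digits, which may or may not be what you want.
    text = unicodedata.normalize("NFKC", text)
    # Silently discard whatever XML 1.0 would reject.
    return _INVALID_XML_CHARS.sub("", text)
```

Running the extracted text through something like this before indexing trades a small loss of content for a document that at least parses, which is the 'garbage in, garbage out' compromise rather than a real fix.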
(And I've spotted many more than one way that these special kerned 'characters' get produced, some using custom fonts and, of course, not using a standard mapping. I've mostly given up on this, and have to accept the adage of 'garbage in, garbage out' - not great, but I don't have the resources to do better.)

Ben

-----Original Message-----
From: Richard Green [mailto:[EMAIL PROTECTED]
Sent: Wed 11/19/2008 11:09 AM
To: [email protected]; drama
Subject: [Fedora-commons-users] Indexing errors

Has anyone a solution to this one please?

Occasionally we have a pdf submitted to the repository (with no source file to fall back on) that contains 'strange' characters. An attempt to index it with GSearch would return an error (in previous versions of Muradora), and the current Solr indexing in Muradora does the same thing, for example:

<str name="indexErrors">
Error indexing file 'hull_9': ParseError at [row,col]:[862,16]
Message: Character reference "�" is an invalid XML character.
</str>
<str name="indexErrors">
Error indexing file 'hull_590': ParseError at [row,col]:[33,50]
Message: Character reference "" is an invalid XML character.
</str>

On the odd occasions that we feel the need to do a rebuild or re-index, this causes us real problems. We have just "lost" five objects from the repository because they no longer appear in the Solr indexes. Sure, we can get them back with a little messing about, but it is time-consuming.

Does anyone have a robust solution to this please?

Richard
___________________________________________________________________
Richard Green
Manager, RepoMMan, RIDIR and REMAP Projects
e-Services Integration Group
www.hull.ac.uk/esig/repomman
www.hull.ac.uk/ridir
www.hull.ac.uk/remap
http://edocs.hull.ac.uk
_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
