For the case of true ligatures such as ﬁ -> fi, ﬂ -> fl, æ -> ae etc., I'd suggest a better solution would be for GSearch/Solr to be able to cope with them - though Richard's examples below are not ligatures.
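One way Solr (or any indexer) could cope with the true-ligature case is Unicode compatibility normalization: NFKC folds ligature code points such as U+FB01 (ﬁ) back to their plain letter sequences before indexing. The sketch below is an illustration of that idea using the standard java.text.Normalizer class, not anything in GSearch/Solr itself; the class and method names are my own. Note that æ (U+00E6) is a letter in its own right and is not folded by NFKC, which is one reason a custom filter may still be needed.

```java
import java.text.Normalizer;

public class LigatureFold {

    // NFKC applies the compatibility decompositions defined by Unicode,
    // which map ligature code points (U+FB01 'fi', U+FB02 'fl', etc.)
    // to their constituent letters, then recompose canonically.
    static String fold(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(fold("con\uFB01gure \uFB02ight")); // configure flight
        // U+00E6 has no compatibility decomposition, so it passes through:
        System.out.println(fold("\u00E6on")); // æon
    }
}
```

Solr also ships a MappingCharFilterFactory that can apply this kind of character-to-string mapping from a configuration file, which may be one of the ready-made options worth checking before writing a custom filter.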
Although no longer as common as they used to be (mainly because typesetting has been put into the hands of the masses with the rise of personal computers and word processing), PDF is not the only place where ligatures are likely to crop up: TeX output substitutes ligatures automatically, and output from DTP packages may also contain them. So as the scope of archived material broadens, there will be more need to deal with these cases. There are valid Unicode characters for the common ligatures, though cases where custom fonts or other non-standard techniques are used make the problem a lot more difficult, and the availability of ligatures varies between fonts. (I agree that PDF is the devil, but for many other reasons!)

Solr can be configured with various filters to cope with weird and wonderful characters - this blog post http://blog.tremend.ro/2007/08/28/create-a-solr-filter-that-replace-diacritics/ suggests you can implement your own custom filters, though there may be ready-made Solr filters that you could use.

Richard's examples below, however, are not ligatures: they are Unicode NUL and VT (vertical tab), and the issue there is that these characters (escaped as character references) are not valid XML. The only characters below 20h (32) that are valid in XML are 09h (9), 0Ah (10) and 0Dh (13) - tab, LF and CR. There's a post about it here: http://www.nabble.com/invalid-XML-character-td15781177.html

If there's any place you can hook in before the XML is parsed, you could try to remove all the escaped characters that are not valid in XML at that point - the XML specification lists exactly which characters are valid: http://www.w3.org/TR/REC-xml/#charsets. Alternatively, the code for whatever is producing the XML needs modifying to drop characters that are not valid in XML (i.e. to produce only valid XML!).
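As a sketch of the "hook in before the XML is parsed" idea: the snippet below is an illustration, not Muradora/GSearch code (the class name, method names and regex are my own). It drops numeric character references - decimal like &#0; or hex like &#x0B; - whose code points fall outside the XML 1.0 Char production linked above, and also offers a pass for raw control characters in case they appear unescaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlCharStripper {

    // Matches numeric character references, decimal (&#0;) or hex (&#x0B;).
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Remove numeric character references that point at invalid code points,
    // leaving valid references (and everything else) untouched.
    static String stripInvalidRefs(String input) {
        Matcher m = CHAR_REF.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String num = m.group(1);
            int cp = num.startsWith("x")
                ? Integer.parseInt(num.substring(1), 16)
                : Integer.parseInt(num);
            m.appendReplacement(sb,
                isValidXmlChar(cp) ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Drop raw (unescaped) characters that are invalid in XML.
    static String stripInvalidChars(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharStripper::isValidXmlChar)
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidRefs("hull&#0;_9 and hull&#x0B;_590"));
        // -> hull_9 and hull_590
    }
}
```

Run over the extracted text before it is handed to the XML parser, this would have silently dropped the NUL and VT references in Richard's two error messages while leaving legitimate references such as &#10; alone.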
Steve

-----Original Message-----
From: Benjamin O'Steen [mailto:[EMAIL PROTECTED]
Sent: 19 November 2008 14:56
To: Richard Green; [email protected]; drama
Subject: Re: [Fedora-commons-users] Indexing errors

Well, to cut to the chase, you just have to silently drop the bad output from the PDF. You could attempt to spend a lot of time and resources tracking down why things are broken, but it would be a never-ending task. For example, some typesetters have a tendency to replace 'fi' with a single character, just so that the kerning of those letters looks good on the PDF. Of course, that character is fairly meaningless to a pdftotxt-type application, which is why PDF is the devil ;) (And I've spotted more than one way of producing these specially kerned 'characters', some using custom fonts, and of course not using a standard mapping. I've mostly given up on this, and have to accept the adage of 'garbage in, garbage out' - not great, but I don't have the resources to do better.)

Ben

-----Original Message-----
From: Richard Green [mailto:[EMAIL PROTECTED]
Sent: Wed 11/19/2008 11:09 AM
To: [email protected]; drama
Subject: [Fedora-commons-users] Indexing errors

Has anyone a solution to this one, please? Occasionally we have a PDF submitted to the repository (with no source file to fall back on) that contains 'strange' characters. An attempt to index it with GSearch would return an error (in previous versions of Muradora), and the current Solr indexing in Muradora does the same thing, thus for example:

<str name="indexErrors"> Error indexing file 'hull_9': ParseError at [row,col]:[862,16] Message: Character reference "&#0;" is an invalid XML character. </str>

<str name="indexErrors"> Error indexing file 'hull_590': ParseError at [row,col]:[33,50] Message: Character reference "&#11;" is an invalid XML character. </str>

On the odd occasions that we feel the need to do a rebuild or re-index, this causes us real problems.
We have just "lost" five objects from the repository because they no longer appear in the Solr indexes. Sure, we can get them back with a little messing about, but it is time-consuming. Does anyone have a robust solution to this, please?

Richard

___________________________________________________________________
Richard Green
Manager, RepoMMan, RIDIR and REMAP Projects
e-Services Integration Group
www.hull.ac.uk/esig/repomman
www.hull.ac.uk/ridir
www.hull.ac.uk/remap
http://edocs.hull.ac.uk

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
