For the case of ligatures such as ﬁ -> fi, ﬂ -> fl, ae -> æ, etc., I'd suggest a 
better solution would be for GSearch/Solr to be able to cope with them.  (Though 
Richard's examples are not ligatures.)

Although no longer as common as they used to be (mainly because typesetting has 
been put into the hands of the masses by the rise of personal computers and 
word processing), PDF is not the only place where ligatures are likely to crop 
up: TeX output, for example, will substitute ligatures automatically, and output 
from DTP packages may also contain them.  So as the scope of archived material 
broadens there will be more of a need to deal with these cases.

There are valid Unicode characters for these, though cases where custom fonts 
or other non-standard techniques are used make the problem a lot more 
difficult.  And the availability of ligatures varies between fonts.

(I agree that PDF is the devil, though, for many other reasons!)

Solr can be configured to use filters to cope with all sorts of weird and 
wonderful characters - this blog post 
http://blog.tremend.ro/2007/08/28/create-a-solr-filter-that-replace-diacritics/ 
suggests you can implement your own custom filters, though there may be 
ready-made Solr filters that you could use.
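
If nothing ready-made fits, the heart of such a filter is really just a small 
mapping from ligature characters to their plain-letter equivalents.  Very 
roughly - and just as an illustration, the class name and mapping table below 
are mine, not an existing GSearch/Solr component - something along these lines 
would do, either inside a custom filter or in whatever pre-processes the text 
before indexing:

import java.util.HashMap;
import java.util.Map;

public final class LigatureFolder {

    // Common Unicode ligature characters and their plain-letter expansions.
    private static final Map<Character, String> LIGATURES =
            new HashMap<Character, String>();
    static {
        LIGATURES.put('\uFB00', "ff");   // ff ligature
        LIGATURES.put('\uFB01', "fi");   // fi ligature
        LIGATURES.put('\uFB02', "fl");   // fl ligature
        LIGATURES.put('\uFB03', "ffi");  // ffi ligature
        LIGATURES.put('\uFB04', "ffl");  // ffl ligature
        LIGATURES.put('\u00E6', "ae");   // ae ligature (æ)
        LIGATURES.put('\u0153', "oe");   // oe ligature (œ)
    }

    // Expand any known ligature characters; everything else passes through.
    public static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            String expansion = LIGATURES.get(text.charAt(i));
            if (expansion != null) {
                out.append(expansion);
            } else {
                out.append(text.charAt(i));
            }
        }
        return out.toString();
    }
}

Whether you wrap that in a custom Solr filter (as the blog post describes) or 
apply it before the text ever reaches the index is really just a question of 
where it's easiest for you to hook in.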

Richard's examples below, however, are not ligatures; they are Unicode NUL and 
VT (vertical tab) - and the issue here is that these characters (even escaped) 
are not valid XML.  The only characters below 20h (32) that are valid in XML 
are 09h (9), 0Ah (10) and 0Dh (13) - tab, LF and CR.

There's a post about it here: 
http://www.nabble.com/invalid-XML-character-td15781177.html

If there's any place you can hook in before the XML is parsed, you could try to 
remove all the (escaped) characters that are not valid in XML at that point - 
there's a reference here that lists which characters are valid in XML: 
http://www.w3.org/TR/REC-xml/#charsets.  Alternatively, the code for whatever is 
producing the XML needs modifying to drop characters that are not valid in XML 
(i.e. to produce only valid XML!).
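
As a rough sketch of what that hook could look like (again, the class and 
method names are made up, not part of GSearch or Muradora, and the regular 
expression only covers the invalid references below 20h):

public final class XmlCharCleaner {

    // True if the code point is legal in XML 1.0, per
    // http://www.w3.org/TR/REC-xml/#charsets:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drop raw characters that an XML parser would reject - use this on the
    // extracted text before it is turned into XML.
    static String stripInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // If the bad characters arrive already escaped (e.g. "&#0;" or "&#11;"),
    // they have to be stripped from the serialised XML instead.
    static String stripInvalidCharRefs(String xml) {
        return xml.replaceAll(
            "&#(?:x0*(?:[0-8BCEFbcef]|1[0-9A-Fa-f])|0*(?:[0-8]|1[124-9]|2[0-9]|3[01]));",
            "");
    }
}

Which of the two methods you need depends on whether you can get at the text 
before, or only after, it has been serialised as XML.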

Steve

-----Original Message-----
From: Benjamin O'Steen [mailto:[EMAIL PROTECTED]
Sent: 19 November 2008 14:56
To: Richard Green; [email protected]; drama
Subject: Re: [Fedora-commons-users] Indexing errors



Well, to cut to the chase, you just have to silently drop the bad output from 
the pdf. You could attempt to spend a lot of time and resources tracking down 
why things are broken, but it would be a never-ending task. For example, some 
typesetters have a tendency to replace fi with a single character, just so 
that on the pdf the kerning of those letters looks good. Of course, it is 
fairly meaningless to a pdftotext-type application, which is why pdf is the 
devil ;)

(And I've spotted more than one way of encoding these special kerned 
'characters', some using custom fonts and, of course, no standard mapping. I've 
mostly given up on this and have to accept the adage of 'garbage in, garbage 
out' - not great, but I don't have the resources to do better.)

Ben

-----Original Message-----
From: Richard Green [mailto:[EMAIL PROTECTED]
Sent: Wed 11/19/2008 11:09 AM
To: [email protected]; drama
Subject: [Fedora-commons-users] Indexing errors

Has anyone a solution to this one please?



Occasionally we have a pdf submitted to the repository (with no source file to 
fall back on) that contains 'strange' characters.  An attempt to index it with 
GSearch would return an error (in previous versions of Muradora), and the 
current Solr indexing in Muradora does the same thing, for example:



<str name="indexErrors">

Error indexing file 'hull_9': ParseError at [row,col]:[862,16]

Message: Character reference "&#0" is an invalid XML character.

</str>



<str name="indexErrors">

Error indexing file 'hull_590': ParseError at [row,col]:[33,50]

Message: Character reference "&#11" is an invalid XML character.

</str>



On the odd occasions that we feel the need to do a rebuild or re-index, this 
causes us real problems.  We have just "lost" five objects from the repository 
because they no longer appear in the Solr indexes.  Sure, we can get them back 
with a little messing about, but it is time-consuming. Does anyone have a robust 
solution to this please?



Richard



___________________________________________________________________



Richard Green

Manager, RepoMMan, RIDIR and REMAP Projects

e-Services Integration Group



www.hull.ac.uk/esig/repomman

www.hull.ac.uk/ridir

www.hull.ac.uk/remap

http://edocs.hull.ac.uk






