Well, to cut to the chase, you just have to silently drop the bad output
from the pdf. You could spend a lot of time and resources tracking down
why things are broken, but it would be a never-ending task. For example,
some typesetters have a tendency to replace 'fi' with a single ligature
character, just so that the kerning of those letters looks good in the
pdf. Of course, that is fairly meaningless to a pdf-to-text type
application, which is why pdf is the devil ;)
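
A minimal sketch of the "silently drop" step, assuming the extracted text
is available as a Python string before it is handed to the indexer. The
function name drop_invalid_xml_chars is hypothetical; the character ranges
are those of the XML 1.0 "Char" production.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def drop_invalid_xml_chars(text: str) -> str:
    """Return text with every character that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # NUL (0x00) and vertical tab (0x0B) are the two characters reported
    # in the indexing errors; tab and newline are legal and are kept.
    print(repr(drop_invalid_xml_chars("bad\x00 text\x0b here\n")))
```

This only removes characters that XML cannot carry at all; it does not try
to repair text that is legal XML but still meaningless.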

(And I've spotted many more than one way of encoding these specially
kerned 'characters', some using custom fonts, and not using a standard
mapping of course. I've mostly given up on this, and have to accept the
adage of 'garbage in, garbage out' - not great, but I don't have the
resources to do better.)
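
For the cases where a ligature does arrive as a standard Unicode
character (for example U+FB01 for 'fi'), a sketch like the following can
expand it back into plain letters. It assumes Unicode NFKC normalization
is acceptable for the indexed text; NFKC also folds other compatibility
characters (superscripts, full-width forms and so on), and it does nothing
for glyphs from custom fonts with no standard mapping. The function name
expand_ligatures is hypothetical.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand standard Unicode ligatures via NFKC compatibility folding."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # U+FB01 is the 'fi' ligature, U+FB03 is the 'ffi' ligature.
    print(expand_ligatures("\ufb01nd the o\ufb03ce"))
```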

Ben

-----Original Message-----
From: Richard Green [mailto:[EMAIL PROTECTED]
Sent: Wed 11/19/2008 11:09 AM
To: [email protected]; drama
Subject: [Fedora-commons-users] Indexing errors
 
Has anyone a solution to this one please?

Occasionally we have a pdf submitted to the repository (with no source
file to fall back on) that contains 'strange' characters.  An attempt to
index it with GSearch (in previous versions of Muradora) would return an
error, and the current Solr indexing in Muradora does the same thing,
for example:

 

<str name="indexErrors">
Error indexing file 'hull_9': ParseError at [row,col]:[862,16]
Message: Character reference "&#0" is an invalid XML character.
</str>

<str name="indexErrors">
Error indexing file 'hull_590': ParseError at [row,col]:[33,50]
Message: Character reference "&#11" is an invalid XML character.
</str>
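
Both errors above concern numeric character references rather than raw
characters. As a rough sketch, assuming the intermediate XML is available
as a string before it reaches the parser, references to code points that
XML 1.0 forbids could be removed with a regular expression. The function
names are hypothetical, and the sketch treats the input as flat text (it
does not special-case CDATA sections or comments).

```python
import re

# Decimal (&#11;) or hexadecimal (&#xB;) numeric character reference.
_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def _is_valid_xml_codepoint(cp: int) -> bool:
    """True if cp falls inside the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def drop_invalid_char_refs(xml_text: str) -> str:
    """Remove numeric character references to code points XML 1.0 forbids."""

    def _replace(match: re.Match) -> str:
        body = match.group(1)
        cp = int(body[1:], 16) if body.startswith("x") else int(body)
        # Keep valid references untouched; drop invalid ones entirely.
        return match.group(0) if _is_valid_xml_codepoint(cp) else ""

    return _CHAR_REF.sub(_replace, xml_text)


if __name__ == "__main__":
    print(drop_invalid_char_refs("<field>a&#0;b&#11;c&#65;</field>"))
```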

 

On the odd occasions that we feel the need to do a rebuild or re-index,
this causes us real problems.  We have just "lost" five objects from the
repository because they no longer appear in the Solr indexes.  Sure, we
can get them back with a little messing about, but it is time-consuming.
Does anyone have a robust solution to this please?

 

Richard

 

___________________________________________________________________

 

Richard Green
Manager, RepoMMan, RIDIR and REMAP Projects
e-Services Integration Group

www.hull.ac.uk/esig/repomman
www.hull.ac.uk/ridir
www.hull.ac.uk/remap
http://edocs.hull.ac.uk

 



_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
