I've run across this problem too, not in this context but with another data/repository-type application. I came across this suggestion on the net:

"Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed."

I've seen quite a few recommendations for it. I wonder if it might be worth a try for your problem?

Peri Stracchino
Digital Library Team
University of York
ext 4082

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 19 November 2008 17:25
To: [email protected]
Subject: Fedora-commons-users Digest, Vol 21, Issue 15

Send Fedora-commons-users mailing list submissions to
[email protected]

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
or, via email, send a message with subject or body 'help' to
[EMAIL PROTECTED]

You can reach the person managing the list at
[EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific than "Re: Contents of Fedora-commons-users digest..."

Today's Topics:

1. Re: Ingesting "newborn" digital objects Vs only long-term-preservation ones (Filipe Correia)
2. Indexing errors (Richard Green)
3. Re: Indexing errors (Benjamin O'Steen)
4. Re: Indexing errors (Steve Bayliss)

----------------------------------------------------------------------

Message: 1
Date: Tue, 18 Nov 2008 19:53:36 +0000
From: "Filipe Correia" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Ingesting "newborn" digital objects Vs only long-term-preservation ones
To: "Uwe Klosa" <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1

I find that very interesting, as I was not aware there were other established certification requirements besides the one I mentioned, by the OCLC. I will search for that info, but would also appreciate any URL you may provide about those other certification efforts. I hope they may be better in sync with our particular scenario.

Thank you once again!

Best regards,
Filipe Correia

On Tue, Nov 18, 2008 at 6:50 PM, Uwe Klosa <[EMAIL PROTECTED]> wrote:
> We did have plans for a repository certification a few years ago. This is
> quite a dragging process and we do not have the manpower to carry it
> out. We did carry out a pre-study 2 years ago where we compared the German
> and Australian concepts and rules for certification of repository
> servers. After that, nothing has happened here in Sweden in this area as far as I am aware.
>
> Kind regards
> Uwe
>
> On Tue, Nov 18, 2008 at 7:37 PM, Filipe Correia <[EMAIL PROTECTED]> wrote:
>>
>> Hello Uwe,
>>
>> Thank you for your reply!
>> Your scenario is indeed one I can relate to. Do you have any plans
>> for repository certification, from a long-term preservation
>> perspective? The preservation checklist I've mentioned appears to me
>> as not being entirely compatible with the use of a repository in a
>> document production phase, rather only in a phase when all documents
>> are defined for "permanent" preservation.
>>
>> Thanks,
>> Filipe Correia
>>
>> On Sun, Nov 16, 2008 at 1:08 PM, Uwe Klosa <[EMAIL PROTECTED]> wrote:
>> > In DiVA we're using Fedora as a repository for two purposes.
>> > The first one is to archive publication metadata and files. The second
>> > one is to support a publication workflow for different types of
>> > publications. We mainly have a publication workflow for doctoral
>> > theses where the digital object is the original thesis. Furthermore
>> > the system is used by an Open Access journal where the digital
>> > objects are also the originals.
>> >
>> > I think there have been discussions about using Fedora behind a CMS. But
>> > I do not know if there is one out there.
>> >
>> > Regards
>> > Uwe Klosa
>> >
>> > On Sat, Nov 15, 2008 at 9:07 PM, Filipe Correia
>> > <[EMAIL PROTECTED]>
>> > wrote:
>> >>
>> >> Dear Fedora Commons community,
>> >>
>> >> We are currently studying the best approach for an institutional
>> >> repository using fedora, but are finding some difficulties.
>> >>
>> >> It's easy to find fedora case studies on the Web, but our scenario
>> >> is somewhat different from all the others we are finding, even
>> >> though we find it hard to believe we are the first with such a use case.
>> >>
>> >> Let me explain. We are trying to address two main concerns:
>> >> - Nowadays, all of our new documents are digital ones --- even if
>> >> they are paper documents, they are turned into a digital form when
>> >> they enter our institution. All of these documents are currently
>> >> stored on a very archaic repository, which doesn't provide us with
>> >> the access control we would like, doesn't properly handle unique
>> >> object identification, and doesn't really scale to a much larger
>> >> number of files.
>> >> - The long term preservation of these objects hasn't really been
>> >> thought out before, but we want to start taking it into account.
>> >>
>> >> So, our thoughts were to start using fedora, and ingesting digital
>> >> objects from the moment they appear in the organization.
>> >> Our document management system (which handles the document workflow)
>> >> would make use of the underlying fedora infrastructure by
>> >> maintaining references to the appropriate digital objects. Right
>> >> at the moment our digital objects were "born" they wouldn't have
>> >> much associated metadata, but it would grow as the object would be
>> >> further used, throughout the organization.
>> >>
>> >> At the same time, taking our long term preservation concerns into
>> >> account, we have been looking at the OAIS model and at this
>> >> repository certification checklist:
>> >> http://bibpurl.oclc.org/web/16712
>> >> This sounds to us like very wise advice but, at the same time, raises some
>> >> doubts:
>> >> - For certification purposes, the repository would have to
>> >> maintain
>> >> *only* long-term-preservation objects, but if we ingest "newborn"
>> >> objects, that will not be the case. Only later will the objects be
>> >> evaluated and considered worthy of long-term preservation, or not
>> >> --- in which case they can be discarded.
>> >> - For our day to day administrative activities it makes perfect
>> >> sense to use a repository from the moment the document is
>> >> created... does this mean we will have to have a second
>> >> repository to which to copy the digital objects when the time
>> >> comes? (this seems silly to us... Is it?)
>> >>
>> >>
>> >> If someone has faced a similar scenario, we would really love to
>> >> hear your take on this matter. It seems to us that repository
>> >> models are being thought out mainly for end-of-the-line archiving
>> >> (that is, when the objects have served their main administrative
>> >> purposes), but we think it could be very useful to use them sooner in the document life-cycle.
>> >>
>> >>
>> >> Thank you in advance!
>> >>
>> >> Regards,
>> >> Filipe Correia
>> >>
>> >>
>> >> ------------------------------------------------------------------
>> >> ------- This SF.Net email is sponsored by the Moblin Your Move
>> >> Developer's challenge Build the coolest Linux based applications
>> >> with Moblin SDK & win great prizes Grand prize is a trip for two
>> >> to an Open Source event anywhere in the world
>> >> http://moblin-contest.org/redirect.php?banner_id=100&url=/
>> >> _______________________________________________
>> >> Fedora-commons-users mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>> >
>> >
>
>

------------------------------

Message: 2
Date: Wed, 19 Nov 2008 11:09:00 -0000
From: "Richard Green" <[EMAIL PROTECTED]>
Subject: [Fedora-commons-users] Indexing errors
To: <[email protected]>, "drama" <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

Has anyone a solution to this one please?

Occasionally we have a pdf submitted to the repository (with no source file to fall back on) that contains 'strange' characters. An attempt to index it with GSearch would return an error (in previous versions of Muradora) and the current Solr indexing in Muradora does the same thing, thus for example:

<str name="indexErrors">
Error indexing file 'hull_9':
ParseError at [row,col]:[862,16]
Message: Character reference "�" is an invalid XML character.
</str>
<str name="indexErrors">
Error indexing file 'hull_590':
ParseError at [row,col]:[33,50]
Message: Character reference "" is an invalid XML character.
</str>

On the odd occasions that we feel the need to do a rebuild or re-index this causes us real problems. We have just "lost" five objects from the repository because they no longer appear in the Solr indexes. Sure, we can get them back with a little messing about but it is time-consuming.

Does anyone have a robust solution to this please?
Richard
___________________________________________________________________

Richard Green
Manager, RepoMMan, RIDIR and REMAP Projects
e-Services Integration Group

www.hull.ac.uk/esig/repomman
www.hull.ac.uk/ridir
www.hull.ac.uk/remap

http://edocs.hull.ac.uk

------------------------------

Message: 3
Date: Wed, 19 Nov 2008 14:55:39 -0000
From: "Benjamin O'Steen" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Indexing errors
To: "Richard Green" <[EMAIL PROTECTED]>, <[email protected]>, "drama" <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="iso-8859-1"

Well, to cut to the chase, you just have to silently drop the bad output from the pdf. You could attempt to spend a lot of time and resources tracking down why things are broken, but it would be a never-ending task.

For example, some type-setters have a tendency to replace fi with a single character, just so that on the pdf, the kerning of those letters looks good. Of course, it is fairly meaningless to a pdftotxt type application, which is why pdf is the devil ;)

(And I've spotted many more than one way that the special kerned 'characters' are produced, some using custom fonts, not using a standard mapping of course. I've mostly given up on this, and have to accept the adage of 'garbage in, garbage out' - not great, but I don't have the resources to do better.)

Ben

------------------------------

Message: 4
Date: Wed, 19 Nov 2008 17:24:47 -0000
From: "Steve Bayliss" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Indexing errors
To: "'Benjamin O'Steen'" <[EMAIL PROTECTED]>, "'Richard Green'" <[EMAIL PROTECTED]>, <[email protected]>, "'drama'" <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="UTF-8"

For the case of ligatures such as fi -> ﬁ, fl -> ﬂ, ae -> æ etc I'd suggest a better solution would be for GSearch/Solr to be able to cope with them. Though Richard's examples are not ligatures.
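
As a rough illustration of what "coping with" ligatures could mean before text reaches the index, here is a minimal Python sketch. It is not GSearch or Solr code, and fold_ligatures is a hypothetical helper name; it assumes the extracted text uses the standard Unicode ligature code points (for example U+FB01 and U+FB02) rather than a custom font mapping, in which case this approach will not help.

```python
import unicodedata


def fold_ligatures(text):
    """Fold compatibility characters such as U+FB01 back to plain letters.

    Hypothetical helper for illustration only. NFKC normalization replaces
    every character that has a compatibility decomposition, not just
    ligatures, so it also rewrites things like full-width forms.
    """
    return unicodedata.normalize("NFKC", text)


# U+FB01 is the "fi" ligature, U+FB02 is the "fl" ligature.
print(fold_ligatures("\ufb01nd the \ufb02ow"))
```

Note that a character such as æ (U+00E6) is an ordinary letter with no compatibility decomposition, so it passes through unchanged and would need an explicit mapping if folding to "ae" is wanted.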

Although ligatures are no longer as common as they used to be (mainly due to typesetting being put into the hands of the masses with the rise of personal computers and word processing), PDF is not the only place where they are likely to crop up, e.g. TeX output will substitute ligatures automatically, and output from DTP packages may also contain them - so as the scope of archived material broadens there will be more of a need to deal with these cases.

There are valid Unicode characters for these, though of course cases where custom fonts or other non-standard techniques are used make the problem a lot more difficult. And availability of ligatures varies between fonts. (I agree that pdf is the devil, but for many other reasons!)

Solr can be configured to use various filters to cope with various weird and wonderful characters - this blog post
http://blog.tremend.ro/2007/08/28/create-a-solr-filter-that-replace-diacritics/
seems to suggest you can implement your own custom filters, though there may be ready-made Solr filters that you could use.

Richard's examples however are not ligatures, they are Unicode NUL and VT (vertical tab) - and the issue here is that these characters (escaped) are not valid XML. The only characters below 20h (32) that are valid in XML are 09h (9), 0Ah (10) and 0Dh (13) - tab/lf/cr. There's a post about it here:
http://www.nabble.com/invalid-XML-character-td15781177.html

If there's any place you can hook in before the XML is parsed, you could try and remove all the escaped characters that are not valid in XML at this point - there's a reference here that lists what characters are valid in XML:
http://www.w3.org/TR/REC-xml/#charsets
Alternatively the code for whatever is producing the XML needs modifying to drop characters that are not valid in XML (ie to produce only valid XML!).
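
The pre-parse clean-up described above might look something like the following Python sketch. It is only an illustration under stated assumptions: clean_xml_text and is_valid_xml_char are hypothetical names, not part of GSearch, Solr or Muradora, and the function works on the serialized XML as a plain string, so it would also strip text that merely looks like a character reference inside a CDATA section. The ranges are the ones given by the Char production in the XML 1.0 reference linked above.

```python
import re


def is_valid_xml_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Matches numeric character references such as &#0; or &#xB;
_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def _drop_bad_ref(match):
    """Keep a reference only if it points at a valid XML character."""
    body = match.group(1)
    cp = int(body[1:], 16) if body.startswith("x") else int(body)
    return match.group(0) if is_valid_xml_char(cp) else ""


def clean_xml_text(text):
    """Remove invalid characters, both escaped references and raw ones."""
    text = _CHAR_REF.sub(_drop_bad_ref, text)
    return "".join(ch for ch in text if is_valid_xml_char(ord(ch)))
```

For example, clean_xml_text("a&#0;b&#11;c") drops both references (NUL and VT) and leaves "abc", while a reference to a valid character such as &#x9; is kept as it is.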

Steve

------------------------------

End of Fedora-commons-users Digest, Vol 21, Issue 15
****************************************************
