I've run across this problem too, not in this context but with another
data/repository type application. I came across this suggestion on the net:

"Atlassian made a command line XML cleaner that may suit your needs (it was
made mainly for JIRA but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your
computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid
characters removed.
"

I've seen quite a few recommendations for it. I wonder if it might be worth a
try for your problem?
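If you'd rather not depend on the jar, the same idea is only a few lines of
Python. This is my own rough sketch based on the XML 1.0 character ranges
(http://www.w3.org/TR/REC-xml/#charsets), not Atlassian's code, so treat it
as illustrative:

```python
import re

# Characters permitted by XML 1.0: tab, LF, CR, and everything from
# U+0020 up, minus the surrogate range and U+FFFE/U+FFFF.
# Anything outside this set gets stripped.
INVALID_XML = re.compile(
    '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def clean(text):
    """Return text with characters that are invalid in XML 1.0 removed."""
    return INVALID_XML.sub('', text)
```

e.g. read data.xml (with errors='replace' so a bad byte doesn't abort the
run) and write clean(...) out to data-clean.xml.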


Peri Stracchino
Digital Library Team
University of York
ext 4082 
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] 
Sent: 19 November 2008 17:25
To: [email protected]
Subject: Fedora-commons-users Digest, Vol 21, Issue 15

Send Fedora-commons-users mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
or, via email, send a message with subject or body 'help' to
        [EMAIL PROTECTED]

You can reach the person managing the list at
        [EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific than
"Re: Contents of Fedora-commons-users digest..."


Today's Topics:

   1. Re: Ingesting "newborn" digital objects Vs only
      long-term-preservation ones (Filipe Correia)
   2. Indexing errors (Richard Green)
   3. Re: Indexing errors (Benjamin O'Steen)
   4. Re: Indexing errors (Steve Bayliss)


----------------------------------------------------------------------

Message: 1
Date: Tue, 18 Nov 2008 19:53:36 +0000
From: "Filipe Correia" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Ingesting "newborn" digital
        objects Vs only long-term-preservation ones
To: "Uwe Klosa" <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1

I find that very interesting, as I was not aware there were other
established certification requirements besides the one I mentioned, by the
OCLC.

I will search for that info, but would also appreciate any URL you can
provide about those other certification efforts. I hope they are better in
sync with our particular scenario.


Thank you once again!

Best regards,
Filipe Correia


On Tue, Nov 18, 2008 at 6:50 PM, Uwe Klosa <[EMAIL PROTECTED]> wrote:
> We did have plans for a repository certification a few years ago. This is 
> quite a dragging process and we do not have the manpower to carry it 
> out. We did carry out a pre-study 2 years ago where we compared the German 
> and Australian concepts and rules for certification of repository 
> servers. After that, nothing has happened here in Sweden in this area, as
> far as I am aware.
>
> Kind regards
> Uwe
>
> On Tue, Nov 18, 2008 at 7:37 PM, Filipe Correia <[EMAIL PROTECTED]>
wrote:
>>
>> Hello Uwe,
>>
>> Thank you for your reply!
>> Your scenario is indeed one I can relate to. Do you have any plans 
>> for repository certification, from a long-term preservation 
>> perspective? The preservation checklist I've mentioned appears to me 
>> not to be entirely compatible with using a repository during a 
>> document-production phase, rather than only in a phase when all documents 
>> are defined for "permanent" preservation.
>>
>> Thanks,
>> Filipe Correia
>>
>> On Sun, Nov 16, 2008 at 1:08 PM, Uwe Klosa <[EMAIL PROTECTED]> wrote:
>> > In DiVA we're using Fedora as a repository for two purposes. The 
>> > first one is to archive publication metadata and files. The second 
>> > one is to support a publication workflow for different types of 
>> > publications. We mainly have a publication workflow for doctoral 
>> > theses where the digital object is the original thesis. Furthermore, 
>> > the system is used by an Open Access journal where the digital 
>> > objects are also the originals.
>> >
>> > I think there have been discussions to use Fedora behind a CMS. But 
>> > I do not know if there is one out there.
>> >
>> > Regards
>> > Uwe Klosa
>> >
>> > On Sat, Nov 15, 2008 at 9:07 PM, Filipe Correia 
>> > <[EMAIL PROTECTED]>
>> > wrote:
>> >>
>> >> Dear Fedora Commons community,
>> >>
>> >> We are currently studying the best approach for an institutional 
>> >> repository using fedora, but are finding some difficulties.
>> >>
>> >> It's easy to find fedora case studies on the Web, but our scenario 
>> >> is somewhat different from all the others we are finding, even 
>> >> though we find it hard to believe we are the first with such a use
>> >> case.
>> >>
>> >> Let me explain. We are trying to address two main concerns:
>> >>  - Nowadays, all of our new documents are digital ones --- even if 
>> >> they are paper documents, they are turned into a digital form when 
>> >> they enter our institution. All of these documents are currently 
>> >> stored in a very archaic repository, which doesn't provide us with 
>> >> the access control we would like, doesn't properly handle unique 
>> >> object identification, and doesn't really scale to a much larger 
>> >> number of files.
>> >>  - The long term preservation of these objects hasn't really been 
>> >> thought out before, but we want to start taking it into account.
>> >>
>> >> So, our thoughts were to start using fedora, and ingesting digital 
>> >> objects from the moment they appear on the organization. Our 
>> >> document management system (which handles the document workflow) 
>> >> would make use of the underlying fedora infrastructure by 
>> >> maintaining references to the appropriate digital objects. Right 
>> >> on the moment our digital objects were "born" they wouldn't have 
>> >> much associated metadata, but it would grow as the object would be 
>> >> further used, throughout the organization.
>> >>
>> >> At the same time, taking our long term preservation concerns into 
>> >> account, we have been looking at the OAIS model and at this 
>> >> repository certification checklist: 
>> >> http://bibpurl.oclc.org/web/16712 This sounds to us like very wise 
>> >> advice but, at the same time, poses some
>> >> doubts:
>> >>  - For certification purposes, the repository would have to 
>> >> maintain
>> >> *only* long-term-preservation objects, but if we ingest "newborn"
>> >> objects, that will not be the case. Only later will the objects be 
>> >> evaluated and considered worthy of long-term preservation, or not 
>> >> --- in which case they can be discarded.
>> >>  - For our day to day administrative activities it makes perfect 
>> >> sense to use a repository since the moment the document is 
>> >> created...  does this mean we will have to have a second 
>> >> repository to which to copy the digital objects when the time 
>> >> comes? (this seems silly to us... Is
>> >> it?)
>> >>
>> >>
>> >> If someone has faced a similar scenario, we would really love to 
>> >> hear your take on this matter. It seems to us that repository 
>> >> models are being thought out mainly for end-of-the-line archiving 
>> >> (that is, when they have served their main administrative 
>> >> purposes), but we think it could be very useful to use them sooner in
>> >> the document life-cycle.
>> >>
>> >>
>> >> Thank you in advance!
>> >>
>> >> Regards,
>> >> Filipe Correia
>> >>
>> >>
>> >> ------------------------------------------------------------------
>> >> ------- This SF.Net email is sponsored by the Moblin Your Move 
>> >> Developer's challenge Build the coolest Linux based applications 
>> >> with Moblin SDK & win great prizes Grand prize is a trip for two 
>> >> to an Open Source event anywhere in the world 
>> >> http://moblin-contest.org/redirect.php?banner_id=100&url=/
>> >> _______________________________________________
>> >> Fedora-commons-users mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>> >
>> >
>
>



------------------------------

Message: 2
Date: Wed, 19 Nov 2008 11:09:00 -0000
From: "Richard Green" <[EMAIL PROTECTED]>
Subject: [Fedora-commons-users] Indexing errors
To: <[email protected]>,       "drama"
        <[EMAIL PROTECTED]>
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

Has anyone a solution to this one please?

 

Occasionally we have a pdf submitted to the repository (with no source file
to fall back on) that contains 'strange' characters.  An attempt to index it
with GSearch would return an error (in previous versions of
Muradora), and the current Solr indexing in Muradora does the same thing;
for example:

 

<str name="indexErrors">

Error indexing file 'hull_9': ParseError at [row,col]:[862,16]

Message: Character reference "&#0" is an invalid XML character.

</str>

 

<str name="indexErrors">

Error indexing file 'hull_590': ParseError at [row,col]:[33,50]

Message: Character reference "&#11" is an invalid XML character.

</str>

 

On the odd occasions that we feel the need to do a rebuild or re-index, this
causes us real problems.  We have just "lost" five objects from the
repository because they no longer appear in the Solr indexes.  Sure, we can
get them back with a little messing about but it is time-consuming.
Does anyone have a robust solution to this please?

 

Richard

 

___________________________________________________________________

 

Richard Green

Manager, RepoMMan, RIDIR and REMAP Projects

e-Services Integration Group

 

www.hull.ac.uk/esig/repomman

www.hull.ac.uk/ridir

www.hull.ac.uk/remap

http://edocs.hull.ac.uk

 


------------------------------

Message: 3
Date: Wed, 19 Nov 2008 14:55:39 -0000
From: "Benjamin O'Steen" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Indexing errors
To: "Richard Green" <[EMAIL PROTECTED]>,
        <[email protected]>,   "drama"
        <[EMAIL PROTECTED]>
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain;       charset="iso-8859-1"


Well, to cut to the chase, you just have to silently drop the bad output
from the pdf. You could attempt to spend a lot of time and resources
tracking down why things are broken, but it would be a never-ending task.
For example, some type-setters have a tendency to replace fi with a single
character, just so that in the pdf the kerning of those letters looks good.
Of course, it is fairly meaningless to a pdftotxt type application, which is
why pdf is the devil ;)

(And I've spotted more than one way of encoding these special kerned
'characters', some using custom fonts and not using a standard mapping, of
course. I've mostly given up on this, and have to accept the adage of
'garbage in, garbage out' - not great, but I don't have the resources to do
better.)

Ben

-----Original Message-----
From: Richard Green [mailto:[EMAIL PROTECTED]
Sent: Wed 11/19/2008 11:09 AM
To: [email protected]; drama
Subject: [Fedora-commons-users] Indexing errors
 
[Richard's message quoted in full; trimmed - see Message 2 above.]





------------------------------

Message: 4
Date: Wed, 19 Nov 2008 17:24:47 -0000
From: "Steve Bayliss" <[EMAIL PROTECTED]>
Subject: Re: [Fedora-commons-users] Indexing errors
To: "'Benjamin O'Steen'" <[EMAIL PROTECTED]>,   "'Richard
        Green'" <[EMAIL PROTECTED]>,
        <[email protected]>,   "'drama'"
        <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain;       charset="UTF-8"

For the case of ligatures such as fi -> ﬁ, fl -> ﬂ, ae -> æ etc I'd suggest
a better solution would be for GSearch/Solr to be able to cope with them.
Though Richard's examples are not ligatures.

Although no longer as common as they used to be (mainly due to typesetting
being put into the hands of the masses with the rise of personal computers
and word processing), PDF is not the only place where ligatures are likely
to crop up, eg TeX output will substitute ligatures automatically, and
output from DTP packages may also contain them - so as the scope of archived
material broadens there will be more of a need to deal with these cases.

There are valid Unicode characters for these, though of course cases where
custom fonts or other non-standard techniques are used make the problem a
lot more difficult.  And availability of ligatures varies between fonts.
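For the standard ligature codepoints at least (U+FB00-U+FB06), plain NFKC
normalisation will fold them back to their letter sequences. A quick Python
sketch - note that it deliberately leaves æ alone (that's a real letter, not
a compatibility ligature) and can't help with custom-font mappings:

```python
import unicodedata

def fold_ligatures(text):
    """NFKC compatibility normalisation decomposes the Unicode ligature
    codepoints (e.g. U+FB01 'fi', U+FB02 'fl') into plain letters."""
    return unicodedata.normalize('NFKC', text)
```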

(But I agree that pdf is the devil, but for many other reasons!).

Solr can be configured to use various filters to cope with various weird
and wonderful characters - this blog post
http://blog.tremend.ro/2007/08/28/create-a-solr-filter-that-replace-diacritics/
seems to suggest you can implement your own custom filters, though there
may be ready-made Solr filters that you could use.

Richard's examples below however are not ligatures; they are Unicode NUL and
VT (vertical tab) - and the issue here is that these characters (escaped)
are not valid XML.  The only characters below 20h (32) that are valid in XML
are 09h (9), 0Ah (10) and 0Dh (13) - tab/lf/cr.

There's a post about it here:
http://www.nabble.com/invalid-XML-character-td15781177.html

If there's any place you can hook in before the XML is parsed, you could try
and remove all the escaped characters that are not valid in XML at this
point - there's a reference here that lists what characters are valid in
XML: http://www.w3.org/TR/REC-xml/#charsets.  Alternatively the code for
whatever is producing the XML needs modifying to drop characters that are
not valid in XML (ie to produce only valid XML!).
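If you do find such a hook, something along these lines (a rough sketch,
not tested against GSearch itself, and assuming the references appear as
well-formed &#n; sequences) would drop both raw control characters and
escaped references like &#0; or &#11; before the parser sees them:

```python
import re

# Raw control characters that XML 1.0 forbids (everything below 0x20
# except tab 0x09, LF 0x0A and CR 0x0D).
RAW_CTRL = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]')
# Any numeric character reference, decimal or hex.
CHAR_REF = re.compile(r'&#(x[0-9A-Fa-f]+|[0-9]+);')

def _keep_if_valid(match):
    ref = match.group(1)
    code = int(ref[1:], 16) if ref.startswith('x') else int(ref)
    # Valid XML 1.0 characters per http://www.w3.org/TR/REC-xml/#charsets
    if code in (0x09, 0x0A, 0x0D) or 0x20 <= code <= 0xD7FF \
            or 0xE000 <= code <= 0xFFFD or 0x10000 <= code <= 0x10FFFF:
        return match.group(0)
    return ''

def strip_invalid(xml_text):
    """Remove control characters and character references that are not
    valid in XML 1.0."""
    return CHAR_REF.sub(_keep_if_valid, RAW_CTRL.sub('', xml_text))
```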

Steve

-----Original Message-----
From: Benjamin O'Steen [mailto:[EMAIL PROTECTED]
Sent: 19 November 2008 14:56
To: Richard Green; [email protected]; drama
Subject: Re: [Fedora-commons-users] Indexing errors



[Ben's and Richard's messages quoted in full; trimmed - see Messages 2 and 3
above.]




------------------------------


------------------------------

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users


End of Fedora-commons-users Digest, Vol 21, Issue 15
****************************************************


