I am trying to parse metadata records with the Apache Commons Digester, which uses Xerces.
Unfortunately, almost all of that metadata is declared as UTF-8, which causes a
java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
when I try to read an XML file such as the one attached below.
In the archive I found a hint that recommends changing the declared encoding to ISO-8859-1, but of course this does not help if it can only be done at digestion time, because the records arrive with the UTF-8 declaration already in place.
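One workaround I am considering is sketched below. It rests on two assumptions: that the bytes are really ISO-8859-1 despite the declaration, and that the parser is handed a character stream rather than a byte stream. According to the SAX documentation for InputSource, a parser that receives a character stream reads it directly and disregards the encoding declaration. The sketch uses the JDK's SAX parser so that it is self-contained; the class name Latin1Workaround and the method parseAsLatin1 are made up for illustration. Digester also has a parse(InputSource) method, so the same InputSource could presumably be passed to it, but I have not verified that here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Latin1Workaround {

    // Parses the given bytes as ISO-8859-1 (an assumption about the data)
    // and returns all character data found in the document.
    static String parseAsLatin1(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Decode the bytes ourselves; with a character stream the parser
        // does not use the encoding named in the XML declaration.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.ISO_8859_1);
        InputSource source = new InputSource(reader);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A tiny stand-in record: declared as UTF-8, encoded as ISO-8859-1.
        byte[] bytes = "<?xml version=\"1.0\" encoding=\"utf-8\"?><s>Virtualit\u00e4t</s>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseAsLatin1(bytes));
    }
}
```

This would only be right for records that are in fact ISO-8859-1; genuinely UTF-8 records decoded this way would come out garbled, so the two cases would have to be told apart first.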
Any suggestions?
kind regards thomas
<?xml version="1.0" encoding="utf-8"?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Medienphilosophie(n)</dc:title>
<dc:creator>Hartmann, Dr. Frank</dc:creator>
<dc:subject>Medienphilosophie, Theorie der Virtualität, Cyberphilosophie</dc:subject>
<dc:description>Die Frage, ob
...
wird, auflösen wird lassen. Eine Rekonstruktion relevanter
Positionen.</dc:description> <dc:date>2002-01-01</dc:date>
<dc:type>Book Chapter</dc:type>
<dc:identifier>http://sammelpunkt.philo.at:8080/archive/00000103/</dc:identifier> <dc:format>html http://sammelpunkt.philo.at:8080/archive/00000103/01/medienphilosophie.html</dc:format></oai_dc:dc>
