I suggest looking at Rome's [1] XmlReader [2][3][4][5]. For instance, it can
be used like this:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import com.sun.syndication.io.XmlReader;
import com.sun.syndication.io.XmlReaderException;

// "url" is an existing java.net.URL pointing at the document
InputSource inputSource = new InputSource(url.toExternalForm());
try {
    // XmlReader detects the encoding and decodes the stream for us
    XmlReader reader = new XmlReader(url);
    inputSource.setCharacterStream(reader);
    inputSource.setEncoding(reader.getEncoding());
} catch (XmlReaderException xre) {
    // This is somewhat unlikely to happen, but doesn't hurt to have
    // extra fallback, which XmlReader conveniently allows for by
    // providing access to the original unconsumed inputstream via
    // the XmlReaderException
    inputSource.setByteStream(xre.getInputStream());
    String encoding = xre.getBomEncoding();
    if (encoding == null) encoding = xre.getXmlGuessEncoding();
    if (encoding == null) encoding = xre.getXmlEncoding();
    inputSource.setEncoding(encoding != null ? encoding : "UTF-8");
}
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
parser.parse(inputSource);
[1] https://rome.dev.java.net/
[2] https://rome.dev.java.net/apidocs/1_0/com/sun/syndication/io/XmlReader.html
[3] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReader.java?rev=1.19&view=markup
[4] https://rome.dev.java.net/apidocs/1_0/com/sun/syndication/io/XmlReaderException.html
[5] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReaderException.java?rev=1.1&view=markup
Jake
On Fri, 24 Apr 2009 12:16:29 -0400
Michael Glavassevich <mrgla...@ca.ibm.com> wrote:
Hi Elliotte,
I had a peek at your article and see in the code snippets that what you're
calling the "actual encoding" or "real encoding" actually isn't. The one
passed to startDocument() in XNI is the auto-detected encoding, the one
which Xerces guessed by peeking at the first few bytes in the document. The
actual encoding may not be known until the XML declaration has been read
and at this point it hasn't been read yet.
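
To make the "peeking at the first few bytes" step concrete, here is a minimal sketch of that kind of guess, loosely following the byte patterns listed in Appendix F of the XML 1.0 specification. The class and method names (EncodingSniffer, guess) are hypothetical, this is not Xerces's actual code, and UCS-4 and EBCDIC are left out for brevity.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Xerces or Rome. */
final class EncodingSniffer {
    private EncodingSniffer() {}

    /**
     * Guesses an encoding from the leading bytes of a document.
     * Returns null when none of the known patterns match.
     * Simplified: UCS-4 and EBCDIC patterns are not handled.
     */
    static String guess(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";          // UTF-8 BOM
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";             // big-endian BOM
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";             // little-endian BOM
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without BOM
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without BOM
        // "<?xm": some ASCII-compatible encoding. UTF-8 is only a working
        // guess here; the XML declaration may name a different encoding.
        if (startsWith(b, 0x3C, 0x3F, 0x78, 0x6D)) return "UTF-8";
        return null;
    }

    /** True if the array begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        // The guess is made before the declaration is read, so it can
        // differ from the encoding the document actually declares.
        System.out.println("Guessed: " + guess(doc));
    }
}
```

For the ISO-8859-1 document in main, the guess comes back as UTF-8, which illustrates why the auto-detected value is not the same thing as the actual encoding.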
In SAX it's not legal to read from the Locator in startDocument(), so any
calls to the Locator you make in that method may not be correct, and
generally won't be with Xerces, because at the point it calls
startDocument() it hasn't read enough of the document yet to be sure of
what the actual encoding is. If it looked like it was working, you were
probably just getting lucky because the documents you tried were in UTF-8
or UTF-16. Specifically, the Javadoc [1] says: "Note that the locator will
return correct information only during the invocation SAX event callbacks
after startDocument returns and before endDocument is called. The
application should not attempt to use it at any other time." So you have to
wait until an event following startDocument() before you can read the
encoding (or anything else) from the Locator.
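
As a rough sketch of that advice, the handler below only stores the Locator in setDocumentLocator() and waits until the first startElement() before asking for the encoding through the standard org.xml.sax.ext.Locator2 interface. The class name EncodingCapturingHandler and its accessors are hypothetical, and it assumes the parser's Locator implements Locator2; if it does not, the encoding simply stays null.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler for illustration. */
class EncodingCapturingHandler extends DefaultHandler {
    private Locator locator;
    private boolean captured;
    private String encoding;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Only keep the reference here; it is too early to query it.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The root element's startElement() always fires after
        // startDocument() has returned, so the Locator may be read now.
        if (!captured) {
            captured = true;
            if (locator instanceof Locator2) {
                encoding = ((Locator2) locator).getEncoding();
            }
        }
    }

    /** True once the first startElement() event has been seen. */
    boolean wasCaptured() { return captured; }

    /** Encoding reported by the Locator, or null if unavailable. */
    String getEncoding() { return encoding; }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        EncodingCapturingHandler handler = new EncodingCapturingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(doc), handler);
        System.out.println("Encoding reported by Locator2: " + handler.getEncoding());
    }
}
```

The exact string returned (and its case) depends on the parser, so it is best treated as informational rather than compared against a fixed value.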
Thanks.
[1] http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html#setDocumentLocator(org.xml.sax.Locator)
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
Elliotte Harold <elh...@ibiblio.org> wrote on 04/24/2009 08:48:52 AM:
Do you want the declared encoding or the real encoding? If the
latter, see here:
http://www.ibm.com/developerworks/library/x-tipsaxxni/
--
Elliotte Rusty Harold
elh...@ibiblio.org
---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-users-h...@xerces.apache.org