Hi John,
I know the API, and I was planning on reusing it by changing ReaderMgr from
    if (src.getEncoding())
    {
        retVal = new (fMemoryManager) XMLReader
        (
            src.getPublicId()
            , src.getSystemId()
            , newStream
            , src.getEncoding()
to
    const XMLCh* encoding = src.getEncoding();
    if (encoding == 0)
        encoding = newStream->getContentType();
    if (encoding)
    {
        retVal = new (fMemoryManager) XMLReader
        (
            src.getPublicId()
            , src.getSystemId()
            , newStream
            , encoding
i.e. if the InputSource doesn't have a user-specified encoding, check if
the actual stream carries an encoding.
However, getContentType returns the full header value, e.g.
"application/xhtml+xml; charset=koi8-r", rather than just an encoding
name. I guess you need getContentType to stay the same to support
XQilla's unparsed-text(), so I am inclined to add a getEncoding method
to BinInputStream.
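
For illustration, here is a minimal sketch of how the charset could be
pulled out of such a header value. It uses plain std::string rather than
XMLCh, and charsetFromContentType and trim are hypothetical helper names,
not existing Xerces-C functions. It assumes parameters are separated by
';' and does not handle a ';' inside a quoted value.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: strips leading and trailing spaces and tabs.
static std::string trim(const std::string& s)
{
    const char* ws = " \t";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return std::string();
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical helper, not part of Xerces-C: returns the value of the
// "charset" parameter of a Content-Type header value, or an empty
// string if there is no such parameter.
std::string charsetFromContentType(const std::string& contentType)
{
    const std::string::size_type npos = std::string::npos;
    std::string::size_type pos = contentType.find(';');
    while (pos != npos)
    {
        // One parameter runs from just after this ';' to the next ';'.
        std::string::size_type end = contentType.find(';', pos + 1);
        std::string param = contentType.substr(
            pos + 1, end == npos ? npos : end - pos - 1);

        std::string::size_type eq = param.find('=');
        if (eq != npos)
        {
            // Parameter names are matched case-insensitively.
            std::string name = trim(param.substr(0, eq));
            std::transform(name.begin(), name.end(), name.begin(),
                [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
            if (name == "charset")
            {
                std::string value = trim(param.substr(eq + 1));
                // Drop surrounding double quotes, if any.
                if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
                    value = value.substr(1, value.size() - 2);
                return value;
            }
        }
        pos = end;
    }
    return std::string();
}
```

With the example above, charsetFromContentType would yield "koi8-r", and
an empty result would mean "fall back to auto-detection".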
Alberto
On 13/06/2011 12:04, John Snelson wrote:
Hi Alby,
I added BinInputStream::getContentType() some time ago so that I could
accomplish this kind of thing in XQilla. My guess is that you can build
Xerces-C stream encoding support on top of this. InputSource currently
has a getEncoding() method, but the HTTP call hasn't been made by this
point - maybe BinInputStream also needs a getEncoding() method which
takes its default from the InputSource?
John
On 09/06/11 13:44, Alberto Massari (JIRA) wrote:
[
https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046508#comment-13046508
]
Alberto Massari commented on XERCESC-1967:
------------------------------------------
I don't agree with your request to reverse the priorities, but that's a
discussion that shouldn't be held here. Good luck in trying to convince W3C.
The XML spec says that the BOM and the internal encoding declaration take
precedence when the XML is in a *file*, because it is likely that no transcoding
has been performed on it. In all other scenarios (when the XML is in a byte
stream), the component that does the wrapping should take care of telling the
parser the new setting. This is what Xerces is doing now, and in my opinion it
is correct and shouldn't be changed.
What is missing in Xerces is the capability of propagating the content-type
read from the HTTP stream to the parser. Whether the content type is text/xml
or application/xml only affects the default encoding used when no encoding is
specified. In case 8.20 an encoding is specified, so it doesn't matter which
one (text/xml or application/xml) was used.
In short, if you think that pparse (or saxcount) should refuse to parse your
web page (which has an HTTP content type specifying Korean, plus a UTF-8 BOM),
I agree and I will try to fix it.
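
As a rough sketch of the check this implies (not the actual Xerces-C fix),
the conflict can be detected by comparing the first bytes of the stream
with the charset announced by the transport. All names below are
hypothetical, and only the literal label "utf-8" is treated as UTF-8;
alias handling is left out.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM
// (bytes EF BB BF).
bool hasUtf8Bom(const unsigned char* data, std::size_t len)
{
    return len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}

// Hypothetical helper: case-insensitive comparison against "utf-8".
// Assumption: aliases such as "utf8" are not recognized here.
bool isUtf8Label(const std::string& charset)
{
    std::string lower(charset);
    std::transform(lower.begin(), lower.end(), lower.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower == "utf-8";
}

// True if the transport announced a charset other than UTF-8 while the
// document itself starts with a UTF-8 BOM. An empty externalCharset
// means the transport gave no charset, so there is nothing to conflict.
bool bomConflictsWithCharset(const unsigned char* data, std::size_t len,
                             const std::string& externalCharset)
{
    if (externalCharset.empty())
        return false;
    return hasUtf8Bom(data, len) && !isUtf8Label(externalCharset);
}
```

A parser front end could report a fatal error when this returns true,
which matches the "refuse to parse" behaviour discussed above.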
Xerces ignores (deletes, swallows, ignores) the UTF-8 BOM and also ignores the
charset parameter of the HTTP content-type: header
--------------------------------------------------------------------------------------------------------------------------------
Key: XERCESC-1967
URL: https://issues.apache.org/jira/browse/XERCESC-1967
Project: Xerces-C++
Issue Type: Bug
Components: Non-Validating Parser
Affects Versions: 3.1.1
Environment: Mac OS X Snow Leopard (Intel).
(http://mirrorservice.nomedia.no/apache.org//xerces/c/3/binaries/xerces-c-3.1.1-x86-macosx-gcc-4.0.tar.gz)
Also tested with the XMLmind XML editor on the same platform.
Reporter: Leif Halvard Silli
Original Estimate: 4h
Remaining Estimate: 4h
[1] http://www.w3.org/mid/[email protected]
[2] http://www.w3.org/mid/[email protected]
It is an XML 1.0 spec violation: a well-formedness violation.
Test cases without XML declaration: http://malform.no/testing/html5/bom/
Test cases *with* XML declaration to be added later.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]