That sounds like a good plan :-).

On 13/06/11 11:18, Alberto Massari wrote:
Hi John,
I know the API, and I was planning on reusing it by changing ReaderMgr from

    if (src.getEncoding())
    {
        retVal = new (fMemoryManager) XMLReader
        (
            src.getPublicId()
            , src.getSystemId()
            , newStream
            , src.getEncoding()

to

    const XMLCh* encoding = src.getEncoding();
    if (encoding == 0)
        encoding = newStream->getContentType();
    if (encoding)
    {
        retVal = new (fMemoryManager) XMLReader
        (
            src.getPublicId()
            , src.getSystemId()
            , newStream
            , encoding

i.e. if the InputSource doesn't have a user-specified encoding, check if
the actual stream carries an encoding.

However, the getContentType returns the full header value, e.g.
"application/xhtml+xml; charset=koi8-r", instead of an encoding; I
guess you need getContentType to stay the same for supporting XQilla's
unparsed-text(), so I was inclined to add a getEncoding method to
BinInputStream.
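
To illustrate the gap described above, here is a minimal sketch of how a
charset could be pulled out of a full Content-Type value. The function
name extractCharset is hypothetical and is not part of Xerces-C or
XQilla; it works on std::string rather than XMLCh for brevity, and it
assumes no ';' appears inside a quoted parameter value.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not a Xerces-C API: returns the value of the
// charset parameter of a Content-Type header value, or "" if absent.
std::string extractCharset(const std::string& contentType)
{
    // Lower-cased copy so the parameter name matches case-insensitively.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Parameters follow the media type, each introduced by ';'.
    std::string::size_type pos = lower.find(';');
    while (pos != std::string::npos)
    {
        // Skip the ';' and any whitespace before the parameter name.
        const std::string::size_type start = lower.find_first_not_of(" \t", pos + 1);
        if (start == std::string::npos)
            break;
        const std::string::size_type next = lower.find(';', start);

        if (lower.compare(start, 8, "charset=") == 0)
        {
            const std::string::size_type valStart = start + 8;
            const std::string::size_type valEnd =
                (next == std::string::npos) ? contentType.size() : next;
            // Take the value from the original string to preserve its case.
            std::string value = contentType.substr(valStart, valEnd - valStart);

            // Trim trailing whitespace, then strip surrounding quotes.
            value.erase(value.find_last_not_of(" \t") + 1);
            if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
                value = value.substr(1, value.size() - 2);
            return value;
        }
        pos = next;
    }
    return "";
}
```

With the header value quoted above, extractCharset would yield "koi8-r",
which is the part a getEncoding method on the stream would need to
report.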

Alberto


Il 13/06/2011 12:04, John Snelson ha scritto:
Hi Alby,

I added BinInputStream::getContentType() some time ago so that I could
accomplish this kind of thing in XQilla. My guess is that you can build
Xerces-C stream encoding support on top of this. InputSource currently
has a getEncoding() method, but the HTTP call hasn't been made by this
point - maybe BinInputStream also needs a getEncoding() method which
takes its default from the InputSource?

John

On 09/06/11 13:44, Alberto Massari (JIRA) wrote:
[
https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046508#comment-13046508
]

Alberto Massari commented on XERCESC-1967:
------------------------------------------

I don't agree with your request to reverse the priorities, but that's
a discussion that shouldn't take place here. Good luck trying to
convince the W3C.
The XML spec says that the BOM and the internal encoding take
precedence when the XML is in a *file*, because it is likely that no
transcoding has been performed on it. In all other scenarios (when
the XML is in a byte stream), the component that does the wrapping
should take care of telling the parser the new setting. This is what
Xerces is doing now, and in my opinion it's correct and shouldn't be
changed.
What is missing in Xerces is the ability to propagate the
content-type read from the HTTP stream to the parser. Whether the
content type is text/xml or application/xml only affects which
default encoding applies when the content-type does not specify one.
In case 8.20 an encoding is specified, so it doesn't matter
which one (text/xml or application/xml) was used.
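
The precedence described above can be sketched as a small function. This
is only an illustration of the rule as stated in this comment, not
Xerces-C code; XmlOrigin and chooseEncoding are hypothetical names, and
encodings are plain std::string values with "" meaning "not specified".

```cpp
#include <string>

// Where the XML bytes came from. Hypothetical enum for this sketch only.
enum class XmlOrigin { File, ByteStream };

// externalEncoding: what the wrapping component reports (for example the
//                   charset of an HTTP Content-Type header), "" if none.
// internalEncoding: what the BOM or the encoding declaration indicates,
//                   "" if none.
std::string chooseEncoding(XmlOrigin origin,
                           const std::string& externalEncoding,
                           const std::string& internalEncoding)
{
    // For a byte stream, the setting supplied by the wrapper wins.
    if (origin == XmlOrigin::ByteStream && !externalEncoding.empty())
        return externalEncoding;

    // For a file (or a stream without external information), trust the
    // BOM and internal encoding.
    if (!internalEncoding.empty())
        return internalEncoding;

    // May be empty: the caller then applies its default encoding.
    return externalEncoding;
}
```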

In short, if you think that pparse (or saxcount) should refuse to
parse your web page (which has an HTTP content type specifying Korean
plus a UTF-8 BOM), I agree and I will try to fix it.
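
A check of the kind described above could look roughly like the
following. It is a sketch under stated assumptions: the function names
are hypothetical and not part of Xerces-C, the declared charset is
assumed to have already been extracted from the Content-Type header, and
only the exact name "UTF-8" (case-insensitive) is treated as compatible
with a UTF-8 BOM.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// True when the buffer starts with the UTF-8 byte order mark EF BB BF.
bool hasUtf8Bom(const unsigned char* data, std::size_t size)
{
    return size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}

// Case-insensitive ASCII comparison of two encoding names.
bool sameEncodingName(const std::string& a, const std::string& b)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y)
                      { return std::tolower(x) == std::tolower(y); });
}

// Hypothetical check: reports a conflict when the transport declares a
// charset other than UTF-8 but the body begins with a UTF-8 BOM.
bool bomConflictsWithCharset(const std::string& declaredCharset,
                             const unsigned char* data, std::size_t size)
{
    if (declaredCharset.empty() || !hasUtf8Bom(data, size))
        return false;
    return !sameEncodingName(declaredCharset, "UTF-8");
}
```

A parser front end could call bomConflictsWithCharset on the first bytes
of the response and raise an error instead of silently dropping the BOM;
the charset "euc-kr" used when trying this out is just an illustrative
Korean encoding name.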


Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also
ignores the charset parameter of the HTTP content-type: header
--------------------------------------------------------------------------------------------------------------------------------


Key: XERCESC-1967
URL: https://issues.apache.org/jira/browse/XERCESC-1967
Project: Xerces-C++
Issue Type: Bug
Components: Non-Validating Parser
Affects Versions: 3.1.1
Environment: Mac OS X Snow Leopard (Intel).
(http://mirrorservice.nomedia.no/apache.org//xerces/c/3/binaries/xerces-c-3.1.1-x86-macosx-gcc-4.0.tar.gz)

Also tested the XMLmind XML editor on the same platform.
Reporter: Leif Halvard Silli
Original Estimate: 4h
Remaining Estimate: 4h

[1]
http://www.w3.org/mid/[email protected]
[2]
http://www.w3.org/mid/[email protected]
It is an XML 1.0 spec violation: a well-formedness violation.
Test cases without XML declaration:
http://malform.no/testing/html5/bom/
Test cases *with* XML declaration to be added later.
--
This message is automatically generated by JIRA.
For more information on JIRA, see:
http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
