Writing/Outputting a BOM for UTF-16 documents in Xerces 2.4.0

dara Tue, 10 Jan 2006 08:03:17 -0800

Hi All,

I was trying to add this comment to JIRA issue [http://issues.apache.org/jira/browse/XERCESC-681] but it informs me that I do not have permission to comment on that item. Please accept my comments here instead, as I have only just subscribed to the JIRA system.

This remark relates to Xerces-C++ 2.4.0, which I am currently testing some code with. I won't have time to test this with later source versions for quite a while.

Also, I did not find very much in the JIRA system or the mail archives about writing a BOM. I found some comments, but nothing definitive on whether or not it works with Xerces yet. This bug entry also appears non-definitive (unverified), so this may well be in the same state in later versions.


Regards

Dara

---BEGIN COMMENT -- JIRA issue =http://issues.apache.org/jira/browse/XERCESC-681 ---


Hi all,

I wasn't sure of the status of this, so I have just run the DOMPrint example in xerces-c++ 2.4.0 (linux) and have successfully output both the big endian and little endian Byte Order Marks for UTF-16 output.

I was looking for an option to write the BOM and didn't find anything helpful in docs, mail lists, etc. until I came across this issue.

I am now using the code as per the DOMPrint example to set the option on my writer when the BOM is required.
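
For reference, the BOM is the character U+FEFF serialized in the byte order of the output: FF FE for little endian and FE FF for big endian. The following is only a minimal standalone sketch of that mapping, not Xerces code; the helper name bomFor and the plain "UTF-16LE" / "UTF-16BE" names are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of Xerces: returns the byte order mark
// for a UTF-16 encoding name, or an empty vector for anything else.
std::vector<std::uint8_t> bomFor(const std::string& encoding)
{
    if (encoding == "UTF-16LE")
        return {0xFF, 0xFE};   // U+FEFF, low byte first
    if (encoding == "UTF-16BE")
        return {0xFE, 0xFF};   // U+FEFF, high byte first
    return {};                 // this sketch writes no BOM for other names
}
```

These are the first two bytes one would expect to see in a hex dump of the DOMPrint output when the BOM option is enabled.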


*** NOTE ***

One potential issue I noticed in my trials :

Generally, if the writer does not have an encoding set, but the DOM to be written does (encoding or ActualEncoding is set), then the DOM encoding value is used.

However, when we activate the writing of a BOM, the write fails with a segfault (for iconv at least) due to the following:



(I called writeNode() on the writer)

+ In this snippet, "fEncoding" is null. I assumed this was because no writer encoding was set, and my tests appear to substantiate this.


--- 8< ---

void DOMWriterImpl::processBOM()
{
   // if the feature is not set, don't output bom
   if (!getFeature(BYTE_ORDER_MARK_ID))
       return;

   if ((XMLString::compareIString(fEncoding, XMLUni::fgUTF16LEncodingString) == 0) ||
       (XMLString::compareIString(fEncoding, XMLUni::fgUTF16LEncodingString2) == 0) )


--- >8 ---

+ Thus when we get here, the first parameter is 0x0 and the second is (XMLCh)"UTF-16(LE)"


--- 8< ---

int XMLString::compareIString(  const   XMLCh* const    str1
                               , const XMLCh* const    str2)
{
   // Refer this one to the transcoding service
   return XMLPlatformUtils::fgTransService->compareIString(str1, str2);
}

--- >8 ---

+ Thus we fail here while trying to de-reference cptr1 in the "while" statement.


--- 8< ---

//---------------------------------------------------------------------------
//  IconvTransService: The virtual transcoding service API
//---------------------------------------------------------------------------

int IconvTransService::compareIString(  const   XMLCh* const    comp1
                                       , const XMLCh* const    comp2)
{
   const XMLCh* cptr1 = comp1;
   const XMLCh* cptr2 = comp2;

   while ( (*cptr1 != 0) && (*cptr2 != 0) )
   {
       wint_t wch1 = towupper(*cptr1);
       wint_t wch2 = towupper(*cptr2);
       if (wch1 != wch2)
           break;

       cptr1++;
       cptr2++;
   }
   return (int) ( towupper(*cptr1) - towupper(*cptr2) );
}

--- >8 ---
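
To illustrate the failure mode, here is a minimal standalone sketch of a null-tolerant version of the same comparison. It is not the Xerces implementation or a proposed patch: the type XMLCh16 (a stand-in for XMLCh), the name compareIStringSafe, and the choice that a null pointer sorts before any non-null string are all assumptions of mine, and it assumes a C++11 compiler.

```cpp
#include <cwctype>

// Stand-in for Xerces' XMLCh (a 16-bit code unit); an assumption for this sketch.
typedef char16_t XMLCh16;

// Hypothetical null-safe variant of the case-insensitive compare.
// Two null pointers compare equal; a null sorts before any non-null string.
int compareIStringSafe(const XMLCh16* comp1, const XMLCh16* comp2)
{
    if (comp1 == nullptr || comp2 == nullptr)
        return (comp1 == comp2) ? 0 : (comp1 == nullptr ? -1 : 1);

    // Same loop as the original: advance while both strings match, ignoring case.
    while (*comp1 != 0 && *comp2 != 0)
    {
        if (std::towupper(*comp1) != std::towupper(*comp2))
            break;
        ++comp1;
        ++comp2;
    }
    return static_cast<int>(std::towupper(*comp1)) -
           static_cast<int>(std::towupper(*comp2));
}
```

With a guard like this, a null fEncoding would simply compare unequal to the UTF-16 names and no BOM would be written, instead of the process crashing.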

If I set a writer encoding, regardless of whether or not a DOM encoding or actual encoding is set, then it looks like it's working fine.

I don't know if this is desired, but I would have assumed that if the writer has no encoding set and generally takes its encoding from the item to be written, it should also do so when comparing encodings for the purposes of writing a BOM.....?
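
The behaviour I would have expected can be sketched as follows. This is only an illustration of the idea, not how DOMWriterImpl is actually written: the function name resolveEncoding, the use of std::string, the precedence of encoding over actual encoding, and the empty-string result when nothing is set are all assumptions.

```cpp
#include <string>

// Hypothetical helper: choose the encoding name to use for the BOM check
// with the same fallback the writer appears to use for the output itself.
// An empty string means "not set".
std::string resolveEncoding(const std::string& writerEncoding,
                            const std::string& docEncoding,
                            const std::string& docActualEncoding)
{
    if (!writerEncoding.empty())
        return writerEncoding;      // explicit writer setting wins
    if (!docEncoding.empty())
        return docEncoding;         // assumed order: encoding first
    if (!docActualEncoding.empty())
        return docActualEncoding;   // then actual encoding
    return std::string();           // nothing set: caller should skip the BOM
}
```

The BOM check would then compare against the resolved name rather than the raw writer encoding, so a document-level "UTF-16LE" would still produce a BOM and an unset encoding would never reach the string comparison as a null pointer.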


Regards

Dara

---END COMMENT -- JIRA issue =http://issues.apache.org/jira/browse/XERCESC-681 ---


--
Regards,

Dara Mulvihill,

Rísarís Ltd,

http://www.risaris.com

++353 404 64009





