Re: Fwd: Why does Xerces modify an invalid XML file while parsing?

neetha patil Wed, 18 Jul 2012 04:01:57 -0700

Dear Alberto,

Thank you for your patience and for all the valuable information.


One final question: the link mentioned in my previous mail says that
setting 'XMLUni::fgXercesContinueAfterFatalError' to true may leave the
parser's behavior *undetermined*. Is the auto-modification we have been
discussing one such behaviour? I would also be grateful if you could
briefly explain other such behaviours.

Regards,
Neetha


On Wed, Jul 18, 2012 at 1:49 PM, Alberto Massari <
[email protected]> wrote:

>  Il 17/07/2012 08:21, neetha patil ha scritto:
>
> Dear All,
>
>  Thank you Alberto for guiding me to get rid of the "Unknown element"
> validation errors.
>
> I tried setting the parameter 'XMLUni::fgDOMErrorHandler' for the
> DOMBuilder parser but there it had no such parameter and also I am using
> the DOM document which is returned after parsing.
>
>
> I forgot that in the new DOM L3 the parameters are set through an
> intermediate object. The correct call should be something like
> (*pParser)->getDOMConfiguration()->setParameter. (Double-check it, as I may
> be misremembering the name, but that should give you the idea.)
>
>
>
> The DOMBuilder parser (while parsing against the schema) reports the
> first schema-related error and continues with further parsing and reporting
> of other schema-related errors (if any). Is it possible for the DOMBuilder
> parser to behave in the same way (and not do any auto-modification) when
> there are invalid XML statement(s) like the one reported in my previous
> mail?
>
>
> No; validation errors are not fatal, while invalid XML syntax may be
> non-recoverable. In your case the parser tries to find a new
> synchronization point at the first ">" it finds, but if you had missed the
> closing quote at the end of an attribute you would be in much bigger
> trouble.
>
> What I am trying to get across is that invalid XML cannot
> produce a DOM representation that reflects the input XML, because
> serializing a DOM representation gives you *valid* XML, not the
> original invalid one. The correct thing to do is to reject the input XML you
> got; if you still want to be able to read and manipulate it, what you call
> "auto-modification" is the only thing you can do.
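
The resynchronisation described above can be illustrated with a small toy scanner. It is only a model, not Xerces-C++ code, and the names StartTag, scanStartTag and serialize are made up for the illustration. It assumes double-quoted attributes only. On a syntax error inside a start tag it records the error, drops the rest of the tag and resumes after the first '>'; an attribute value with a missing closing quote simply runs on to the next quote character it finds.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT Xerces-C++ code. All names here are hypothetical.
struct StartTag {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::vector<std::string> errors;
    std::size_t resumeAt = std::string::npos;  // index just past the '>' used to resume
};

static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 ||
           c == '_' || c == '-' || c == ':' || c == '.';
}

static std::string readName(const std::string& s, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < s.size() && isNameChar(s[pos])) ++pos;
    return s.substr(start, pos - start);
}

// Record an error, discard the rest of the tag, resume after the first '>'.
static void resync(StartTag& tag, const std::string& s, std::size_t pos,
                   const std::string& message) {
    tag.errors.push_back(message);
    const std::size_t gt = s.find('>', pos);
    tag.resumeAt = (gt == std::string::npos) ? gt : gt + 1;
}

// Scan one start tag; s[0] is assumed to be '<'.
StartTag scanStartTag(const std::string& s) {
    StartTag tag;
    std::size_t pos = 1;
    tag.name = readName(s, pos);
    while (pos < s.size()) {
        while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
        if (pos >= s.size()) break;
        if (s[pos] == '>') { tag.resumeAt = pos + 1; return tag; }
        const std::string attr = readName(s, pos);
        if (attr.empty()) { resync(tag, s, pos, "Expected an attribute name"); return tag; }
        if (pos + 1 >= s.size() || s[pos] != '=' || s[pos + 1] != '"') {
            resync(tag, s, pos, "Expected =\"value\" after attribute name");
            return tag;
        }
        const std::size_t close = s.find('"', pos + 2);
        if (close == std::string::npos) {
            resync(tag, s, pos, "Unterminated attribute value");
            return tag;
        }
        tag.attributes.emplace_back(attr, s.substr(pos + 2, close - (pos + 2)));
        pos = close + 1;
    }
    tag.errors.push_back("Unexpected end of input inside a tag");
    return tag;
}

static std::string escapeAttr(const std::string& v) {
    std::string out;
    for (char c : v) {
        if (c == '&') out += "&amp;";
        else if (c == '<') out += "&lt;";
        else if (c == '"') out += "&quot;";
        else out += c;
    }
    return out;
}

// Writes back only what survived scanning, so the output is a well-formed tag.
std::string serialize(const StartTag& tag) {
    std::string out = "<" + tag.name;
    for (const auto& a : tag.attributes)
        out += " " + a.first + "=\"" + escapeAttr(a.second) + "\"";
    return out + ">";
}
```

In this model, scanning <name="abc"> records 'Expected an attribute name' and leaves a tag that serialises as <name>. Scanning <a x="1><b y="2"> (closing quote missing after 1) produces an attribute x whose value has swallowed the start of the following tag, which is the "much bigger trouble" case.
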
>
> Alberto
>
>
>
>
> Regards,
> Neetha
>
> On Mon, Jul 16, 2012 at 5:03 PM, Alberto Massari <
> [email protected]> wrote:
>
>>  Hi Neetha,
>> the correct thing to do would be to not make these calls
>>
>>
>>              (*pParser)->setFeature( XMLUni::fgXercesSchema, true );
>>              (*pParser)->setFeature( XMLUni::fgXercesSchemaFullChecking, true );
>>              (*pParser)->setFeature( XMLUni::fgDOMValidation, true );
>>              (*pParser)->setFeature( XMLUni::fgXercesCacheGrammarFromParse, true );
>>
>> when bValidate == false, as you are asking to validate against a schema
>> that you are not going to provide. This will remove the "Unknown element"
>> validation errors. As for what you call "auto-modification", it's
>> the correct behaviour: <name="abc"> is not a valid XML statement (either
>> the tag name is missing and "name" is an attribute, or "name" is the
>> element and it's missing a space followed by the attribute name). If you
>> force the parser to continue, the DOM tree you get back will be incomplete,
>> at best.
>> If you really want to get a DOM tree out of that invalid XML, you could
>> attach a W3C DOMErrorHandler (different from the one you provided) using
>> (*pParser)->setParameter(XMLUni::fgDOMErrorHandler, domErrorHandlerVar).
>> This class has a handleError method where you can check what happened by
>> examining the DOMError argument and the DOMLocation inside it (it contains
>> the DOM node where the error was located). If you return "true", the parser
>> will try to continue parsing; if you return "false", parsing will
>> be aborted.
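
The return-value contract of handleError described above can be sketched without the library. The types below (Severity, ParseError, CollectingHandler, ParseOutcome, reportAll) are hypothetical stand-ins invented for this illustration; they are not the Xerces-C++ DOMErrorHandler or DOMError classes, and the loop only imitates how a parser might consult the handler.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Xerces-C++ types.
enum class Severity { Warning, Error, FatalError };

struct ParseError {
    Severity severity;
    std::string message;
    int line;
};

// Collects every reported error and decides whether parsing should go on.
class CollectingHandler {
public:
    explicit CollectingHandler(bool continueAfterFatal)
        : continueAfterFatal_(continueAfterFatal) {}

    // Same contract as described above: true = keep going, false = abort.
    bool handleError(const ParseError& error) {
        seen_.push_back(error);
        if (error.severity == Severity::FatalError) {
            sawFatal_ = true;
            return continueAfterFatal_;
        }
        return true;  // warnings and validation errors do not stop this model
    }

    bool sawFatal() const { return sawFatal_; }
    const std::vector<ParseError>& seen() const { return seen_; }

private:
    bool continueAfterFatal_;
    bool sawFatal_ = false;
    std::vector<ParseError> seen_;
};

struct ParseOutcome {
    std::size_t delivered;  // number of errors passed to the handler
    bool aborted;           // true if the handler returned false
};

// Stand-in for a parser's reporting loop: stop at the first 'false'.
ParseOutcome reportAll(const std::vector<ParseError>& errors,
                       CollectingHandler& handler) {
    ParseOutcome outcome{0, false};
    for (const ParseError& error : errors) {
        ++outcome.delivered;
        if (!handler.handleError(error)) {
            outcome.aborted = true;
            break;
        }
    }
    return outcome;
}
```

A caller following this pattern would check sawFatal() after the parse and treat the resulting tree as incomplete before editing or saving it, whichever value the handler returned.
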
>>
>> Alberto
>>
>>
>> Il 16/07/2012 12:06, neetha patil ha scritto:
>>
>> Dear Alberto,
>>
>> Thank you for the quick reply.
>>
>> As I do not load the grammar (schema) into the parser, it reports errors like
>> "Unknown element.." etc. for all the XML tags until it hits the invalid
>> tag, for which it gives the error 'Expected an attribute name' and aborts
>> parsing, as you mentioned.
>>
>> So I set the feature 'XMLUni::fgXercesContinueAfterFatalError' to true
>> and got the complete file parsed. However the line containing the invalid
>> tag was modified as follows:-
>> ...
>> ...
>>  <Services>
>>      ...
>>      ...
>> </Services>
>> ...
>> <name>
>> ...
>> ...
>> </name>
>> ...
>> ...
>>
>> Since http://xml.apache.org/xerces-c-new/program-dom.html says that
>> setting this feature to true might result in *undetermined* behavior
>> of the parser, is there any other way for the parser to report the error
>> and continue parsing? Also, can we prevent the auto-modification (in this
>> case, the modification from <name="abc"> to <name>)?
>>
>> Thanks
>>
>> Regards,
>> Neetha
>>
>> On Mon, Jul 16, 2012 at 2:39 PM, Alberto Massari <
>> [email protected]> wrote:
>>
>>>  Hi,
>>> Xerces doesn't modify your document; you should check the error handler
>>> to see if the parsing was aborted because of an error. In this case the
>>> returned DOM tree would be complete up to the position of the error.
>>>
>>> Alberto
>>>
>>> Il 16/07/2012 10:25, neetha patil ha scritto:
>>>
>>>  Dear All,
>>>
>>> I am using Xercesc_2_8 C++. I provide an XML file (containing an invalid
>>> tag) to the DOMBuilder parser. I then edit the DOM document which is
>>> generated and save the document back to the XML file. The content of this
>>> file is now truncated from the invalid tag onwards. Why does the parser
>>> modify the file while parsing? How do I prevent this? That is, I want the
>>> parser to report the error and continue parsing but not modify the XML content.
>>> The following is a snapshot of the XML file:-
>>> ...
>>> ...
>>> <Header id="My Project Id" nameStructure="DevName" revision="0"
>>> version="1">
>>>      ...
>>> </Header>
>>>      ...
>>>      ...
>>> <Services>
>>>      ...
>>>      ...
>>> </Services>
>>> <!-- Invalid tag: No node name -->
>>> <name="abc">
>>> ...
>>> ...
>>>  Following is the code snippet of the parser:-
>>> void CHelper::InitDOM()
>>> {
>>>         // m_pDomImpl is a pointer to DOMImplementation
>>>         m_pDomImpl = 0;
>>>         if(m_pDomImpl == NULL)
>>>         {
>>>               XMLPlatformUtils::Initialize();
>>>               m_pDomImpl = DOMImplementationRegistry::getDOMImplementation( gLS );
>>>          }
>>> }
>>>
>>> int CHelper::LoadFile(DOMBuilder** pParser, const CString& strXMLFile,
>>>        DOMDocument** pDoc, CStringArray& arrError, bool bValidate,
>>>        const CString& strSchemaFile)
>>> {
>>>        ...
>>>        if(*pParser == NULL)
>>>        {
>>>               *pParser = ((DOMImplementationLS*)m_pDomImpl)->createDOMBuilder(
>>>                      DOMImplementationLS::MODE_SYNCHRONOUS, 0 );
>>>                if((*pParser) ==NULL)
>>>               {
>>>                     return DOM_INITIALIZE_FAILED;
>>>               }
>>>
>>>               (*pParser)->setFeature( XMLUni::fgDOMNamespaces, true );
>>>               (*pParser)->setFeature( XMLUni::fgXercesSchema, true );
>>>               (*pParser)->setFeature( XMLUni::fgXercesSchemaFullChecking, true );
>>>               (*pParser)->setFeature( XMLUni::fgDOMValidation, true);
>>>               (*pParser)->setFeature( XMLUni::fgXercesCacheGrammarFromParse, true);
>>>        }
>>>
>>>        try
>>>        {
>>>               CMyDOMErrHandler eh;   // no parentheses: "eh()" would declare a function
>>>               m_arrValidationErrs.RemoveAll();
>>>
>>>               // parseURI is a blocking call. All the errors will be
>>>               // reported first if any error handler is set;
>>>               // only then will the next line be executed.
>>>               if(bValidate == true)
>>>              {
>>>                    (*pParser)->setErrorHandler(&eh);
>>>                    (*pParser)->loadGrammar( strSchemaFile, Grammar::SchemaGrammarType, true);
>>>              }
>>>              else
>>>              {
>>>                     (*pParser)->setErrorHandler(NULL);
>>>              }
>>>              *pDoc =(*pParser)->parseURI(strXMLFile);
>>>              ...
>>>              ...
>>>       }
>>>       catch(...)
>>>       {
>>>             ...
>>>       }
>>>
>>>       return SUCCESS;
>>>
>>> }
>>>
>>> Thank you in advance.
>>> Regards,
>>> Neetha
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
