Hi,

I found a "feature" related to the SHA-1 message digest that is stored in 
XmlDocumentProperties when parsing an InputStream together with the 
LOAD_STRIP_WHITESPACE option. The digest seems to be calculated over the 
unstripped input XML, even though the parsed document is stripped.

This might be related to the use of DigestInputStream in the method "parse ( 
InputStream jiois, SchemaType type, XmlOptions options )" in the class 
SchemaTypeLoaderBase: a DigestInputStream updates the message digest with 
every byte that is read from it, regardless of whether that byte is stripped 
afterwards.
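
To illustrate the DigestInputStream behaviour in isolation, here is a minimal 
sketch that uses only the JDK. It is not the XMLBeans code: the class name 
DigestStreamDemo and the helper readAndStrip are made up for this example, and 
dropping whitespace bytes by hand is only a crude stand-in for what 
LOAD_STRIP_WHITESPACE does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class, not part of XMLBeans.
public class DigestStreamDemo {

    // Plain SHA-1 over a byte array.
    static byte[] sha1(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-1").digest(data);
    }

    // Reads all bytes through a DigestInputStream and discards whitespace
    // bytes AFTER they have been read (assumption: a crude stand-in for
    // whitespace stripping in a parser).
    // Returns { digest seen by the stream, bytes the consumer kept }.
    static byte[][] readAndStrip(byte[] data)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        try (InputStream in =
                 new DigestInputStream(new ByteArrayInputStream(data), md)) {
            int b;
            while ((b = in.read()) != -1) {
                if (!Character.isWhitespace(b)) {
                    kept.write(b);
                }
            }
        }
        return new byte[][] { md.digest(), kept.toByteArray() };
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "<doc>\n   <e1   />\n</doc>\n"
            .getBytes(StandardCharsets.UTF_8);
        byte[][] result = readAndStrip(original);
        System.out.println("stream digest == digest(original): "
            + Arrays.equals(result[0], sha1(original)));
        System.out.println("stream digest == digest(kept):     "
            + Arrays.equals(result[0], sha1(result[1])));
    }
}
```

The stream digest matches the digest of the original bytes and not the digest 
of the bytes that were kept, which is the same pattern as Digest 1, 2 and 3 in 
the sample further down.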

Other stripping XmlOptions might have this "feature" as well, although I 
haven't verified it.

The sample below shows the behaviour: Digest 1 and Digest 2 are equal, while 
Digest 3 differs. As I see it, Digest 2 and Digest 3 should be equal, and both 
should differ from Digest 1.


   String input = ""
      + "<!DOCTYPE doc [<!ATTLIST e9 attr CDATA \"default\">]>\n"
      + "<!-- Comment 2 --><doc>\n"
      + "   <e1   />\n"
      + "   <e2   ></e2>\n"
      + "   <e3    name = \"elem3\"   id=\"elem3\"    />\n"
      + "   <e4    name=\"elem4\"   id=\"elem4\"    ></e4>\n"
      + "   <e5 a:attr=\"out\" b:attr=\"sorted\" attr2=\"all\" attr=\"I'm\"\n"
      + "       xmlns:b=\"http://www.ietf.org\"\n"
      + "       xmlns:a=\"http://www.w3.org\"\n"
      + "       xmlns=\"http://example.org\"/>\n"
      + "   <e6 xmlns=\"\" xmlns:a=\"http://www.w3.org\">\n"
      + "       <e7 xmlns=\"http://www.ietf.org\">\n"
      + "           <e8 xmlns=\"\" xmlns:a=\"http://www.w3.org\">\n"
      + "               <e9 xmlns=\"\" xmlns:a=\"http://www.ietf.org\"/>\n"
      + "               <text>&#169;</text>\n"
      + "           </e8>\n"
      + "       </e7>\n"
      + "   </e6>\n"
      + "</doc><!-- Comment 3 -->\n";
                
   // Calculate digest over original message
   try {
      MessageDigest md = MessageDigest.getInstance("SHA1");
      DigestInputStream in = new DigestInputStream(
         new ByteArrayInputStream( input.getBytes() ), md);
      byte[] buffer = new byte[8192];       
      while (in.read(buffer) != -1) ;
      byte[] raw = md.digest();
      // Digest of original XML, including whitespace
      System.out.println( "Digest 1: " + new String( raw ) );
   } catch( Exception e ) {
       e.printStackTrace();
       System.exit( -1 );
   }
        
   // Parse XML with whitespace stripping and message digest options set
   XmlOptions options = new XmlOptions();
   options.setLoadStripWhitespace();
   options.setLoadMessageDigest();
   XmlObject xo = null;
   try {
      xo = XmlObject.Factory.parse(
         new ByteArrayInputStream( input.getBytes() ), options );
   } catch ( XmlException e ) {
      e.printStackTrace();
      System.exit(-1);
   } catch( IOException e ) {
      e.printStackTrace();
      System.exit(-1);
   }
   // Digest stored in the document properties of the parsed XML
   System.out.println( "Digest 2: "
      + new String( xo.documentProperties().getMessageDigest() ) );
        
   // Calculate digest over parsed XML  
   try {
      MessageDigest md = MessageDigest.getInstance("SHA1");
      DigestInputStream in = new DigestInputStream( xo.newInputStream(), md);
      byte[] buffer = new byte[8192];       
      while (in.read(buffer) != -1) ;
      byte[] raw = md.digest();
      // Digest of parsed XML, excluding whitespace
      System.out.println( "Digest 3: " + new String( raw ) );
   } catch( Exception e ) {
       e.printStackTrace();
       System.exit( -1 );
   }
   

An obvious workaround is to calculate the message digest manually after 
parsing. However, from a performance perspective it would be better to have 
the digest calculated during parsing, since otherwise you have to run over 
the XML twice.
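
For reference, the manual workaround could look roughly like the sketch below. 
It uses only the JDK; the class name PostParseDigest and its helpers are made 
up for this example. In the sample above, the stream passed in would be 
xo.newInputStream(); here a ByteArrayInputStream stands in for it so that the 
snippet runs on its own. The digest is printed as hex, which is easier to 
compare by eye than new String( raw ).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of XMLBeans.
public class PostParseDigest {

    // Computes SHA-1 over everything the given stream delivers.
    static byte[] sha1(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        return md.digest();
    }

    // Renders a digest as lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for xo.newInputStream() from the sample above.
        InputStream parsed = new ByteArrayInputStream(
            "<doc><e1/></doc>".getBytes(StandardCharsets.UTF_8));
        System.out.println("Digest 3: " + toHex(sha1(parsed)));
    }
}
```

This is the second pass over the XML that the workaround costs.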

What do you think: is this wanted or unwanted behaviour?
   
Cheers
>> Sami Mäkelä
Heimore Group


