Re: Status of == vs equals() RESULTS

2010-08-25 Thread eric fu
I agree with your decision based on the tests. Using == for string comparison
would be risky and would bring little gain.

Eric

On Tue, Aug 24, 2010 at 2:11 PM, Chad La Joie laj...@itumi.biz wrote:

 Okay, I'll prepare a patch for you by the end of the week.


 On 8/24/10 2:23 PM, Colm O hEigeartaigh wrote:

 Sounds fine to me.

 Colm.

 On Mon, Aug 23, 2010 at 8:55 PM, Chad La Joie laj...@itumi.biz wrote:

 Okay, getting back to this.

 I tried my tests again this time with:
  - a 7.5MB SAML metadata document (so lots of comparisons)
  - 100 warm up runs then 100 timed runs
  - an explicit GC between each run to keep it from happening during the
    runs, since the DOMs were so large

 No real difference in results. equals() was faster.

 So, at this point, I can't see any reason to do anything other than
 equals().  It's the actual correct way of doing the comparison in that it
 will always return the proper result and the JVM definitely seems to be
 optimizing its use.

 On 8/10/10 7:53 AM, Chad La Joie wrote:


 Okay, I certainly have a number of SAML documents lying around so I'll
 try with those as well. And, of course, I'll report back the results I
 get.

 On 8/10/10 4:46 AM, Raul Benito wrote:


 As the original author of the change from equals to == for interned
 namespaces, I can tell you that originally, on 1.4 and 1.5 and with my data
 (the verification of a SAML/Liberty AuthnReq in a multi-threaded test,
 with the old Juice JCE provider), the change was 10% to 20% faster.
 SAML is one of the real-world examples of signing, and it has some URLs
 with common prefixes and the same length.
 The Juice provider also helps to get rid of the signing/digest cost (a
 verification is two c14n operations: one of the signing part and one of the
 signature), but I think just a c14n is a good way of measuring it.
 Also take into account that the == vs equals debate is more of a memory
 workload and cache problem: if we have to iterate over every char again
 and again just to see that the strings are not equal, we thrash the cache.
 (That's why I used multiple threads, to simulate a server decoding
 requests with more or less the same code, but at different times and
 under different workloads.)
 Nevertheless, if you have tested with a more modern JRE and the code
 with .equals is behaving better, just go ahead and kiss goodbye to the ==.

 Clive, using .hashCode for strings in this case is not a big
 speed-up, as it is going to go through all the chars of the string,
 thrashing the cache again, and multiplying and adding the result into an
 integer, instead of failing at the first different char or just
 summarizing to a boolean.

 Regards,


 On Tue, Aug 10, 2010 at 2:37 AM, Clive Brettingham-Moore
 xml...@brettingham-moore.net
 wrote:

 Have to agree .equals is the way to go, since correctness of == is too
 reliant on what must be considered implementation optimisations in the
 parser.

 Benchmarking in JVM is notoriously difficult, but it does look like
 there is no gross difference, which should kill any objections to doing
 it correctly.

 Since I recently spent far too long researching this for an unrelated
 problem I'll add my 10c to the detailed discussion.

 On 10/08/10 01:23, Chad La Joie wrote:

  Not necessarily, there are a number of not-equal checks in there that
  should, in theory, perform better if you use == only. In such a
  case, the use of != will just be a single check while !equals() will
  result in a char-by-char comparison.


 Actually, the next thing String.equals tests is length equality - so
 character comparison will only be reached if the strings are the same
 length.

 Since the char by char comparison returns on the first mismatch, then
 only same length strings with shared prefixes will show the expected
 slowness. (namespace URIs are likely to share prefixes, but I think are
 not particularly likely to be the same length, unless actually
 equal)...
 thus String.equals is only likely to be slow where comparing long
 distinct but equal strings (so intern or alternative string pooling
 techniques needed for == benefit .equals without all the nasty
 loopholes: even if .equals is occasionally slow, at least it is always
 right).
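 The order of checks described above can be sketched as follows. This is an
 illustrative re-implementation, not the JDK source: the class and method
 names are made up for the example, and a real JVM may replace String.equals
 with an optimized intrinsic.

```java
// Hypothetical sketch of the short-circuit order discussed in the thread.
final class EqualsSketch {
    static boolean sketchEquals(String a, String b) {
        if (a == b) {
            return true;                 // 1. same object (e.g. both interned)
        }
        if (a == null || b == null) {
            return false;
        }
        if (a.length() != b.length()) {
            return false;                // 2. different lengths: no char scan
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;            // 3. stop at the first mismatch
            }
        }
        return true;                     // full scan only for equal content
    }
}
```

 Only the last case, same-length strings with a long shared prefix or equal
 content held in distinct objects, pays for a scan of every character.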

 In circumstances where repeated tests involve many length and prefix
 matches, adding a hash code inequality test ((s1.hashCode() ==
 s2.hashCode()) && s1.equals(s2)) could prevent practically all
 char-by-char checks for !equal cases (but if the same strings are never
 used repeatedly, the hash code calculation could be an issue; NB intern
 results in hash calculation for all strings anyway)... pooling is still
 needed to speed up matches for equality though.
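 A minimal sketch of the hash pre-check suggested above. The class and method
 names are hypothetical. It relies on String caching its hash code after the
 first call, so the pre-check is cheap only for strings that are compared
 repeatedly.

```java
// Hypothetical helper: reject most unequal strings by hash before equals().
final class HashPrecheck {
    static boolean sameString(String s1, String s2) {
        // Unequal hashes prove the strings differ; equal hashes prove nothing,
        // so equals() still has the final word.
        return s1.hashCode() == s2.hashCode() && s1.equals(s2);
    }
}
```

 For example, "Aa" and "BB" share the hash code 2112, so the pre-check lets
 them through and equals() then reports them as different.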

 Re VM options I would feel -server is definitely the right test bed,
 both because of the more aggressive JIT, and also because the code is
 likely to see heaviest real world cases in -server VMs.




 --
 Chad La Joie
 http://itumi.biz
 trusted identities, delivered



 --
 Chad La Joie
 http://itumi.biz
 trusted identities, delivered



Re: Status of == vs equals() RESULTS

2010-08-24 Thread Colm O hEigeartaigh
Sounds fine to me.

Colm.

On Mon, Aug 23, 2010 at 8:55 PM, Chad La Joie laj...@itumi.biz wrote:
 Okay, getting back to this.

 I tried my tests again this time with:
  - a 7.5MB SAML metadata document (so lots of comparisons)
  - 100 warm up runs then 100 timed runs
  - an explicit GC between each run to keep it from happening during the runs
 since the DOMs were so large

 No real difference in results. equals() was faster.

 So, at this point, I can't see any reason to do anything other than
 equals().  It's the actual correct way of doing the comparison in that it
 will always return the proper result and the JVM definitely seems to be
 optimizing its use.

 On 8/10/10 7:53 AM, Chad La Joie wrote:

 Okay, I certainly have a number of SAML documents lying around so I'll
 try with those as well. And, of course, I'll report back the results I
 get.

 On 8/10/10 4:46 AM, Raul Benito wrote:

 As the original author of the change from equals to == for interned
 namespaces, I can tell you that originally, on 1.4 and 1.5 and with my data
 (the verification of a SAML/Liberty AuthnReq in a multi-threaded test,
 with the old Juice JCE provider), the change was 10% to 20% faster.
 SAML is one of the real-world examples of signing, and it has some URLs
 with common prefixes and the same length.
 The Juice provider also helps to get rid of the signing/digest cost (a
 verification is two c14n operations: one of the signing part and one of the
 signature), but I think just a c14n is a good way of measuring it.
 Also take into account that the == vs equals debate is more of a memory
 workload and cache problem: if we have to iterate over every char again
 and again just to see that the strings are not equal, we thrash the cache.
 (That's why I used multiple threads, to simulate a server decoding
 requests with more or less the same code, but at different times and
 under different workloads.)
 Nevertheless, if you have tested with a more modern JRE and the code
 with .equals is behaving better, just go ahead and kiss goodbye to the ==.

 Clive, using .hashCode for strings in this case is not a big
 speed-up, as it is going to go through all the chars of the string,
 thrashing the cache again, and multiplying and adding the result into an
 integer, instead of failing at the first different char or just
 summarizing to a boolean.

 Regards,


 On Tue, Aug 10, 2010 at 2:37 AM, Clive Brettingham-Moore
 xml...@brettingham-moore.net
 wrote:

 Have to agree .equals is the way to go, since correctness of == is too
 reliant on what must be considered implementation optimisations in the
 parser.

 Benchmarking in JVM is notoriously difficult, but it does look like
 there is no gross difference, which should kill any objections to doing
 it correctly.

 Since I recently spent far too long researching this for an unrelated
 problem I'll add my 10c to the detailed discussion.

 On 10/08/10 01:23, Chad La Joie wrote:

  Not necessarily, there are a number of not-equal checks in there that
  should, in theory, perform better if you use == only. In such a
  case, the use of != will just be a single check while !equals() will
  result in a char-by-char comparison.

 Actually, the next thing String.equals tests is length equality - so
 character comparison will only be reached if the strings are the same
 length.

 Since the char by char comparison returns on the first mismatch, then
 only same length strings with shared prefixes will show the expected
 slowness. (namespace URIs are likely to share prefixes, but I think are
 not particularly likely to be the same length, unless actually equal)...
 thus String.equals is only likely to be slow where comparing long
 distinct but equal strings (so intern or alternative string pooling
 techniques needed for == benefit .equals without all the nasty
 loopholes: even if .equals is occasionally slow, at least it is always
 right).

 In circumstances where repeated tests involve many length and prefix
 matches, adding a hash code inequality test ((s1.hashCode() ==
 s2.hashCode()) && s1.equals(s2)) could prevent practically all
 char-by-char checks for !equal cases (but if the same strings are never
 used repeatedly, the hash code calculation could be an issue; NB intern
 results in hash calculation for all strings anyway)... pooling is still
 needed to speed up matches for equality though.

 Re VM options I would feel -server is definitely the right test bed,
 both because of the more aggressive JIT, and also because the code is
 likely to see heaviest real world cases in -server VMs.




 --
 Chad La Joie
 http://itumi.biz
 trusted identities, delivered



Re: Status of == vs equals() RESULTS

2010-08-24 Thread Chad La Joie

Okay, I'll prepare a patch for you by the end of the week.

On 8/24/10 2:23 PM, Colm O hEigeartaigh wrote:

Sounds fine to me.

Colm.

On Mon, Aug 23, 2010 at 8:55 PM, Chad La Joie laj...@itumi.biz wrote:

Okay, getting back to this.

I tried my tests again this time with:
  - a 7.5MB SAML metadata document (so lots of comparisons)
  - 100 warm up runs then 100 timed runs
  - an explicit GC between each run to keep it from happening during the runs
since the DOMs were so large

No real difference in results. equals() was faster.

So, at this point, I can't see any reason to do anything other than
equals().  It's the actual correct way of doing the comparison in that it
will always return the proper result and the JVM definitely seems to be
optimizing its use.

On 8/10/10 7:53 AM, Chad La Joie wrote:


Okay, I certainly have a number of SAML documents lying around so I'll
try with those as well. And, of course, I'll report back the results I
get.

On 8/10/10 4:46 AM, Raul Benito wrote:


As the original author of the change from equals to == for interned
namespaces, I can tell you that originally, on 1.4 and 1.5 and with my data
(the verification of a SAML/Liberty AuthnReq in a multi-threaded test,
with the old Juice JCE provider), the change was 10% to 20% faster.
SAML is one of the real-world examples of signing, and it has some URLs
with common prefixes and the same length.
The Juice provider also helps to get rid of the signing/digest cost (a
verification is two c14n operations: one of the signing part and one of the
signature), but I think just a c14n is a good way of measuring it.
Also take into account that the == vs equals debate is more of a memory
workload and cache problem: if we have to iterate over every char again
and again just to see that the strings are not equal, we thrash the cache.
(That's why I used multiple threads, to simulate a server decoding
requests with more or less the same code, but at different times and
under different workloads.)
Nevertheless, if you have tested with a more modern JRE and the code
with .equals is behaving better, just go ahead and kiss goodbye to the ==.

Clive, using .hashCode for strings in this case is not a big
speed-up, as it is going to go through all the chars of the string,
thrashing the cache again, and multiplying and adding the result into an
integer, instead of failing at the first different char or just
summarizing to a boolean.

Regards,


On Tue, Aug 10, 2010 at 2:37 AM, Clive Brettingham-Moore
xml...@brettingham-moore.net
wrote:

Have to agree .equals is the way to go, since correctness of == is too
reliant on what must be considered implementation optimisations in the
parser.

Benchmarking in JVM is notoriously difficult, but it does look like
there is no gross difference, which should kill any objections to doing
it correctly.

Since I recently spent far too long researching this for an unrelated
problem I'll add my 10c to the detailed discussion.

On 10/08/10 01:23, Chad La Joie wrote:


Not necessarily, there are a number of not-equal checks in there that
should, in theory, perform better if you use == only. In such a
case, the use of != will just be a single check while !equals() will
result in a char-by-char comparison.


Actually, the next thing String.equals tests is length equality - so
character comparison will only be reached if the strings are the same
length.

Since the char by char comparison returns on the first mismatch, then
only same length strings with shared prefixes will show the expected
slowness. (namespace URIs are likely to share prefixes, but I think are
not particularly likely to be the same length, unless actually equal)...
thus String.equals is only likely to be slow where comparing long
distinct but equal strings (so intern or alternative string pooling
techniques needed for == benefit .equals without all the nasty
loopholes: even if .equals is occasionally slow, at least it is always
right).

In circumstances where repeated tests involve many length and prefix
matches, adding a hash code inequality test ((s1.hashCode() ==
s2.hashCode()) && s1.equals(s2)) could prevent practically all
char-by-char checks for !equal cases (but if the same strings are never
used repeatedly, the hash code calculation could be an issue; NB intern
results in hash calculation for all strings anyway)... pooling is still
needed to speed up matches for equality though.

Re VM options I would feel -server is definitely the right test bed,
both because of the more aggressive JIT, and also because the code is
likely to see heaviest real world cases in -server VMs.






--
Chad La Joie
http://itumi.biz
trusted identities, delivered





--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-23 Thread Chad La Joie

Okay, getting back to this.

I tried my tests again this time with:
  - a 7.5MB SAML metadata document (so lots of comparisons)
  - 100 warm up runs then 100 timed runs
  - an explicit GC between each run to keep it from happening during 
the runs since the DOMs were so large


No real difference in results. equals() was faster.

So, at this point, I can't see any reason to do anything other than 
equals().  It's the actual correct way of doing the comparison in that 
it will always return the proper result and the JVM definitely seems to 
be optimizing its use.
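
The methodology listed above (warm-up runs, then timed runs, with a GC request
before each timed run) can be sketched roughly as below. This is not the
harness actually used for these tests; the class name, method name, and the
Runnable standing in for "canonicalize one DOM tree" are assumptions made for
illustration.

```java
// Hypothetical timing harness: warm up, then time each run with nanoTime.
final class TimingHarness {
    static long[] timeRuns(Runnable task, int warmups, int timedRuns) {
        for (int i = 0; i < warmups; i++) {
            task.run();                  // let the JIT compile the hot paths
        }
        long[] nanos = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            System.gc();                 // a request only; the JVM may ignore it
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;                    // one elapsed time per timed run
    }
}
```

With the figures from this message, the call would be
timeRuns(task, 100, 100), once for the == build and once for the equals()
build.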


On 8/10/10 7:53 AM, Chad La Joie wrote:

Okay, I certainly have a number of SAML documents lying around so I'll
try with those as well. And, of course, I'll report back the results I get.

On 8/10/10 4:46 AM, Raul Benito wrote:

As the original author of the change from equals to == for interned
namespaces, I can tell you that originally, on 1.4 and 1.5 and with my data
(the verification of a SAML/Liberty AuthnReq in a multi-threaded test,
with the old Juice JCE provider), the change was 10% to 20% faster.
SAML is one of the real-world examples of signing, and it has some URLs
with common prefixes and the same length.
The Juice provider also helps to get rid of the signing/digest cost (a
verification is two c14n operations: one of the signing part and one of the
signature), but I think just a c14n is a good way of measuring it.
Also take into account that the == vs equals debate is more of a memory
workload and cache problem: if we have to iterate over every char again
and again just to see that the strings are not equal, we thrash the cache.
(That's why I used multiple threads, to simulate a server decoding
requests with more or less the same code, but at different times and
under different workloads.)
Nevertheless, if you have tested with a more modern JRE and the code
with .equals is behaving better, just go ahead and kiss goodbye to the ==.

Clive, using .hashCode for strings in this case is not a big
speed-up, as it is going to go through all the chars of the string,
thrashing the cache again, and multiplying and adding the result into an
integer, instead of failing at the first different char or just
summarizing to a boolean.

Regards,


On Tue, Aug 10, 2010 at 2:37 AM, Clive Brettingham-Moore
xml...@brettingham-moore.net
wrote:

Have to agree .equals is the way to go, since correctness of == is too
reliant on what must be considered implementation optimisations in the
parser.

Benchmarking in JVM is notoriously difficult, but it does look like
there is no gross difference, which should kill any objections to doing
it correctly.

Since I recently spent far too long researching this for an unrelated
problem I'll add my 10c to the detailed discussion.

On 10/08/10 01:23, Chad La Joie wrote:

 Not necessarily, there are a number of not-equal checks in there that
 should, in theory, perform better if you use == only. In such a
 case, the use of != will just be a single check while !equals() will
 result in a char-by-char comparison.

Actually, the next thing String.equals tests is length equality - so
character comparison will only be reached if the strings are the same
length.

Since the char by char comparison returns on the first mismatch, then
only same length strings with shared prefixes will show the expected
slowness. (namespace URIs are likely to share prefixes, but I think are
not particularly likely to be the same length, unless actually equal)...
thus String.equals is only likely to be slow where comparing long
distinct but equal strings (so intern or alternative string pooling
techniques needed for == benefit .equals without all the nasty
loopholes: even if .equals is occasionally slow, at least it is always
right).

In circumstances where repeated tests involve many length and prefix
matches, adding a hash code inequality test ((s1.hashCode() ==
s2.hashCode()) && s1.equals(s2)) could prevent practically all
char-by-char checks for !equal cases (but if the same strings are never
used repeatedly, the hash code calculation could be an issue; NB intern
results in hash calculation for all strings anyway)... pooling is still
needed to speed up matches for equality though.

Re VM options I would feel -server is definitely the right test bed,
both because of the more aggressive JIT, and also because the code is
likely to see heaviest real world cases in -server VMs.






--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-13 Thread Colm O hEigeartaigh
I would prefer if we stuck to the original plan of making sure ==
comparisons are only done for namespaces in a single piece of
pluggable code. However, I think we should now revert to making the
.equals comparison as the default for the next release, given that
there is no compelling reason to do otherwise. Anyone who wants to
experiment with getting a performance increase, can just plug the
other piece of code in.

Thoughts?

Colm.


On Mon, Aug 9, 2010 at 11:07 PM, Chad La Joie laj...@itumi.biz wrote:
 I guess I didn't explicitly say this, but if, after a few days, people can't
 suggest an issue with this testing methodology or provide testing inputs
 that show different results, I'll rip out the helper class I added and just
 use equals() everywhere.  That'll make the code a lot nicer to read.

 On 8/9/10 10:19 AM, Chad La Joie wrote:

 So, I have some unexpected results from this work.

 I implemented a helper class that checked the equality of element local
 names, attribute local names, namespace URIs, and namespace prefixes
 (i.e. everything that Xerces always interns). Then I made sure to
 replace all == != and equals() that I could find with the appropriate
 call.

 To test, I picked the Canonicalizer20010315ExclusiveTest test case and
 made two alterations to the test22*excl methods:
 - do one c14n operation outside the timing loop just to make sure all the
 classes are in memory, constants are loaded, etc.
 - in a 100 iteration loop, create a new canonicalizer, canonicalize a
 DOM tree, and time it using nanosecond time

 I did this for the example2_2_1.xml[1], example2_2_2.xml[2], and
 example2_2_3.xml[3] input files (test221excl, test222excl, test223excl
 respectively).

 Here are the results, measured in nanosecond timing. total indicates
 the total time spent in all 100 runs, i.e. the summation of each of the
 100 results.

 test221excl:
         equals()   ==
 min     101000     99000
 max     123000     191000
 median  103000     105000
 avg     103760     106540
 total   10376000   10654000

 test222excl:
         equals()   ==
 min     99000      101000
 max     192000     128000
 median  10         108000
 avg     102110     108480
 total   10211000   10848000

 test223excl (an XPath nodeset canonicalization)
         equals()   ==
 min     254000     248000
 max     29         353000
 median  266000     265000
 avg     266820     265800
 total   26682000   2658
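
 The summary rows above (min, max, median, avg, total) can be derived from the
 raw per-run timings along these lines. This is a sketch only; the class and
 field names are hypothetical and it is not the code that produced the
 figures in this message.

```java
import java.util.Arrays;

// Hypothetical summary of a set of per-run timings in nanoseconds.
final class TimingSummary {
    final long min;
    final long max;
    final long total;
    final double median;
    final double avg;

    TimingSummary(long[] nanos) {
        long[] sorted = nanos.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        long sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        min = sorted[0];
        max = sorted[n - 1];
        total = sum;                     // summation of all runs
        avg = (double) sum / n;
        // Middle element, or the mean of the two middle elements.
        median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

 With 100 runs, total is simply avg multiplied by 100, which matches the
 rows that survived intact (for example 103760 and 10376000).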

 So, what these numbers appear to suggest is that, in fact, equals() is
 more often faster than ==. This seems counter-intuitive unless the JVM
 has specialized optimization for the String.equals() method.

 Can anyone see where my testing is likely to be flawed?

 [1] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_1.xml?revision=350494&view=markup
 [2] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_2.xml?revision=350494&view=markup
 [3] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_3.xml?revision=350915&view=markup


 On 8/2/10 10:11 AM, Chad La Joie wrote:

 So, while I don't have my access yet, Colm asked me if I'd take a look
 at the == vs equals() issue (relevant bugs: 40897[1], 45637[2], 46681[3])

 My executive summary is that clearly, as things stand, the current code
 favors optimization over correctness. Rarely is this a good thing.

 Colm notes[4] that the reliance on intern'ed strings (and thus the
 ability to use ==) occurs sporadically throughout the code and not just
 within the ElementChecker implementations. He specifically mentioned
 the various C14N implementations, and indeed == is used about 6
 times there for string comparison.

 My recommendation then is twofold:
 - Ensure that nothing other than namespace bits are compared via ==. I
 don't know that this occurs but the code should definitely be reviewed
 to ensure that.

 - Create a new NamespaceEqualityChecker that provides methods for
 checking the various bits of a namespace (URIs, prefixes) and use it
 anywhere that either == or equals() is used today. Implementations based
 on == and equals() would be provided with the default implementation
 being equals()-based. A configuration option should then be made
 available to control which impl gets used. Additionally, it might even
 be possible to add some smarts that could detect known good parsers
 that use interning and automatically use the == based implementation.
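
 One possible shape for the pluggable checker proposed above is sketched
 below. Only the name NamespaceEqualityChecker comes from this message; the
 method names, the two implementation classes, and the boolean switch standing
 in for the configuration option are assumptions made for illustration.

```java
// Strategy interface for comparing the "bits" of a namespace.
interface NamespaceEqualityChecker {
    boolean sameUri(String a, String b);
    boolean samePrefix(String a, String b);
}

// Default: always correct, whether or not the parser interns its strings.
final class EqualsBasedChecker implements NamespaceEqualityChecker {
    public boolean sameUri(String a, String b) {
        return a == null ? b == null : a.equals(b);
    }
    public boolean samePrefix(String a, String b) {
        return a == null ? b == null : a.equals(b);
    }
}

// Opt-in: only valid when every compared string is known to be interned.
final class IdentityBasedChecker implements NamespaceEqualityChecker {
    public boolean sameUri(String a, String b) {
        return a == b;
    }
    public boolean samePrefix(String a, String b) {
        return a == b;
    }
}

final class NamespaceCheckers {
    // Hypothetical stand-in for the configuration option mentioned above.
    static NamespaceEqualityChecker create(boolean parserInternsNames) {
        return parserInternsNames
                ? new IdentityBasedChecker()
                : new EqualsBasedChecker();
    }
}
```

 Callers would then use checker.sameUri(x, y) wherever == or equals() appears
 today, so the choice between the two lives in a single place.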

 I do not recommend changing any part of the code without addressing the
 whole codebase (i.e. all the =='s need to be fixed or no change should
 be made) because of the possibility of creating new, unwanted, effects.
 The current functionality is undesirable but better the devil you know.

 I think that this should be addressed in the upcoming 1.4.4 release. If
 quick consensus can be reached I'm willing to do the work with a window
 of time I have available over the next 2-3 weeks.

 [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=40897
 [2] 

Re: Status of == vs equals() RESULTS

2010-08-13 Thread eric fu
I encountered a problem before (version 1.4) caused by Apache Java code which
uses == for namespace comparison. In my own code, when adding a DOM node to a
document, I have to create the namespace using a string from the Apache classes.
That is, I cannot directly use "http://www.w3.org/2000/09/xmldsig#" as the
namespace String; instead I need to use APACHE.BlashClass.DSIG_URI. The bug is
not only hard to find, but unnecessarily ties unrelated DOM code to XML security.
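
A minimal illustration of this class of failure, assuming the namespace string
reaches the comparison as a distinct object (built or parsed at runtime) rather
than as the library's own constant. The class below is hypothetical and its
DSIG_URI field is only a stand-in for the library constant mentioned above.

```java
// Hypothetical demonstration: equal content, different objects.
final class InternPitfall {
    // Stand-in for a library-owned namespace constant.
    static final String DSIG_URI = "http://www.w3.org/2000/09/xmldsig#";

    static boolean matchesByIdentity(String ns) {
        return ns == DSIG_URI;           // true only for the very same object
    }

    static boolean matchesByEquals(String ns) {
        return DSIG_URI.equals(ns);      // true for any string with equal content
    }

    public static void main(String[] args) {
        // A string assembled at runtime is a new, non-interned object.
        String fromElsewhere = new StringBuilder("http://www.w3.org/2000/09/")
                .append("xmldsig#")
                .toString();
        System.out.println(matchesByIdentity(fromElsewhere)); // false
        System.out.println(matchesByEquals(fromElsewhere));   // true
    }
}
```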

Eric

On Fri, Aug 13, 2010 at 6:02 AM, Colm O hEigeartaigh cohei...@apache.org wrote:

 I would prefer if we stuck to the original plan of making sure ==
 comparisons are only done for namespaces in a single piece of
 pluggable code. However, I think we should now revert to making the
 .equals comparison as the default for the next release, given that
 there is no compelling reason to do otherwise. Anyone who wants to
 experiment with getting a performance increase, can just plug the
 other piece of code in.

 Thoughts?

 Colm.


 On Mon, Aug 9, 2010 at 11:07 PM, Chad La Joie laj...@itumi.biz wrote:
  I guess I didn't explicitly say this, but if, after a few days, people can't
  suggest an issue with this testing methodology or provide testing inputs
  that show different results, I'll rip out the helper class I added and just
  use equals() everywhere.  That'll make the code a lot nicer to read.
 
  On 8/9/10 10:19 AM, Chad La Joie wrote:
 
  So, I have some unexpected results from this work.
 
  I implemented a helper class that checked the equality of element local
  names, attribute local names, namespace URIs, and namespace prefixes
  (i.e. everything that Xerces always interns). Then I made sure to
  replace all == != and equals() that I could find with the appropriate
  call.
 
  To test, I picked the Canonicalizer20010315ExclusiveTest test case and
  made two alterations to the test22*excl methods:
  - do one c14n operation outside the timing loop just to make sure all the
  classes are in memory, constants are loaded, etc.
  - in a 100 iteration loop, create a new canonicalizer, canonicalize a
  DOM tree, and time it using nanosecond time
 
  I did this for the example2_2_1.xml[1], example2_2_2.xml[2], and
  example2_2_3.xml[3] input files (test221excl, test222excl, test223excl
  respectively).
 
  Here are the results, measured in nanosecond timing. total indicates
  the total time spent in all 100 runs, i.e. the summation of each of the
  100 results.
 
  test221excl:
          equals()   ==
  min     101000     99000
  max     123000     191000
  median  103000     105000
  avg     103760     106540
  total   10376000   10654000
 
  test222excl:
          equals()   ==
  min     99000      101000
  max     192000     128000
  median  10         108000
  avg     102110     108480
  total   10211000   10848000
 
  test223excl (an XPath nodeset canonicalization)
          equals()   ==
  min     254000     248000
  max     29         353000
  median  266000     265000
  avg     266820     265800
  total   26682000   2658
 
  So, what these numbers appear to suggest is that, in fact, equals() is
  more often faster than ==. This seems counter-intuitive unless the JVM
  has specialized optimization for the String.equals() method.
 
  Can anyone see where my testing is likely to be flawed?
 
  [1] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_1.xml?revision=350494&view=markup
  [2] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_2.xml?revision=350494&view=markup
  [3] http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_3.xml?revision=350915&view=markup
 
 
  On 8/2/10 10:11 AM, Chad La Joie wrote:
 
  So, while I don't have my access yet, Colm asked me if I'd take a look
  at the == vs equals() issue (relevant bugs: 40897[1], 45637[2], 46681[3])
 
  My executive summary is that clearly, as things stand, the current code
  favors optimization over correctness. Rarely is this a good thing.
 
  Colm notes[4] that the reliance on intern'ed strings (and thus the
  ability to use ==) occurs sporadically throughout the code and not just
  within the ElementChecker implementations. He specifically mentioned
  the various C14N implementations, and indeed == is used about 6
  times there for string comparison.
 
  My recommendation then is twofold:
  - Ensure that nothing other than namespace bits are compared via ==. I
  don't know that this occurs but the code should definitely be reviewed
  to ensure that.
 
  - Create a new NamespaceEqualityChecker that provides methods for
  checking the various bits of a namespace (URIs, prefixes) and use it
  anywhere that either == or equals() is used today. Implementations based
  on == and equals() would be provided with the default implementation
  being equals()-based. A configuration option should then be made
  available to control which impl gets used. Additionally, it might even
  be possible to add some smarts that could detect known good parsers
  that use 

Re: Status of == vs equals() RESULTS

2010-08-13 Thread eric fu
I was using the Xerces C DOM parser wrapped as a Java DOM. What I mean is that
the conventional equals() should be preferred, even though == might bring a
small performance gain.

Eric

On Fri, Aug 13, 2010 at 10:50 AM, Chad La Joie laj...@itumi.biz wrote:

 Which parser/DOM impl were you using?


 On 8/13/10 1:33 PM, eric fu wrote:

 I encountered a problem before (version 1.4) caused by Apache Java code
 which uses == for namespace comparison. In my own code, when adding a DOM
 node to a document, I have to create the namespace using a string from the
 Apache classes. That is, I cannot directly use
 "http://www.w3.org/2000/09/xmldsig#" as the namespace String; instead I need
 to use APACHE.BlashClass.DSIG_URI. The bug is not only hard to find, but
 unnecessarily ties unrelated DOM code to XML security.

 Eric

 On Fri, Aug 13, 2010 at 6:02 AM, Colm O hEigeartaigh
 cohei...@apache.org wrote:

I would prefer if we stuck to the original plan of making sure ==
comparisons are only done for namespaces in a single piece of
pluggable code. However, I think we should now revert to making the
.equals comparison as the default for the next release, given that
there is no compelling reason to do otherwise. Anyone who wants to
experiment with getting a performance increase, can just plug the
other piece of code in.

Thoughts?

Colm.


On Mon, Aug 9, 2010 at 11:07 PM, Chad La Joie laj...@itumi.biz wrote:
  I guess I didn't explicitly say this, but if, after a few days, people can't
  suggest an issue with this testing methodology or provide testing inputs
  that show different results, I'll rip out the helper class I added and just
  use equals() everywhere.  That'll make the code a lot nicer to read.
 
  On 8/9/10 10:19 AM, Chad La Joie wrote:
 
  So, I have some unexpected results from this work.
 
  I implemented a helper class that checked the equality of element local
  names, attribute local names, namespace URIs, and namespace prefixes
  (i.e. everything that Xerces always interns). Then I made sure to
  replace all == != and equals() that I could find with the appropriate
  call.
 
  To test, I picked the Canonicalizer20010315ExclusiveTest test case and
  made two alterations to the test22*excl methods:
  - do one c14n operation outside the timing loop just to make sure all the
  classes are in memory, constants are loaded, etc.
  - in a 100 iteration loop, create a new canonicalizer, canonicalize a
  DOM tree, and time it using nanosecond time
 
  I did this for the example2_2_1.xml[1], example2_2_2.xml[2], and
  example2_2_3.xml[3] input files (test221excl, test222excl, test223excl
  respectively).
 
  Here are the results, measured in nanosecond timing. total indicates
  the total time spent in all 100 runs, i.e. the summation of each of the
  100 results.
 
  test221excl:
          equals()   ==
  min     101000     99000
  max     123000     191000
  median  103000     105000
  avg     103760     106540
  total   10376000   10654000
 
  test222excl:
          equals()   ==
  min     99000      101000
  max     192000     128000
  median  10         108000
  avg     102110     108480
  total   10211000   10848000
 
  test223excl (an XPath nodeset canonicalization)
          equals()   ==
  min     254000     248000
  max     29         353000
  median  266000     265000
  avg     266820     265800
  total   26682000   2658
 
  So, what these numbers appear to suggest is that, in fact,
equals() is
  more often faster than ==. This seems counter-intuitive unless
the JVM
  has specialized optimization for the String.equals() method.
 
  Can anyone see where my testing is likely to be flawed?
 
  [1]
 
 

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_1.xml?revision=350494view=markup

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_1.xml?revision=350494view=markup
 
 
  [2]
 
 

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_2.xml?revision=350494view=markup

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_2.xml?revision=350494view=markup
 
 
  [3]
 
 

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_3.xml?revision=350915view=markup

 http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_3.xml?revision=350915view=markup
 
 
 
  On 8/2/10 10:11 AM, Chad La Joie wrote:
 
  So, while I don't have my access yet, Colm asked me if I'd take
a look
  at the == vs equals() issue (relevant bugs: 40897[1], 45637[2],
46681[3])
 
 

Re: Status of == vs equals() RESULTS

2010-08-10 Thread Clive Brettingham-Moore
Not to dispute your point, but more to clarify mine. Mostly I wanted to make
the minor note about the length test preventing most char-by-char
comparison (assuming intern or other canonicalization takes care of
equality, as in the rest of the discussion).


Hash code was an afterthought, which came to mind since I had recently
been researching string canonicalization alternatives to intern (e.g. via
a HashSet). I was only suggesting hashCode if *repeated* char-by-char
comparison of unequal strings is causing performance problems (the case
of same-length strings with a shared prefix was the most obvious; by the
sound of it, SAML may actually make this relevant here). The part
I apparently didn't emphasize enough is that yes, it only offers an
advantage if strings are used repeatedly in problem comparisons (or
hashCode has already been used). .hashCode is calculated lazily, so it will
only be calculated once per string (except for the unlikely case where
the hash code matches the sentinel value 0), so for repeated use over a
restricted set of strings the overhead can be amortized. Intern
internally calculates a hash code for strings, but AFAIK this is not
currently used to preset the cached code, so there will be a one-off hash
calculation and associated cache churn; non-intern
canonicalization using a hash table will already have cached the result, so
you get it for free.
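
For illustration, a minimal sketch of the kind of non-intern canonicalization mentioned above. The `StringPool` class and its `canonicalize` method are hypothetical names, not anything from the project, and a `HashMap` is assumed instead of a `HashSet` because a set cannot hand back the instance it already holds.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pool: returns one canonical instance per distinct string value. */
class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled instance equal to s, adding s itself if none exists yet. */
    String canonicalize(String s) {
        // The map lookup calls s.hashCode(), which String caches for later use.
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        String a = pool.canonicalize(new String("urn:example:ns"));
        String b = pool.canonicalize(new String("urn:example:ns"));
        System.out.println(a == b);      // true: both refer to the pooled instance
        System.out.println(a.equals(b)); // true
    }
}
```

This sketch is not thread-safe; a shared pool would need a concurrent map or external locking.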



Raul Benito wrote:

As the original author of the changes from equals to == for interned namespaces,
I can tell you that originally, on 1.4 and 1.5 and with my data (the
verification of a SAML/Liberty AuthnReq in a multi-threaded test, with the old
Juice JCE provider), the change was 10% to 20% faster.
SAML is one of the real-world examples of signing, and it has some URLs with
common prefixes and the same length.
The Juice provider also helps to get rid of the signing/digest cost (a
verification is two c14n operations: one of the signed part and one of the
signature), but I think just a c14n is a good way of measuring it.
Also take into account that the == vs equals debate is more of a memory
workload and cache problem: if we have to iterate over every char again and
again just to see that the strings are not equal, we thrash the cache. (That's
why I used multiple threads, to simulate a server decoding requests with more
or less the same code, but at different times and under different workloads.)
Nevertheless, if you have tested with a more modern JRE and the .equals code
is behaving better, just go ahead and kiss goodbye to the ==.

Clive, using .hashCode for strings in this case is not a big speed-up, as
it is going to go through all the chars of the string, thrashing the cache again,
and multiplying and adding the result into an integer, instead of failing at
the first different char or just reducing to a boolean.

Regards,



Re: Status of == vs equals() RESULTS

2010-08-09 Thread Chad La Joie

So, I have some unexpected results from this work.

I implemented a helper class that checked the equality of element local
names, attribute local names, namespace URIs, and namespace prefixes
(i.e. everything that Xerces always interns).  Then I made sure to
replace every ==, !=, and equals() that I could find with the appropriate call.


To test, I picked the Canonicalizer20010315ExclusiveTest test case and
made two alterations to the test22*excl methods:
  - do one c14n operation outside the timing loop just to make sure all the
classes are in memory, constants are loaded, etc.
  - in a 100 iteration loop, create a new canonicalizer, canonicalize a
DOM tree, and time it using nanosecond time
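
The two alterations above can be sketched roughly as follows. This is not the actual test code: `TimingHarness`, `Stats`, and the placeholder workload are assumptions made for illustration, and the real test canonicalized a DOM tree where the placeholder runs.

```java
import java.util.Arrays;

/** Hypothetical timing harness mirroring the steps above. */
class TimingHarness {
    /** Summary of per-run timings, all in nanoseconds. */
    static final class Stats {
        final long min, max, median, avg, total;

        Stats(long[] samples) {
            long[] sorted = samples.clone();
            Arrays.sort(sorted);
            long sum = 0;
            for (long s : sorted) sum += s;
            int n = sorted.length;
            min = sorted[0];
            max = sorted[n - 1];
            // For an even count, take the mean of the two middle samples.
            median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
            total = sum;
            avg = sum / n;
        }
    }

    /** Runs work once untimed, then times it for the given number of iterations. */
    static Stats time(Runnable work, int iterations) {
        work.run(); // untimed run so classes and constants are already loaded
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            work.run();
            samples[i] = System.nanoTime() - start;
        }
        return new Stats(samples);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for "create a canonicalizer and run it".
        Stats s = time(() -> "a:b:c".split(":"), 100);
        System.out.println("min " + s.min + " max " + s.max + " median " + s.median
                + " avg " + s.avg + " total " + s.total);
    }
}
```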


I did this for the example2_2_1.xml[1], example2_2_2.xml[2], and
example2_2_3.xml[3] input files (test221excl, test222excl, test223excl
respectively).


Here are the results, measured in nanosecond timing.  total indicates 
the total time spent in all 100 runs, i.e. the summation of each of the 
100 results.


test221excl:
          equals()    ==
min       101000      99000
max       123000      191000
median    103000      105000
avg       103760      106540
total     10376000    10654000

test222excl:
          equals()    ==
min       99000       101000
max       192000      128000
median    100000      108000
avg       102110      108480
total     10211000    10848000

test223excl (an XPath nodeset canonicalization)
          equals()    ==
min       254000      248000
max       290000      353000
median    266000      265000
avg       266820      265800
total     26682000    26580000

So, what these numbers appear to suggest is that, in fact, equals() is 
more often faster than ==.  This seems counter-intuitive unless the JVM 
has specialized optimization for the String.equals() method.


Can anyone see where my testing is likely to be flawed?

[1]
http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_1.xml?revision=350494&view=markup
[2]
http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_2.xml?revision=350494&view=markup
[3]
http://svn.apache.org/viewvc/xml/security/trunk/data/org/apache/xml/security/c14n/inExcl/example2_2_3.xml?revision=350915&view=markup


On 8/2/10 10:11 AM, Chad La Joie wrote:

So, while I don't have my access yet, Colm asked me if I'd take a look
at the == vs equals() issue (relevant bugs: 40897[1], 45637[2], 46681[3])

My executive summary is that clearly, as things stand, the current code
favors optimization over correctness. Rarely is this a good thing.

Colm notes[4] that the reliance on intern'ed strings (and thus the
ability to use ==) occurs sporadically throughout the code and not just
within the ElementChecker implementations. He specifically mentioned
the various C14N implementations, and indeed == is used about 6
times there for string comparison.

My recommendation then is twofold:
- Ensure that nothing other than namespace bits are compared via ==. I
don't know that this occurs but the code should definitely be reviewed
to ensure that.

- Create a new NamespaceEqualityChecker that provides methods for
checking the various bits of a namespace (URIs, prefixes) and use it
anywhere that either == or equals() is used today. Implementations based
on == and equals() would be provided with the default implementation
being equals()-based. A configuration option should then be made
available to control which impl gets used. Additionally, it might even
be possible to add some smarts that could detect known good parsers
that use interning and automatically use the == based implementation.
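
A rough sketch of what the second recommendation might look like. Only the name `NamespaceEqualityChecker` comes from the proposal above; the method names, the two implementation classes, and the system property are assumptions made up for illustration.

```java
/** Sketch of the proposed checker; the method names are assumptions. */
interface NamespaceEqualityChecker {
    boolean sameUri(String a, String b);
    boolean samePrefix(String a, String b);
}

/** Default: value comparison, correct whether or not the parser interns strings. */
class EqualsChecker implements NamespaceEqualityChecker {
    public boolean sameUri(String a, String b) {
        return a == null ? b == null : a.equals(b);
    }
    public boolean samePrefix(String a, String b) {
        return sameUri(a, b);
    }
}

/** Opt-in: reference comparison, only valid when the parser interns these strings. */
class IdentityChecker implements NamespaceEqualityChecker {
    public boolean sameUri(String a, String b) { return a == b; }
    public boolean samePrefix(String a, String b) { return a == b; }
}

class Checkers {
    /** Hypothetical system property selecting the == based implementation. */
    static final String PROPERTY = "example.namespace.identityCompare";

    /** Returns the equals()-based checker unless the property is set to "true". */
    static NamespaceEqualityChecker configured() {
        return Boolean.getBoolean(PROPERTY) ? new IdentityChecker() : new EqualsChecker();
    }

    public static void main(String[] args) {
        NamespaceEqualityChecker checker = configured();
        String uri = "http://www.w3.org/2000/09/xmldsig#";
        // With the default checker this prints true even for a non-interned copy.
        System.out.println(checker.sameUri(uri, new String(uri)));
    }
}
```

Keeping the equals()-based checker as the default means a parser that does not intern its strings still gets correct results, and the == variant stays an explicit opt-in.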

I do not recommend changing any part of the code without addressing the
whole codebase (i.e. all the =='s need to be fixed or no change should
be made) because of the possibility of creating new, unwanted, effects.
The current functionality is undesirable but better the devil you know.

I think that this should be addressed in the upcoming 1.4.4 release. If
quick consensus can be reached I'm willing to do the work with a window
of time I have available over the next 2-3 weeks.

[1] https://issues.apache.org/bugzilla/show_bug.cgi?id=40897
[2] https://issues.apache.org/bugzilla/show_bug.cgi?id=45637
[3] https://issues.apache.org/bugzilla/show_bug.cgi?id=46681
[4] https://issues.apache.org/bugzilla/show_bug.cgi?id=45637#c1


--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Raul Benito
Hello Chad,
What command line options did you use?
My tests were more reliable when I used 100 warm-up runs to let the JIT work
its magic, and then ran the timed test.
Also, are you running both tests in the same invocation? If you do, the second
will be handicapped: the first one will simply be inlined, while the second will
have a switch to check whether it is one implementation or the other.

Regards,

Raul




Re: Status of == vs equals() RESULTS

2010-08-09 Thread Chad La Joie



On 8/9/10 10:40 AM, Raul Benito wrote:

What command line options did you use?


No options.


My tests were more reliable when I used 100 warm-up runs to let the JIT work
its magic, and then ran the timed test.


Okay, I'll try that.


Also, are you running both tests in the same invocation? If you do, the
second will be handicapped: the first one will simply be inlined, while the
second will have a switch to check whether it is one implementation or the other.


No, each run was in a clean JVM.

--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Chad La Joie



On 8/9/10 10:45 AM, Chad La Joie wrote:

My tests were more reliable when I used 100 warm-up runs to let the JIT work
its magic, and then ran the timed test.


Okay, I'll try that.


It made no difference.

--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Raul Benito
On Mon, Aug 9, 2010 at 4:45 PM, Chad La Joie laj...@itumi.biz wrote:



 On 8/9/10 10:40 AM, Raul Benito wrote:

 What command line options did you use?


 No options.

 I did mine with -server and sometimes with more memory, but it is really
strange. What version of the JRE are you using?

Regards,






RE: Status of == vs equals() RESULTS

2010-08-09 Thread Pellerin, Clement
In JDK 1.5, String.equals() begins with:

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        ...

Since String is a final class, the JIT compiler is free to inline
String.equals().
This is such a common case that I bet the JIT compiler team made it a special
case to inline at least the beginning of String.equals() at every invocation site.

If your test bed only uses interned Strings, this will return early with the
same behavior as == for equal strings.
Is it possible your test bed calls String.equals() with an overwhelming
percentage of equal strings?
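
A small demonstration of the difference being discussed, using only standard `java.lang.String` behavior. The `InternDemo` class and its `sameObject` helper are illustrative names, not project code.

```java
class InternDemo {
    /** Reference comparison: true only when both variables point at the same object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "http://www.w3.org/2000/09/xmldsig#"; // string literals are interned
        String copy = new String(literal);                       // distinct object, same chars

        System.out.println(sameObject(literal, copy));           // false
        System.out.println(literal.equals(copy));                // true
        System.out.println(sameObject(literal, copy.intern()));  // true
    }
}
```

When both operands are interned and equal, equals() returns at its first identity check, which is the early exit described above.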


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Chad La Joie



On 8/9/10 11:10 AM, Raul Benito wrote:

I did mine with -server and sometimes with more memory, but it is really
strange. What version of the JRE are you using?


What optimizations in particular did you want to take advantage of by using
-server?


Did you see anything to suggest that it was running out of memory?
Those test files shouldn't produce anything that would use up the default
amount of memory.


I'm using Apple's repackage of Sun JDK 1.6.0_20, 64 bit

--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Chad La Joie
I guess I didn't explicitly say this, but if, after a few days, people 
can't suggest an issue with this testing methodology or provide testing 
inputs that show different results, I'll rip out the helper class I 
added and just use equals() everywhere.  That'll make the code a lot 
nicer to read.



--
Chad La Joie
http://itumi.biz
trusted identities, delivered


Re: Status of == vs equals() RESULTS

2010-08-09 Thread Clive Brettingham-Moore
Have to agree .equals is the way to go, since correctness of == is too
reliant on what must be considered implementation optimisations in the
parser.

Benchmarking in JVM is notoriously difficult, but it does look like
there is no gross difference, which should kill any objections to doing
it correctly.

Since I recently spent far too long researching this for an unrelated
problem, I'll add my 10c to the detailed discussion.

On 10/08/10 01:23, Chad La Joie wrote:

 Not necessarily, there are a number of not-equal checks in there that
 should, in theory, perform better if you use == only.  In such a
 case, the use of != will just be a single check while !equals() will
 result in a char-by-char comparison.

Actually, the next thing String.equals tests is length equality - so
character comparison will only be reached if the strings are the same
length.
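
To make the order of checks concrete, here is an illustrative re-implementation. It is not the JDK source; `EqualsOrderSketch` and `sameValue` are made-up names, and the null handling is an addition for a standalone static method.

```java
class EqualsOrderSketch {
    /** Illustrative only: follows the check order described above. */
    static boolean sameValue(String a, String b) {
        if (a == b) return true;                     // 1. identity: interned equal strings stop here
        if (a == null || b == null) return false;
        if (a.length() != b.length()) return false;  // 2. length: most unequal strings stop here
        for (int i = 0; i < a.length(); i++) {       // 3. chars: stops at the first mismatch
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameValue("urn:a", "urn:ab"));           // false, decided by length
        System.out.println(sameValue("urn:ab", "urn:ac"));          // false, decided at the last char
        System.out.println(sameValue("urn:ab", new String("urn:ab"))); // true, after a full scan
    }
}
```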

Since the char-by-char comparison returns on the first mismatch,
only same-length strings with shared prefixes will show the expected
slowness (namespace URIs are likely to share prefixes, but I think are
not particularly likely to be the same length unless actually equal).
Thus String.equals is only likely to be slow when comparing long strings
that are distinct objects but equal in value (so the intern or alternative
string pooling techniques needed for == also benefit .equals, without all
the nasty loopholes: even if .equals is occasionally slow, at least it is
always right).

In circumstances where you are doing repeated tests with many length and prefix
matches, adding a hash code inequality test ((s1.hashCode() ==
s2.hashCode()) && s1.equals(s2)) could prevent practically all
char-by-char checks for !equal cases (but if the same strings are never
used repeatedly, the hash code calculation could be an issue; NB intern
results in hash calculation for all strings anyway). Pooling is still
needed to speed up matches for equality, though.
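
The hash code pre-check above, written out as a small sketch. `HashPrecheck` is a hypothetical helper; it assumes both arguments are non-null.

```java
class HashPrecheck {
    /**
     * Compares cached hash codes first; equals() only runs when they match.
     * A matching hash does not prove equality, so equals() is still required.
     */
    static boolean sameValue(String s1, String s2) {
        return s1.hashCode() == s2.hashCode() && s1.equals(s2);
    }

    public static void main(String[] args) {
        String a = "http://www.w3.org/2000/09/xmldsig#";
        String b = "http://www.w3.org/2001/04/xmlenc##"; // same length, shared prefix

        System.out.println(sameValue(a, new String(a))); // true
        System.out.println(sameValue(a, b));             // false
        // "Aa" and "BB" share the hash code 2112, so equals() decides.
        System.out.println(sameValue("Aa", "BB"));       // false
    }
}
```

The first call to hashCode() on each string still walks every char, so the pre-check only pays off when the same strings are compared repeatedly.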

Re VM options I would feel -server is definitely the right test bed,
both because of the more aggressive JIT, and also because the code is
likely to see heaviest real world cases in -server VMs.