[jira] Created: (XMLBEANS-295) setLoadStripWhitespace() api errors when trimming white space characters

David RR Webber (JIRA) Thu, 02 Nov 2006 09:11:42 -0800

setLoadStripWhitespace() api errors when trimming white space characters
------------------------------------------------------------------------


                 Key: XMLBEANS-295
                 URL: http://issues.apache.org/jira/browse/XMLBEANS-295
             Project: XMLBeans
          Issue Type: Bug
          Components: Validator
    Affects Versions: Version 2.2.1
         Environment: SunOS 5.9 and Microsoft Windows XP SP2, Java 1.4.2
            Reporter: David RR Webber
             Fix For: TBD


Situation Summary

We went into production using the setLoadStripWhitespace() API in XMLBeans. 
After some days we started getting intermittent failures on occasional XML 
transactions.

After a week of investigation, and having eliminated all other factors, we 
concluded that the flushText() method itself was the cause.  Specifically, we 
have determined that in character strings containing the & character the 
spaces immediately after the & are stripped - e.g. <company>B & H 
Photo</company> becomes <company>B &H Photo</company>.

We realize that there is a patch available for & processing - and we are 
currently testing it to see if it cures the problem relating to & 
(http://issues.apache.org/jira/browse/XMLBEANS-274 )

However, we are also seeing an intermittent problem in our UNIX environment 
associated with the colon : (it could be other characters as well - we do not 
have a definitive list). What we found is spaces being trimmed intermittently 
in various fields that do not contain "&" (the case originally reported in 
XMLBEANS-274).  We cannot reproduce this one on our Windows development 
systems - but it is happening intermittently on SunOS. 

Again, a space either immediately following the colon or later in the string 
is stripped - for tokenized elements - e.g. <urgent>Yes: Y</urgent> becomes 
<urgent>Yes:Y</urgent>, and the object then returns a NULL value because this 
is no longer a valid allowed value for the tokenized list. Similarly, 
<location>USA: United States</location> became <location>USA: 
UnitedStates</location>.  We suspect that some character before the colon 
might be triggering this behaviour, but we have not yet determined when or 
how.  This illustrates how complex this issue is given the current XMLBeans 
implementation approach.
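
To flag affected transactions while the root cause is still open, a check 
along the following lines may help.  It is only a sketch: the class and 
method names are hypothetical and are not part of XMLBeans.  It relies on the 
expectation that a correct strip only removes white space at the ends of a 
value, so the loaded value should equal the raw content after a plain trim().

```java
// Hypothetical diagnostic helper, not part of XMLBeans.
class StripCheck {

    /**
     * Returns true when 'loaded' differs from what an edge-only trim of
     * 'raw' would give, i.e. white space inside the value was altered.
     */
    static boolean interiorAltered(String raw, String loaded) {
        return !raw.trim().equals(loaded);
    }

    public static void main(String[] args) {
        // Edge-only trimming: expected behaviour.
        System.out.println(interiorAltered(" Yes: Y ", "Yes: Y"));
        // Space after the colon lost: the failure described above.
        System.out.println(interiorAltered("Yes: Y", "Yes:Y"));
    }
}
```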

Analysis

We have looked at how and where XMLBeans does the white space trim during 
the unmarshalling of the XML content.  When it detects white space, it then 
invokes a stripRight() method loop.  We are not convinced that this is 
architecturally sound at the point it is employed - it leads to complexity, 
many edge conditions, and some combinations of characters that are not 
handled consistently and correctly.

Our preferred approach would be to defer the white space trim until 
post-unmarshalling - so the initial process can treat the XML content "as is" 
between the angle brackets - and apply the trim only once the content has been 
extracted.  At that point a simple Java String trim() can be employed.  This 
could be provided as an alternative method call to the current 
setLoadStripWhitespace() API, one that iterates through the entire structure 
of objects instead of the original XML stream.  The only check that would be 
necessary is whether the XML markup itself sets the xml:space="preserve" 
attribute on an element - in which case the trim() would automatically be 
skipped for that content item.  What is happening right now is that the 
existing flushText() method is mixing up XML markup and content - instead 
there needs to be a clear separation between the markup (element angle 
brackets and attribute quotes) and the content itself.
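
The deferred trim described above could look roughly like the following. 
This is a minimal sketch under stated assumptions, not a proposed patch: it 
uses the JDK DOM parser in place of the XMLBeans store, the class and method 
names are hypothetical, and it only trims leaf elements (elements with no 
child elements), so mixed content is left alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; uses the JDK DOM, not XMLBeans internals.
class DeferredTrim {

    /** Parses the XML "as is", then trims content in a separate pass. */
    static Document parseThenTrim(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimLeaves(doc.getDocumentElement(), false);
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Trims leaf element content unless xml:space="preserve" is in effect. */
    static void trimLeaves(Element el, boolean inheritedPreserve) {
        boolean preserve = inheritedPreserve;
        String space = el.getAttribute("xml:space"); // "" when absent
        if ("preserve".equals(space)) {
            preserve = true;
        } else if ("default".equals(space)) {
            preserve = false;
        }

        boolean hasChildElement = false;
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                trimLeaves((Element) n, preserve);
            }
        }
        if (!hasChildElement && !preserve) {
            // Entities are already decoded here, so "&" is ordinary content.
            el.setTextContent(el.getTextContent().trim());
        }
    }

    static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parseThenTrim(
                "<order><company> B &amp; H Photo </company>"
                + "<note xml:space=\"preserve\">  keep  </note></order>");
        System.out.println("[" + text(doc, "company") + "]");
        System.out.println("[" + text(doc, "note") + "]");
    }
}
```

Because the trim only ever sees decoded content, characters such as & and : 
need no special handling.  Note that setTextContent() replaces all children 
of a leaf element, so comments inside a trimmed leaf would be dropped in this 
sketch.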

One caveat here - perhaps the current approach is intended to run before 
error checking on tokenized lists - to prevent failures there due to extra 
spaces?  Even so, it is not cleanly enough separated - and again it would 
clearly be simpler to use the Java String trim() method within the tokenized 
evaluation itself, on just the string.
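
As a rough illustration of normalizing inside the tokenized evaluation, the 
sketch below checks a value against its allowed list after normalizing only 
the string.  The names are hypothetical and the allowed values are assumed 
from the examples in this report ("No: N" is invented for illustration).  It 
goes slightly beyond a plain trim() by also collapsing interior runs of white 
space to a single space, which is the white space rule XML Schema defines for 
xs:token.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not the XMLBeans validator.
class TokenEnumCheck {

    // Assumed allowed values; "No: N" is made up for this example.
    static final Set<String> URGENT_VALUES =
            new HashSet<>(Arrays.asList("Yes: Y", "No: N"));

    /** Collapses runs of XML white space to one space and trims the ends. */
    static String collapse(String raw) {
        return raw.replaceAll("[ \\t\\r\\n]+", " ").trim();
    }

    /** Returns the matching allowed value, or null if the content is not allowed. */
    static String match(String raw, Set<String> allowed) {
        String value = collapse(raw);
        return allowed.contains(value) ? value : null;
    }

    public static void main(String[] args) {
        System.out.println(match("  Yes: Y \n", URGENT_VALUES));
        System.out.println(match("Yes:Y", URGENT_VALUES));
    }
}
```

Extra spaces around or inside the value no longer cause a failure, while a 
single interior space is never removed, so "Yes: Y" stays valid.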

Suggested Solution

Refactor the current setLoadStripWhitespace() API to delay string 
manipulation on content until after the content and XML markup have been 
unpacked - instead of before, as currently happens.  This makes for much 
simpler white space trim logic (it can simply use the Java String trim() 
method) that does not need to look for markup artifacts as well.

We are not clear on who owns this particular feature in XMLBeans, or whether 
they are currently available to assist on this - but we would be prepared to 
work with the team to develop a better solution here.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
