[android-developers] Re: SAXParser reports different qName on SDK 0.9 from SDK 1.0

Chris Cicc Thu, 02 Oct 2008 12:36:21 -0700

Hey Charlie and Brad,
Good news! I now have the ampersand being parsed correctly. However,
the change needed wasn't what we expected. Every time I changed the
raw text in the database from the '&' character to the escaped '&amp;'
or '&#038;', it didn't work; it would still break at that first
ampersand.


After doing some more research, I came across this:
http://java.sun.com/j2ee/1.4/docs/tutorial/doc/JAXPSAX3.html

In that document, it says:

"Note: To be strictly accurate, the character handler should scan the
buffer for ampersand characters (&) and left-angle bracket characters
(<) and replace them with the strings &amp; or &lt;, as appropriate.
You'll find out more about that kind of processing when we discuss
entity references in Displaying Special Characters and CDATA."
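
For illustration, the scan-and-replace step that note describes could
look roughly like the Java sketch below. The class and method names
(XmlEscape, escapeText) are made up for this example and do not come
from the tutorial, and it handles only the two characters the note
mentions.

```java
class XmlEscape {
    // Replace '&' and '<' with their predefined XML entity references so the
    // result can be written out as element text. All other characters are
    // copied through unchanged.
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Tom &amp; Jerry &lt;live>
        System.out.println(escapeText("Tom & Jerry <live>"));
    }
}
```

Because '&' is handled one character at a time, the '&' inside an
emitted "&amp;" is never escaped a second time.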

I was working on implementing that when I read further about CDATA
sections. I decided to try them by calling
XmlDocument.CreateCDataSection instead of the
XmlDocument.CreateTextNode method I'm currently using in my .NET web
service. Without my having to modify the SAXParser code at all, it
worked with the new CDATA section!
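
As a rough sketch of the receiving side (not my actual handler; the
class name CDataDemo and the helper textOf are invented for this
example), the Java below parses a CDATA section and an escaped
ampersand with the standard javax.xml.parsers SAX API. One thing worth
knowing here: the org.xml.sax documentation allows a parser to deliver
one run of text through several characters() calls, so a handler that
keeps only the latest chunk can look as if it cut the text off at an
ampersand. The sketch appends every chunk to a StringBuilder to be
safe. Whether that is what happened in my case is only a guess.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CDataDemo {
    // Returns all character data found in the document, concatenated.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per text run, so append
                // each chunk instead of overwriting the previous one.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // CDATA content is passed through literally, markup characters included.
        System.out.println(textOf("<name><![CDATA[Tom & Jerry <live>]]></name>"));
        // An escaped ampersand is decoded by the parser before it reaches the handler.
        System.out.println(textOf("<name>Tom &amp; Jerry</name>"));
    }
}
```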

So what did I learn?
1. The SAXParser does indeed break like this by design.
2. The Android betas apparently did not implement a properly spec'ed
SAXParser.
3. The SAXParser may be lightweight, but it comes at the cost of
parsing robustness. For instance, the built-in .NET parser does not
have this issue. It simply reads the node, then everything after the
node until it reaches an end node. It's smart enough to detect full
nodes on their own, without simply assuming anything after '<' is a
new node like SAX does.
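
To make point 1 concrete, here is a small Java check using the desktop
JDK's SAX parser (I have not run it on Android, so take that part as
an assumption). The class name UnescapedAmpersandDemo and the helper
isWellFormed are invented for the example. A raw '&' or '<' inside
element text is rejected as not well-formed, while the escaped forms
parse cleanly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class UnescapedAmpersandDemo {
    // Returns true if the parser accepts the document, false if it
    // reports a SAX error such as a well-formedness violation.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with a default parser.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name>Tom & Jerry</name>"));      // raw ampersand
        System.out.println(isWellFormed("<name>Tom &amp; Jerry</name>"));  // named entity
        System.out.println(isWellFormed("<name>Tom &#038; Jerry</name>")); // numeric reference
        System.out.println(isWellFormed("<name>a < b</name>"));            // raw left angle bracket
    }
}
```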

Going forward, I'm going to keep using CDATA sections, and look to
replace the parser if needed in the future.

As a developer, I'm really disappointed a Google rep didn't chime in
on this conversation. I'm used to having everything posted at
forums.asp.net read by Microsoft devs. But I appreciate all the help
of fellow community members like yourselves!

-chris

On Oct 2, 7:24 am, Charlie Collins <[EMAIL PROTECTED]> wrote:
> It's just &#038;, or &amp;
>
> http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_re...
>
> The & and the ; delimit the entity.
>
> But Chris, your XML in your source example there can't have an
> ampersand there like that.  You need to be using the escape/encoding.
> If everything in the chain supports UTF-8 you can use &amp;, if not,
> use the numerical entity version, &#038;.
>
> Again, this is a different topic than the differences in the parsing,
> but every XML processor I have ever seen will blow up on a non-escaped
> ampersand.
>
> "Characters (or code points in Unicode terminology) outside the simple
> ASCII range 32-127 (&#x20; to &#x7F;) must either be encoded as multi-
> byte UTF-8 sequences or using numerical entities. In environments that
> do not natively support UTF-8 it is often easier to use numerical
> entities"
>
> For example, the XML I am using is coming from Google Base - it is
> UTF-8, but you STILL have to use the encoding to escape the special
> chars:
>
> <?xml version='1.0' encoding='UTF-8'?>
> <feed xmlns='http://www.w3.org/2005/Atom'
>         xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'
>         xmlns:gm='http://base.google.com/ns-metadata/1.0'
>         xmlns:g='http://base.google.com/ns/1.0'
>         xmlns:batch='http://schemas.google.com/gdata/batch'>
>         <id>http://www.google.com/base/feeds/snippets
>         </id>
>         <updated>2008-09-29T18:18:13.843Z</updated>
>         <title type='text'>Items matching query: ([review
>                 type:restaurant][location:Atlanta, GA]) [item type == "reviews"]
>         </title>
>         <link rel='alternate' type='text/html' href='http://base.google.com'/>
>         <link rel='http://schemas.google.com/g/2005#feed'
>                 type='application/atom+xml'
>                 href='http://www.google.com/base/feeds/snippets'/>
>         <link rel='http://schemas.google.com/g/2005#batch'
>                 type='application/atom+xml'
>                 href='http://www.google.com/base/feeds/snippets/batch'/>
>         <link rel='self' type='application/atom+xml'
>                 href='http://www.google.com/base/feeds/snippets/-/reviews?start-index=1&amp;max-results=8&amp;bq=%5Breview+type%3Arestaurant%5D%5Blocation%3AAtlanta%2C+GA%5D' />
>         <link rel='next' type='application/atom+xml'
>                 href='http://www.google.com/base/feeds/snippets/-/reviews?start-index=9&amp;max-results=8&amp;bq=%5Breview+type%3Arestaurant%5D%5Blocation%3AAtlanta%2C+GA%5D' />
>         <author>
>                 <name>Google Inc.</name>
>                 <email>[EMAIL PROTECTED]</email>
>         </author>
>         <generator version='1.0' uri='http://base.google.com'>GoogleBase</generator>
>         <openSearch:totalResults>199</openSearch:totalResults>
>         <openSearch:startIndex>1</openSearch:startIndex>
>         <openSearch:itemsPerPage>8</openSearch:itemsPerPage>
>         <entry>
> . . . . .
>
> On Oct 1, 7:18 pm, "Brad Gies" <[EMAIL PROTECTED]> wrote:
>
> > Charlie,
>
> > Yes, I think we are saying ALMOST the same thing. But, I don't think &#038;
> > is the Escaped Ampersand. I think it's just the Ampersand, and that's why
> > it's causing the problem.
>
> > As I say, I'm not a Unicode expert, but I think the proper sequence for an
> > escaped ampersand would be : &#038; &#038; I think that's how an escaped
> > ampersand would look in UTF-8. The ampersand escaping the ampersand :). Or,
> > of course the &amp;
>
> > Sorry, I can't try it right now, but I'm interested to know if it works.
> > When I have time, I'll build an app to check it.
>
> > Sincerely,
>
> > Brad Gies
>
> > -----------------------------------------------------------------
> > Brad Gies
> > 27415 Greenfield Rd, # 2,
> > Southfield, MI, USA
> > 48076
> > www.bgies.com www.truckerphone.com www.EDI-Easy.com www.pricebunny.com
> > -----------------------------------------------------------------
>
> > Moderation in everything, including abstinence
>
> > -----Original Message-----
> > From: android-developers@googlegroups.com
> > [mailto:[EMAIL PROTECTED] On Behalf Of Chris Cicc
> > Sent: Tuesday, September 30, 2008 10:10 AM
> > To: Android Developers
> > Subject: [android-developers] Re: SAXParser reports different qName on SDK
> > 0.9 from SDK 1.0
>
> > Hey Brad,
> > Just to be sure, I tested it out and manually typed "&amp;" into the
> > source for the web service. I didn't expect this to work, because even
> > manually typing it in still leads to each character being encoded.
>
> > In the quote you provided it says "they MUST be escaped using either
> > numeric character references...". UTF-8 (and all Unicode) encoding
> > does just that :) The '&' is number 38.
>
> > On the other hand, I also tested the bracket characters < and >. Both
> > cause the same issue as the & character. Other brackets such as [ and
> > { and ( do not cause an issue.
>
> > So clearly this does have something to do with the SAXParser in
> > Android handling the special XML characters. I have never used
> > SAXParser outside of Android so I cannot say whether or not it is any
> > different. But I can confirm that this did not happen in 0.9 and I am
> > 99% confident it should not be happening at all.
>
> > Thanks,
> > Chris