On Mon, Jan 17, 2011 at 9:53 AM, porneL <[email protected]> wrote:

> On Mon, 17 Jan 2011 01:45:43 -0000, Miller Medeiros <
> [email protected]> wrote:
>
>> I still believe that this analogy fits well. XML is stricter than HTML
>> and has simpler rules (all tags open and close in a sane order), and
>> because of that it is easier to parse.
>>
>
> A little off-topic: I've been implementing my own HTML and XML parsers, and
> I don't agree that XML is easier to parse.
>
> The seemingly magic rules for optional tags in HTML are actually very
> simple to implement, and you can hardcode them instead of using a real DTD.
>
> Handling of empty elements is a matter of looking up the tag name in a
> fixed list vs. two extra states in an XML parser; it's not very different
> in complexity. Optionally closed tags are a piece of cake to implement too
> (basically you implement part of XML error handling, except the line that
> stops the parser).
>
> XML has huge additional complexity. Before you even start, you need to
> write an SGML DTD parser and fetch half a dozen files in order to be able
> to parse a typical XHTML file. The syntax is additionally complicated by
> allowing infinitely nested entities containing markup and namespace
> indirection. Even XML's strict error handling is not helpful, because these
> are extra code paths and strict behaviors you have to add to the parser.
>
>
I totally disagree. I was only talking about returning the content of a
node and its attributes, nothing about DTDs, schemas or error handling. That
kind of parsing is easier: you can go char by char (or use a RegExp)
matching opening tags and wait until you find the matching closing tag;
everything in between is the content of that node, nothing hard-coded. In
HTML certain tags will auto-close the parent node, so you need to know them
beforehand and hard-code these values (into a hash table, array or something
like that).

// simple example of retrieving node content
var xmlString = '<xml>dolor sit amet <tag>lorem ipsum</tag>' +
  '<anotherTag>maecennas</anotherTag></xml>';
function getNodeContent(nodeName, xmlString){
  // non-greedy [\s\S]*? stops at the first closing tag and also matches newlines
  var regexp = new RegExp('<'+ nodeName +'>([\\s\\S]*?)<\\/'+ nodeName +'>');
  var match = regexp.exec(xmlString);
  return match ? match[1] : null; // null when the node is not found
}
console.log( getNodeContent('tag', xmlString) ); // will output "lorem ipsum"
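
For contrast, here is a rough sketch of the HTML case described above, where
an opening tag can implicitly close elements that are still open. The table
AUTO_CLOSES and the function openTagWithAutoClose are names made up for this
illustration, and the table is a heavily abbreviated assumption, nowhere near
the full set of HTML rules:

```javascript
// Hypothetical, heavily abbreviated table: an opening tag (key) implicitly
// closes a still-open element of any type listed in its value.
var AUTO_CLOSES = {
  li: ['li'],
  p: ['p'],
  dt: ['dt', 'dd'],
  dd: ['dt', 'dd'],
  tr: ['tr', 'td', 'th'],
  td: ['td', 'th'],
  th: ['td', 'th']
};

// Returns a new stack of open elements after opening tagName.
// (Illustrative helper, not part of any library.)
function openTagWithAutoClose(stack, tagName) {
  var closes = AUTO_CLOSES[tagName] || [];
  var result = stack.slice();
  // pop while the innermost open element is auto-closed by tagName
  while (result.length && closes.indexOf(result[result.length - 1]) !== -1) {
    result.pop();
  }
  result.push(tagName);
  return result;
}

console.log(openTagWithAutoClose(['ul', 'li'], 'li'));          // ['ul', 'li']
console.log(openTagWithAutoClose(['table', 'tr', 'td'], 'tr')); // ['table', 'tr']
```

The lookup itself is short, but the table has to be known beforehand, which
is the hard-coded part that the XML version does not need.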

It was just to explain that stricter rules can reduce complexity in some
cases, since you can "ignore" edge cases. I thought XML being easier to
parse than HTML was common sense... I'm not going to keep discussing
XML/HTML complexity on a JS list.

PS: one of the reasons why JSON is so strict is to avoid ambiguity and make
it easier to parse...
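
A small sketch of what that strictness looks like in practice, assuming an
environment with the native JSON.parse (the helper isValidJSON and the sample
strings are made up for illustration). Everything a lenient parser would have
to guess about is simply rejected:

```javascript
// Inputs that look like JavaScript object/array literals; only the first
// one is valid JSON.
var samples = [
  '{"a": 1}',   // valid JSON
  "{'a': 1}",   // single-quoted key
  '{a: 1}',     // unquoted key
  '{"a": 1,}',  // trailing comma in object
  '[1, 2, ]'    // trailing comma in array
];

// Illustrative helper: true when JSON.parse accepts the text.
function isValidJSON(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    return false; // JSON.parse throws a SyntaxError on invalid input
  }
}

var results = samples.map(isValidJSON);
console.log(results); // [true, false, false, false, false]
```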

cheers.
