Thanks Tom. I kept plugging away and came up with what I believe is a working script. It's going to take some time for me to digest what you've written, so I'll play around with your version tomorrow. I really appreciate your help! I've already applied your note 1 :-) Here's what I came up with right before you sent your reply.
page: read http://www.rebol.com ; webpage to be parsed

title: copy ""
description: copy []
keywords: copy []

parse page [thru <title> copy title to </title>]
print title

; grab everything between content= and the closing ">" of the keywords tag
parse page [
    thru "<meta name=^"keywords^" content="
    copy keywords to ">"
]
either not none? (find/last keywords "/") [
    ; XHTML-style tag: drop the trailing "/" left over from "/>"
    keywords: tail keywords
    keywords-tail: skip keywords -1
    if keywords-tail = "/" [keywords: remove keywords-tail]
    print head keywords
][
    either empty? keywords [print "blank"] [print keywords]
]

; same again for the description tag
parse page [
    thru "<meta name=^"description^" content="
    copy description to ">"
]
either not none? (find/last description "/") [
    description: tail description
    description-tail: skip description -1
    if description-tail = "/" [description: remove description-tail]
    print head description
][
    either empty? description [print "blank"] [print description]
]

===============================================

----- Original Message -----
From: "Tom" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 11, 2008 11:32 PM
Subject: [REBOL] Re: How to properly parse HTML and XHTML Meta Tags

> Hi Von welcome,
>
> note 1: when you initialize words with empty strings or blocks
> you *do* want to copy the empty string or block.
> (otherwise they can be the *same* empty block or string)
>
> title: copy ""
> description: copy []
> keywords: copy []
>
>
> note 2: when using parse for more than simple string splitting, get
> used to using the /all refinement and handling white space yourself.
>
> you could define a class of chars that are not "/" or ">", then copy
> some of them. downside is you would have to check whether a "/" you ran
> into was followed by ">" and, if not, concatenate and continue.
> this code is untested and un-run
>
>
> tag-end: charset "/>"
> content: complement tag-end
> ...
> parse page [
>     ...
>     thru "<meta name=^"keywords^" content="
>     some [
>         copy token some content
>         here:                       ;;; make a pointer to where parse is
>         (append keywords token
>          all [#"/" == first :here
>               #">" <> second :here
>               append keywords "/"
>               here: next :here      ;;; move parse pointer over "/"
>          ])
>         :here                       ;;; set where parse will resume
>     ]
>     thru ">"
>     ...
> ]
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> you could detect the closing angle bracket, see if the preceding char
> is a slash, and if so remove it from the copied string.
>
> note: this is running parse once, not multiple times,
> using braces for strings that contain double quotes,
> and taking the destination for the copied content
> from the meta name=<dest>, i.e. the keywords or description block...
>
>
> parse page [
>     thru <head>
>     some [
>         thru {<META NAME="}
>         copy dest to {"} {"}
>         thru {content=}
>         copy token to ">" here: thru ">"
>         (if #"/" = first back :here [trim/with token "/"]
>          append get to-word dest token
>         )
>     ]
>     <title> copy title to </title> </title>
> ]
> print title
> print description
> print keywords
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> but ultimately I would probably start with
>
> blk: load/markup <source>
>
> which would return a block of string! and tag! values,
>
> then process the tags; if I used parse I would end up with
> a rule like
> [{<META NAME="} ... ["/>" | ">"]]
>
> note: this won't work with the
> page: read <source>
> because there may be a "/>" beyond the first ">" that closes the meta
> tag, but with load/markup each tag and string element is isolated
>
>
> hope that helps
>
>
>
> --
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
