[REBOL] Re: How to properly parse HTML and XHTML Meta Tags

vonja Fri, 12 Sep 2008 00:06:37 -0700

Thanks Tom,
I kept plugging away and came up with what I believe
is a working script.  It's going to take some time for
me to digest what you've written.  I'll play around
with yours tomorrow; I really appreciate your help!
I've updated my script per your note 1 :-)
Here's what I came up with right before you sent
your reply.


page: read http://www.rebol.com   ; webpage to be parsed
title: copy ""   description: copy []  keywords: copy []

parse page [thru <title> copy title to </title>]
print title

; grab everything between content= and the closing ">"
parse page [thru "<meta name=^"keywords^" content=" copy keywords to ">"]
either not none? (find/last keywords "/") [
    ; XHTML style: drop the "/" if it is the last character before ">"
    keywords: tail keywords
    keywords-tail: skip keywords -1
    if keywords-tail = "/" [keywords: remove keywords-tail]
    print head keywords
][
    either empty? keywords [print "blank"] [print keywords]
]

; same steps for the description
parse page [thru "<meta name=^"description^" content=" copy description to ">"]
either not none? (find/last description "/") [
    description: tail description
    description-tail: skip description -1
    if description-tail = "/" [description: remove description-tail]
    print head description
][
    either empty? description [print "blank"] [print description]
]
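
Since the keywords and description steps above are the same apart from the tag name, they could be folded into one function. This is only a sketch for REBOL 2: the name `meta-content` and the sample string are made up for illustration, and the returned value still includes the surrounding double quotes, just as in the script above.

```rebol
REBOL []

; Hypothetical helper (not part of the original script): return the text
; after content= for the named meta tag, minus a trailing XHTML "/",
; or none when the tag is not found.
meta-content: func [page [string!] name [string!] /local start value] [
    value: none
    ; build the literal first; the parse dialect does not evaluate REJOIN
    start: rejoin [{<meta name="} name {" content=}]
    parse/all page [thru start copy value to ">"]
    if none? value [return none]
    trim value                          ; drop leading/trailing whitespace
    if all [not empty? value  #"/" = last value] [
        remove back tail value          ; drop the "/" of "/>"
        trim value
    ]
    value
]

; made-up sample input
sample: {<meta name="keywords" content="rebol, parse" />}
print mold meta-content sample "keywords"
print mold meta-content sample "description"   ; no such tag in the sample
```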

===============================================
----- Original Message ----- 
From: "Tom" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 11, 2008 11:32 PM
Subject: [REBOL] Re: How to properly parse HTML and XHTML Meta Tags



> Hi Von welcome,
>
> note 1: when you initialize words with empty strings or blocks
> you *do* want to copy the empty string or block.
> (otherwise they can be the *same* empty block or string)
>
> title: copy ""
> description: copy []
> keywords: copy []
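
To illustrate note 1, here is a small sketch of what goes wrong without `copy`, assuming REBOL 2. The two function names are made up for the example; the point is that a literal `[]` inside a function body is one and the same block on every call.

```rebol
REBOL []

; Hypothetical demo functions, not from the original messages.

; Without COPY, ITEMS refers to the literal block stored in the function
; body, so values pile up across calls.
collect-shared: func [value /local items] [
    items: []
    append items value
]

; With COPY, every call starts from a fresh empty block.
collect-fresh: func [value /local items] [
    items: copy []
    append items value
]

collect-shared 1
probe collect-shared 2   ; [1 2]
collect-fresh 1
probe collect-fresh 2    ; [2]
```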
>
>
> note 2: when using parse for more than simple string splitting, get used
> to using the /all refinement and handling white space yourself.
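
A quick way to see what /all changes, assuming REBOL 2 string parsing (the inputs are made up):

```rebol
REBOL []

; Without /all, PARSE skips whitespace between rule items in a string.
probe parse "a b" ["a" "b"]            ; true

; With /all, whitespace is ordinary input that the rules must match.
probe parse/all "a b" ["a" "b"]        ; false
probe parse/all "a b" ["a" " " "b"]    ; true
```

That matters for meta tags because the space before "/>" and any spaces inside the content value are then seen by the rules instead of being skipped silently.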
>
> you could define a class of chars that are not "/" or ">", then copy some of
> them. the downside is you would have to check whether a "/" you ran into was
> followed by ">" and, if not, concatenate and continue.
> this code is untested and un-run
>
>
> tag-end: charset "/>"
> content: complement tag-end
> ...
> parse page [
>     ...
>     thru "<meta name=^"keywords^" content="
>     some [
>         copy token some content
>         here:                       ;;; make a pointer to where parse is
>         (
>             append keywords token
>             all [
>                 #"/" == first :here
>                 #">" <> second :here
>                 append keywords "/"
>                 here: next :here    ;;; move parse pointer over "/"
>             ]
>         )
>         :here                       ;;; set where parse will resume
>     ]
>     thru ">"
>     ...
> ]
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> you could detect the closing angle bracket and see if the preceding char
> is a slash, and if so remove it from the copied string.
>
> note: this runs parse once rather than multiple times,
> uses braces for strings that contain double quotes,
> and takes the destination for the copied content
> from the meta name=<dest>, i.e. the keywords or description block...
>
>
> parse page [
>     thru <head>
>     some [
>         thru {<META NAME="}
>         copy dest to {"} {"}
>         thru {content=}
>         copy token to ">" here: thru ">"
>         (
>             if #"/" = first back :here [trim/with token "/"]
>             append get to-word dest token
>         )
>     ]
>     <title> copy title to </title> tag!
> ]
> print title
> print description
> print keywords
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> but ultimately I would probably start with
>
> blk: load/markup <source>
>
> which would return a block of string! and tag!
>
> then process the tags; if I used parse I would end
> the rule with something like
> [{<META NAME="} ...  ["/>" | ">"]]
>
> note: this won't work with
> page: read <source>
> because there may be a "/>" beyond the first ">" that closes the meta
> tag, but with load/markup each tag and string element is isolated
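
A minimal sketch of the load/markup route, assuming REBOL 2. The `html` string is a made-up sample standing in for the real page, and the rule assumes the attributes appear exactly as name="..." content="...". Because every tag arrives as its own tag! value, a trailing "/" can only belong to that tag.

```rebol
REBOL []

; Made-up sample document; in practice this would be the downloaded page.
html: {<head><title>Demo</title><meta name="keywords" content="rebol, parse" /></head>}

blk: load/markup html    ; block of tag! and string! values

foreach item blk [
    if tag? item [
        name: value: none
        ; a tag! holds the text between "<" and ">", so the rule starts at "meta"
        if parse/all item [
            {meta name="} copy name to {"}
            {" content="} copy value to {"}
            to end                       ; ignore the closing quote and any " /"
        ][
            print [name ":" value]
        ]
    ]
]
```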
>
>
> hope that helps
>
>
>
> -- 
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
> 

