[REBOL] Re: How to properly parse HTML and XHTML Meta Tags

Tom Thu, 11 Sep 2008 23:33:41 -0700

[EMAIL PROTECTED] wrote:
> Hello Rebol Group,
> 
> I'm a bit new, I have a couple of the Rebol books and have gone
> over the different tutorial a few times but I'm having trouble with
> the following code of mine.
> 
> For example:
> I'm attempting to parse the meta tags but the tag can end in either
> ">" or "/>"
> 
> I've tried to write the below script a different way, over 50 times,
> but to no avail.  I don't know how to properly code it where it will
> check for either ending tag ">" or "/>"
> 
> sample meta tag:
> <meta name="description" content="Having trouble with this below script" />
> 
> The end result should look like:
> "Having trouble with this below script"
> -not-
> "Having trouble with this below script" /
> 
> If I change the script from ">" to "/>" and the meta tag is
> <meta name="description" content="Having trouble with this below script">
> 
> Then the script will not catch the ">" since it's looking for "/>"
> 
> REBOL CODE:
> page: read http://www.rebol.com     ; webpage to be parsed
>     title: []   description: []   keywords: []
>     parse page [ thru <title> copy title to </title>]
>     parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
>       title: copy ""
     description: copy []
     keywords: copy []


> 
>     print title
>     print description
>     print keywords
> 
> Thank you in advance for your assistance.
> 
> Regards,
> Von
> 


Hi Von welcome,

note 1: when you initialize words with empty strings or blocks
you *do* want to copy the empty string or block.
(otherwise they can be the *same* empty block or string)

title: copy ""
description: copy []
keywords: copy []
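
a small sketch of why the copy matters (the function names shared and fresh are made up for illustration): a literal "" inside a function body is one single string, so without copy every call keeps appending to the same series.

```rebol
REBOL []

; Without copy: the literal "" in the body is one string that
; every call modifies in place.
shared: func [/local s] [
    s: ""
    append s "x"
]

; With copy: every call starts from a fresh empty string.
fresh: func [/local s] [
    s: copy ""
    append s "x"
]

print shared    ; x
print shared    ; xx  (same string, already grown by the first call)
print fresh     ; x
print fresh     ; x
```

the same applies to a literal [] block.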


note 2: when using parse for more than simple string splitting, get used 
to using the /all refinement and handling white space yourself.
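
a quick sketch of the difference, assuming REBOL 2 string parsing (where plain parse skips white space between rule matches and /all turns that off):

```rebol
REBOL []

print parse     "a b" ["a" "b"]        ; true:  the space is skipped for you
print parse/all "a b" ["a" "b"]        ; false: the space is not matched by any rule
print parse/all "a b" ["a" " " "b"]    ; true:  the space is matched explicitly
```

with /all nothing is skipped behind your back, so a rule only matches what it spells out.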

you could define a class of chars that are neither "/" nor ">", then copy some of 
them. the downside is you would have to check whether a "/" you ran into is 
followed by ">" and, if not, append it and continue.
this code is untested and un-run


tag-end: charset "/>"
content: complement tag-end
...
parse page [
         ...
        thru "<meta name=^"keywords^" content="
        some [
            copy token some content
            here:                  ;;; mark where parse currently is
            (append keywords token
             all [#"/" == first :here
                  #">" <> second :here   ;;; <> is the not-equal operator
                  append keywords "/"
                  here: next :here  ;;; move parse pointer over "/"
             ])
            :here     ;;; set where parse will resume
        ]
        thru ">"
         ...
]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

you could detect the closing angle bracket and see if the preceding char is a 
slash, and if so remove it from the copied string.

note: this runs parse once, not multiple times,
uses braces for strings that contain double quotes,
and takes the destination for the copied content
from the meta name=<dest>, i.e. the keywords or description block...


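    parse page [
        thru <head>
        some [
                thru {<META NAME="}
                copy dest to {"} {"}
                thru {content=}
                copy token to ">" here: thru ">"
                (
                 ;; drop only the "/" of a closing "/>"; trim/with would
                 ;; also strip every "/" inside the content (e.g. in URLs)
                 if #"/" = first back :here [
                     remove back tail token
                     trim/tail token
                 ]
                 ;; skip meta names that have no matching block
                 if find ["keywords" "description"] dest [
                     append get to-word dest token
                 ]
                )
        ]
        ;; assumes <title> comes after the last meta tag
        thru <title> copy title to </title>
     ]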
     print title
     print description
     print keywords

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

but ultimately I would probably start with

blk: load/markup <source>

which would return a block of string! and tag!
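
a minimal sketch of that first step, run on a made-up sample string instead of a real page (the sample text is only an assumption for illustration):

```rebol
REBOL []

; hypothetical sample markup standing in for the real page
sample: {<title>Demo</title><meta name="keywords" content="a, b" />}

; load/markup splits the source into tag! and string! values;
; print only the tags, each one already isolated from its neighbours
foreach item load/markup sample [
    if tag? item [print mold item]
]
```

each meta tag arrives as its own tag! value, so a rule never has to look past the end of the tag it is working on.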

then process the tags; if I used parse I would end up with
a rule like
[{<META NAME="} ...  ["/>" | ">"]]

note: this won't work with the
page: read <source>
because there may be a "/>" beyond the first ">" that closes the meta 
tag but with load/markup  each tag and string element is isolated
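
a hedged sketch of that alternation rule applied to one isolated tag, given here as plain text. the function name meta-content is made up, and it assumes the content attribute is double-quoted:

```rebol
REBOL []

; Return the content="..." value of a single meta tag,
; accepting either ">" or "/>" as the tag ending.
meta-content: func [tag-text [string!] /local value] [
    value: none
    either parse/all tag-text [
        thru {content="}
        copy value to {"} {"}    ; copy up to the closing quote, then step over it
        any " "                  ; optional spaces before the ending
        ["/>" | ">"]             ; XHTML or HTML style ending
    ] [value] [none]
]

print meta-content {<meta name="description" content="Having trouble" />}
print meta-content {<meta name="description" content="Having trouble">}
```

both calls print Having trouble, because the alternation accepts whichever ending is present.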


hope that helps



