[REBOL] Re: How to properly parse HTML and XHTML Meta Tags

Christian Ensel Fri, 12 Sep 2008 00:43:24 -0700

Hi Von,

In your particular case, it doesn't seem necessary to go through the >
or /> hassle if you rely on " as the delimiter.
But keep in mind that in many, many cases the solution below, as well
as yours, will fail, e.g. when the content and name attributes are
given in reverse order, which is valid HTML, too.


However, have a look at the following PARSE-METATAGS.

HTH,
Christian

------------------------------------------------------------------------

parse-metatags: func [page [url!] /local title keywords description] [
    page: read page    ; fetch the URL passed in, not a hard-coded one

    parse page [thru <title> copy title to </title>]
    parse/all page [thru {<meta name="keywords" content="} copy keywords to {"}]
    parse/all page [thru {<meta name="description" content="} copy description to {"}]

    foreach keyword keywords: parse/all any [keywords ""] "," [trim keyword]

    reduce [
        'title title
        'keywords keywords
        'description description
    ]
]

 >> parse-metatags http://www.rebol.com
== [
    title "REBOL Technologies"
    keywords ["REBOL" "Web 3.0" "Web 2.0" "programming" "Internet" "software" "domain specific language" "distributed computing" "collaboration" "operating systems" "development" "rebel"]
    description {REBOL: a Web 3.0 language and system based on new lightweight computing methods. Site includes products, downloads, documentation, and support.}
]
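
For the reverse-order case mentioned above, here is a minimal, untested
sketch that doesn't depend on attribute order. It isolates the tags
with LOAD/MARKUP and then looks for the name and content attributes in
two separate PARSE passes over each tag. FIND-META and its argument
names are made up for illustration, and it assumes double-quoted
attribute values; anything starting with "meta" is treated as a meta
tag.

```rebol
find-meta: func [
    {Sketch: return the content of the named meta tag, or none.}
    html [string!] "Page source, e.g. the result of READ"
    wanted [string!] "Value of the name attribute to look for"
    /local name content
][
    ; LOAD/MARKUP splits the source into string! and tag! values
    foreach item load/markup html [
        if all [tag? item  find/match item "meta"] [
            name: content: none
            ; Two independent passes, so attribute order does not matter
            parse/all item [thru {name="} copy name to {"}]
            parse/all item [thru {content="} copy content to {"}]
            if all [name  name = wanted] [return content]
        ]
    ]
    none
]

; Hypothetical usage with the attributes in reverse order:
probe find-meta {<meta content="a, b" name="keywords" />} "keywords"
```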





[EMAIL PROTECTED] wrote:
> Thanks Tom,
> I kept on plugging away and came up with I believe
> a working script.  It's going to take some time for
> me to digest what you've written me.  I'll play around
> with yours tomorrow; I really appreciate your help!
> I've updated note 1 that you had provided me :-)
> Here's what I came up with right before you sent
> your reply.
>
> page: read http://www.rebol.com   ; webpage to be parsed
>     title: copy ""   description: copy []  keywords: copy []
>     parse page [ thru <title> copy title to </title>]
>     print title
>
>     parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
>     either not none? (find/last keywords "/") [
>     keywords: tail keywords
>     keywords-tail: skip keywords -1
>     if keywords-tail = "/" [keywords: remove keywords-tail]
>     print head keywords
>     ][if/else empty? keywords [print "blank"][print keywords]]
>
>     parse page [ thru "<meta name=^"description^" content=" copy description to ">" ]
>     either not none? (find/last description "/") [
>     description: tail description
>     description-tail: skip description -1
>     if description-tail = "/" [description: remove description-tail]
>     print head description
>     ][if/else empty? description [print "blank"][print description]]
>
> ===============================================
> ----- Original Message ----- 
> From: "Tom" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Thursday, September 11, 2008 11:32 PM
> Subject: [REBOL] Re: How to properly parse HTML and XHTML Meta Tags
>
>
>
>   
>> Hi Von welcome,
>>
>> note 1: when you initialize words with empty strings or blocks
>> you *do* want to copy the empty string or block.
>> (otherwise they can be the *same* empty block or string)
>>
>> title: copy ""
>> description: copy []
>> keywords: copy []
>>
>>
>> note 2: when using parse for more than simple string splitting, get used
>> to using the /all refinement and handling white space yourself.
>>
>> you could define a class of chars that are not "/>", then copy some of
>> them. downside is you would have to check if a "/" you ran into was
>> followed by ">" and if not concatenate and continue.
>> this code is untested and un-run
>>
>>
>> tag-end: charset "/>"
>> content: complement tag-end
>> ...
>> parse page [
>>         ...
>> thru "<meta name=^"keywords^" content="
>> some[
>>             copy token some content
>>     here:                  ;;; make a pointer to where parse is
>>     (append keywords token
>>      all[#"/" == first :here
>> #">" != second :here
>> append keywords "/"
>>                  here: next :here  ;;; move parse pointer over "/"
>>      ])
>>              :here ;;; set where parse will resume
>> ]
>>   thru ">"
>>         ...
>> ]
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>> you could detect the closing angle and see if the preceding char is a
>> slash, and if so remove it from the copied string.
>>
>> note: this is running parse once not multiple times
>> using braces for string that contain double quotes
>> and taking the destination for the content copied
>> from the meta name=<dest> i.e keyword or description block...
>>
>>
>>    parse page [
>>     thru <head>
>>     some[
>>     thru {<META NAME="}
>>     copy dest to {"} {"}
>>     thru {content=}
>>     copy token to ">" here: thru ">"
>>     (if #"/" = first back :here [trim/with token "/"]
>>     append get to-word dest token
>>     )
>>     ]
>>     <title> copy title to </title> tag!
>>     ]
>>     print title
>>     print description
>>     print keywords
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>> but ultimately I would probably start with
>>
>> blk: load/markup <source>
>>
>> which would return a block of string! and tag!
>>
>> then process the tags; if I used parse I would end with
>> the rule  like
>> [{<META NAME="} ...  ["/>" | ">"]]
>>
>> note: this won't work with the
>> page: read <source>
>> because there may be a "/>" beyond the first ">" that closes the meta
>> tag but with load/markup  each tag and string element is isolated
>>
>>
>> hope that helps
>>
>>
>>
>> -- 
>> To unsubscribe from the list, just send an email to
>> lists at rebol.com with unsubscribe as the subject.
>>
>>     
>
>   

