[REBOL] Re: R: Re: Help on parsing

Tom Conlin Mon, 15 Mar 2004 20:49:16 -0800

Hi Giuseppe Chillemi

What you are asking for is not too much.
I can't be sure where your lines were supposed to end, but
I assume it is after the <br> after the phone number.

One way to be sure that you do not skip over too much while looking for the
fax number is to break the page into individual lines when you read it
(if it is stored with one record per line). Something like:

foreach line read/lines %file [parse line rule]
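
A minimal sketch of why this helps, assuming each record has already been
isolated in its own string: a `thru` inside one record cannot run on into the
next one, so a missing fax simply stays unset. The names `records`,
`phone-number`, `tel` and `fax` and the sample strings are illustrative only;
with a real file the block would come from read/lines.

```rebol
digit: charset "0123456789"
phone-number: [3 digit any digit " " 3 digit any digit]

;; stand-in for the block that read/lines would return
records: [
    "unusefulltext Tel.: 1234 12345678 <BR>"
    "unusefulltext Tel.: 1234 567891011 - Fax.: 1234 110198765 <BR>"
]

foreach record records [
    tel: fax: none                               ; reset for every record
    parse/all record [
        thru "Tel.: " copy tel phone-number
        opt [thru "Fax.: " copy fax phone-number]  ; bounded by this record's string
        to end
    ]
    print [tel "|" fax]
]
```

Because the search for "Fax.: " is confined to a single record, the first
record leaves fax as none instead of picking up the fax number of the second.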


But you can also parse the whole thing by building a parse rule that looks
at each line one at a time. The rule below could be made more flexible by
writing a rule to handle whitespace, e.g.  ws: [any [" " | tab]]
and sticking it in different places, but hopefully this will help.
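
A small sketch of how such a whitespace rule might be dropped into a larger
rule. The form of `ws` used here (any mix of spaces and tabs) and the sample
string are assumptions made for illustration.

```rebol
digit: charset "0123456789"
ws: [any [" " | tab]]       ; zero or more spaces or tabs, in any order

;; "^-" is a tab character inside a REBOL string
ok?: parse/all "Tel.:^-  1234 5678" ["Tel.:" ws some digit " " some digit]
probe ok?
```

Since `ws` also matches nothing at all, it can be placed wherever whitespace
is optional without breaking input that has none.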

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;;; each sequence of "N" means: 3 or more digits,
;;; with no upper limit

digit: charset {0123456789}
phone: [3 digit any digit " " 3 digit any digit]

;;; could also say
;;; phone: [3 4 digit " " 7 9 digit]
;;; or whatever counts apply, if you knew the ranges


;;; an object to store a record ... could also use a simple block
mark: make object! [
        name:    copy ""
        address: copy ""
        phone:   copy ""
        fax:     copy ""
]

;;; a block to store the objects in
marks: copy []

;;; parse rule for a line -- I just use the word 'token' out of habit
line: [ (m: make mark[])
        any newline ; there may not be one at the start/end
        "name-keyword " copy token to " " (m/name: token) " "
        thru "address-keyword"  copy token to <br> (m/address: token) <br>
        thru "Tel.: "    copy token phone (m/phone: token)
        opt [" - Fax.: " copy token phone (m/fax: token)]
        any " "
        <br>
        (append marks m)
]

;;; mind the wrap
page:
{name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <BR>
unusefulltext Tel.: 1234 12345678 <BR>
name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <br>
unusefulltext Tel.: 1234 567891011 - Fax.: 1234 110198765 <BR>
}

parse/all page [some line any newline] ; any newline eats the final line break

probe marks

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
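
Once the records are collected they can be walked like any other block. The
sketch below is self-contained, so it builds two stand-in objects shaped like
the ones the rule above would append to marks; the values are taken from the
sample page and the output layout is only an illustration.

```rebol
;; stand-in records with the same fields as the mark object
marks: reduce [
    make object! [
        name: "NAME-VALUE" address: " ADDRESS-VALUE "
        phone: "1234 12345678" fax: ""
    ]
    make object! [
        name: "NAME-VALUE" address: " ADDRESS-VALUE "
        phone: "1234 567891011" fax: "1234 110198765"
    ]
]

foreach m marks [
    print [
        "name:" m/name
        "| address:" trim copy m/address     ; copied text keeps the spaces around it
        "| phone:" m/phone
        "| fax:" either empty? m/fax ["(none)"] [m/fax]
    ]
]
```

The fax field is an empty string when the optional fax part did not match,
so empty? is enough to tell the two kinds of record apart.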



On Sun, 14 Mar 2004, Giuseppe Chillemi wrote:

>
> > > 1)  "KW1 555 <br> KW1 333 KW2 444 <br>"
> > >
> > > 2)  "KW1 555 KW2 666 <br> KW2 444 <br>"
> > >
> > >
> > >
> > > I need to extract the value of KW1 and KW2 , or KW1 itself.
> >                                                      ^ value?
>
> (Yes)
>
> The whole problem is to parse a page and its addresses with phone and fax
>
> I may have an input string formed in this way:
>
> UNUSEFULL-TEXT name-prefix-keyword(type 1 or type2) NAME address-keyword
> ADDRESS-DATA       one or more of     telephone-prefix-keyword TELEPHONE-NUM
>
>      zero or more of      "-" fax-prefix-keyword FAX-NUM        and finally
> a    <BR>
>
> The whole sequence repeats until the end of page is reached.
>
> So, let me extend the strings of the previous message:
>
> "KW1 555 <br> KW1 333 KW2 444 <br>"
>
> "KW1 555 KW2 666 <br> KW2 444 <br>"
>
> KW1 = "Tel.:"
> KW2 = "Fax."
>
> Let's use the new information. The 2 ways the address block could appear
> are:
>
> 1) "name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <BR>
> unusefulltext Tel.: NNNN NNNNNNNN <BR>"
>
> Or
>
> 2) "name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <br>
> unusefulltext Tel.: NNNN NNNNNNNN - Fax.: NNNN NNNNNNN <BR>"
>
> Note that each sequence of "N" means 3 or more digits, with no upper
> limit
>
> If I parse the following text...
>
> --- TEXT TO PARSE ---
> "name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <BR>
> unusefulltext Tel.: NNNN NNNNNNNN <BR>"
>
> (some unusefulltext)
>
> "name-Keyword NAME-VALUE unusefulltext address-keyword ADDRESS-VALUE <br>
> unusefulltext Tel.: NNNN NNNNNNNN - Fax.: NNNN NNNNNNN <BR>"
> --- END TEXT TO PARSE ---
>
> ....Using a logic like this:
>
> > ;;; a recursive parse rule to copy the value from an unknown number of
> > ;;; consecutive "KW value" pairs in a string
> > ;;; possibly separated with <br>
> >
> > rule: [
> >     "KW" ["1 " | "2 "]       ; what the parser needs to recognize
> >     copy result integer!     ; may not be integer in real case
> >     (append store result)    ; store/use result immediately
> >     opt <br>                 ; there might be a trailing <br>
> >     opt rule                 ; there might be another KW to recognize
> > ]
>
> What prevents the sequence from going on to the next address block?
>
> Searching for "FAX.:", in any way I can think of, lets the
> parse instruction move to the next address block, where it can find a FAX
> keyword!
>
> I need to find a way to tell REBOL to search for "FAX.:" before the <BR>
> keyword and not after. However I phrase the rule to myself, it is
> implicit that REBOL will search the whole text and not just up to <BR>.
>
> However, a simple solution is to split the problem into 2 problems: searching
> from "tel." to <br> and then parsing the resulting string (which may have a
> FAX inside of it) using another routine.
>
> But is there a way to solve this problem using a single parse instruction?
>
> Thanks again
>
> Giuseppe Chillemi
>
>
>
>
>
> --
> To unsubscribe from this list, just send an email to
> [EMAIL PROTECTED] with unsubscribe as the subject.
>