[REBOL] Re: parsing html : is this correct ?

Anton Thu, 06 Jun 2002 11:04:44 -0700

Jose,

Well done, I think you have discovered a bug in 'parse
(or possibly in 'remove).
The following script shows the problem.
Note that html and html2 differ by only one character,
the 'x' (although which character it is doesn't seem to
matter, only the length of the string).



html:  {<script ------------------></script><script>I should be
removed</script>}
html2: {<script -----------x-------></script><script>I should be
removed</script>}

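rule: [
        any [
                (print "~~~ any block ~~~")
                ; mark1 = start of the next "<script" tag
                to "<script" mark1: (?? mark1)
                ; mark2 = just past the closing "/script>"
                thru "/script>" mark2: (
                        ?? mark2
                        ; delete the whole script element in place
                        remove/part mark1 mark2
                        ?? mark1
                )
                ; resume parsing where the element used to start
                :mark1
                (?? mark1)
        ] to end
]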

parse/all html rule
prin "^/"
parse/all html2 rule
prin "^/"

?? html
?? html2

halt


I would like to analyse this further before making a
bug report to feedback. Better to have more information.
Anybody have any comments about this?
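
In the meantime, one possible workaround is to avoid modifying
the string while 'parse is still walking over it. Below is a
minimal sketch that copies everything outside the script elements
into a new string instead of calling 'remove. The name
strip-scripts is made up for this example, and the sketch assumes
REBOL 2 'parse behaviour (where 'copy sets the word to none when
nothing is matched). It has not been tried on the real page.

```rebol
strip-scripts: func [
    "Return a copy of html with <script ...>...</script> elements left out."
    html [string!]
    /local out chunk
][
    out: copy ""
    parse/all html [
        any [
            ; text before the next script element
            copy chunk to "<script"
            ; skip the element itself; if it is never closed, this
            ; fails and the rest of the input is kept by the rule below
            thru "/script>"
            ; chunk is none when the element starts right at the
            ; current position, so guard the append
            (if chunk [append out chunk])
        ]
        ; whatever follows the last complete script element
        copy chunk to end (if chunk [append out chunk])
    ]
    out
]
```

It would be used as html: strip-scripts html. Since the input is
never changed during the parse, the marks and the :mark1 reset are
not needed at all.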

Anton.


> I've checked the HTML manually and the sequence of
> tags is:
>
> 1. A proper pair: <script ... </script>
>
> 2. Then an orphan (unnoticed by browsers): </script>
>
> 3. And finally: <script ... </script>
>
> The parsing stops just before the orphan </script>,
> which I don't understand. The rule should go beyond 2!
>
> You can check the real html at http://www.abc.es
>
> Thanks
>
> --- Anton <[EMAIL PROTECTED]> wrote:
> > Jose,
> >
> > Your parse rule looks fine to me.
> > I tested out your parse rule with long
> > strings of matching <script></script> pairs,
> > but I didn't see any problems.
> >
> > I would ask you to look at your input
> > more carefully. Maybe there is something in
> > there that tricks this rule.
> >
> > Do this:
> > - Save a copy of your input.
> > - Cut selected pieces out of your input so that it still
> >   breaks your rule. Save each time.
> > - When you can't cut any more out, look at what you
> >   have left, and if you can't figure it out, post the
> >   input here and we can have a look.
> >
> > Anton.
> >
> > > I use the following parse code to remove scripting
> > > from the html before I do other parsing. This seems to
> > > work fine for all pages, but I just found a page with
> > > lots of script tags and it only removes the first 86
> > > and leaves the last one.
> > >
> > > What am I doing wrong?
> > >
> > > Thanks
> > > Jose
> > > -----------------------------------------------
> > > parse/all html [ any [
> > >                       to "<script" mark1:
> > >                       thru "/script>" mark2:
> > >                       (remove/part mark1 mark2)
> > >                       :mark1
> > >                      ] to end
> > >                 ]
