[REBOL] Re: parsing html : is this correct ?

jose Thu, 06 Jun 2002 16:44:09 -0700

I thought the bug was in parse, not remove, because I
tested this without the remove, just checking how
'parse iterates over the text string.

After looking at your example I'm quite confused. I
think more people need to look at this before we call it a bug.
We must be missing something; otherwise this would
be a significant bug!

--- Anton <[EMAIL PROTECTED]> wrote:
> Jose,
> 
> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).
> The following script shows the problem.
> Note that html and html2 differ by one character,
> the 'x' (although it doesn't seem to matter which
> character it is, just the length of the string).
> 
> 
> html:  {<script ------------------></script><script>I should be removed</script>}
> html2: {<script -----------x-------></script><script>I should be removed</script>}
> 
> html rule: [
>       any [
>               (print "~~~ any block ~~~")
>               to "<script" mark1: (?? mark1)
>               thru "/script>" mark2: (
>                       ?? mark2
>                       remove/part mark1 mark2
>                       ?? mark1
>               )
>               :mark1
>               (?? mark1)
>       ] to end
> ]
> 
> parse/all html rule
> prin "^/"
> parse/all html2 rule
> prin "^/"
> 
> ?? html
> ?? html2
> 
> halt
> 
> 
> I would like to analyse this further before making a
> bug report to feedback. Better to have more
> information.
> Anybody have any comments about this?
> 
> Anton.
> 
> 
> > I've checked the HTML manually, and the sequence of
> > tags is:
> >
> > a proper set of
> >
> > 1. <script ... </script>
> >
> > and then an orphan (unnoticed by browsers)
> >
> > 2. </script>
> >
> > and finally
> >
> > 3. <script ... </script>
> >
> > The parsing stops just before the orphan </script>,
> > which I don't understand. The rule should go beyond 2!
> >
> > You can check the real html at http://www.abc.es
> >
> > Thanks
> >
> > --- Anton <[EMAIL PROTECTED]> wrote:
> > > Jose,
> > >
> > > Your parse rule looks fine to me.
> > > I tested out your parse rule with long
> > > strings of matching <script></script> pairs,
> > > but I didn't see any problems.
> > >
> > > I would ask you to look at your input
> > > more carefully. Maybe there is something in
> > > there that tricks this rule.
> > >
> > > Do this:
> > > - Save a copy of your input.
> > > - Cut selected pieces out of your input so that it
> > >   still breaks your rule. Save each time.
> > > - When you can't cut any more out, look at what you
> > >   have left, and if you can't figure it out, post the
> > >   input here and we can have a look.
> > >
> > > Anton.
> > >
> > > > I use the following parse code to remove scripting
> > > > from the html before I do other parsing. This seems to
> > > > work fine for all pages, but I just found a page with
> > > > lots of script tags and it only removes the first 86
> > > > and leaves the last one.
> > > >
> > > > What am I doing wrong ?
> > > >
> > > > Thanks
> > > > Jose
> > > >
> > > > -----------------------------------------------
> > > > parse/all html [ any [
> > > >                       to "<script" mark1:
> > > >                       thru "/script>" mark2:
> > > >                       (remove/part mark1 mark2)
> > > >                       :mark1
> > > >                      ] to end
> > > >                 ]
> 
> -- 
> To unsubscribe from this list, please send an email
> to
> [EMAIL PROTECTED] with "unsubscribe" in the 
> subject, without the quotes.
>  
