Karl,

The patch is working.  Thank you very much!  Also, thank you for your 
clarification on the behavior of the parser.  It's pretty complex.

Brad

-----Original Message-----
From: Karl Wright [mailto:[email protected]] 
Sent: Wednesday, June 24, 2015 10:04 AM
To: dev
Subject: Re: Webconnector: Comparison operator '<' in the body of a script tag

Hi Brad,

I've attached a patch to the ticket:
https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch merely 
tightens what the fuzzyml parser regards as a valid tag start, to adhere to the 
W3C specification.  I don't know whether browsers do it that way or not, but it 
should fix the specific page you included in your post.
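
For illustration only, here is a rough Python sketch of what a stricter 
tag-start check could look like.  It is not the code in the patch (ManifoldCF 
is Java).  The function name is_tag_start and the exact rule are assumptions: 
a '<' only counts as a tag start when it is immediately followed by an ASCII 
letter, by '/' plus a letter, or by '!' or '?'.

```python
def is_tag_start(text: str, i: int) -> bool:
    """Return True if the '<' at text[i] looks like the start of a tag.

    Hypothetical rule (an assumption, not taken from the patch): the
    character right after '<' must be an ASCII letter, '!' or '?', or
    '/' followed by an ASCII letter.  No whitespace is allowed in between.
    """
    if i >= len(text) or text[i] != "<":
        return False
    nxt = text[i + 1 : i + 2]          # empty string at end of input
    if nxt == "/":                     # candidate end tag: need a letter next
        nxt = text[i + 2 : i + 3]
        return nxt.isascii() and nxt.isalpha()
    if nxt in ("!", "?"):              # comment, doctype, processing instruction
        return True
    return nxt.isascii() and nxt.isalpha()


# A comparison written with spaces is no longer mistaken for a tag...
print(is_tag_start("for (i = 0; i < n; i++)", 14))
# ...but a real tag still is.
print(is_tag_start("<script>", 0))
```

Note that under this assumed rule a comparison written without spaces, such as 
i<n, would still look like a tag start, so a check of this kind narrows the 
problem rather than eliminating it.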

Please let me know if you run into further difficulties with other pages; we 
can look at them one at a time.

Karl


On Wed, Jun 24, 2015 at 10:49 AM, Brad Dennis <[email protected]>
wrote:

> Karl,
>
> Thank you for investigating the issue.  My concern is that I expect 
> it's fairly common to use '<' in embedded, uncommented JavaScript and 
> this bug excludes any content that appears after one and before a 
> second end script tag from being crawled with ManifoldCF.  
> Unfortunately, I don't have any suggestions other than using a stack 
> to push open tags onto and pop off when an end tag is seen.  I believe 
> that would satisfy your example, but who knows what other problems a stack 
> brings.
>
> Do you have any suggestions for workarounds I could implement locally?
>
> Thanks,
> Brad
>
> -----Original Message-----
> From: Karl Wright [mailto:[email protected]]
> Sent: Wednesday, June 24, 2015 9:33 AM
> To: dev
> Subject: Re: Webconnector: Comparison operator '<' in the body of a 
> script tag
>
> Brad,
>
> The issue is complex because according to spec the code is doing the 
> right thing.  Typically, <script> blocks look something like this:
>
> <script ...>
> <!--
>
> ...
>
> //-->
> </script>
>
> The reason for the comment area is because without it, tags within the 
> script block are supposed to be recognized as such, even if they are 
> ignored.  Within comments, this does not happen, of course, which is 
> why comments are used.
>
> I don't believe it is a real standard, but some browsers try to 
> interpret script blocks differently even when no comment is given.  We 
> can try to emulate that behavior but it is likely that our emulation 
> will not work for all web pages, since it's not a standard.  Exploring 
> how this works on various browsers would be the first step.  
> Specifically, if you do something like this:
>
> <script ...>
>
> foo = "<script></script>";
> bar = "hello";
>
> </script>
>
> ... what happens?  Does the script end at the first </script>, or the 
> second?  And, in what browsers?
>
> Until we get more clarity it's going to be hard to build a feature that 
> actually helps rather than hurts...
>
> Karl
>
>
> On Wed, Jun 24, 2015 at 10:05 AM, Karl Wright <[email protected]> wrote:
>
> > Hi Brad,
> >
> > I've created a ticket: CONNECTORS-1215.  Looking into this now.
> >
> > Karl
> >
> >
> > On Wed, Jun 24, 2015 at 9:45 AM, Brad Dennis <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> There appears to be a bug in the TagParseState when the comparison 
> >> operator '<' is encountered in the body of a script tag.  It 
> >> appears to get flagged as an open tag and then the next '</' closes 
> >> it.  In my case, the next '</' is the script tag.  The 
> >> ScriptParseState chomps everything until it encounters a second
> >> </script> tag.
> >>
> >> A live link that demonstrates this bug is here:
> >>
> >> http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20
> >>
> >> The '<' near line 2826 in the script body that begins near line 2759
> >> begins a new tag 'arraykeywords.length' which gets closed by the '</'
> >> in the closing script tag.  The ScriptParseState chomps all the 
> >> html until it sees the end script tag near line 3385.
> >>
> >> At the moment, I'm not sure of a solution other than pushing the 
> >> script tag handling up to the TagParseState and treating it like 
> >> CDATA is.
> >>
> >>
> >> Thanks,
> >>
> >> Brad Dennis
> >>
> >>
> >>
> >
>
