Hi Brad,

I've attached a patch to the ticket: https://issues.apache.org/jira/browse/CONNECTORS-1215 . This patch merely tightens what the fuzzyml parser regards as a valid tag start, so that it adheres to the W3C specification. I don't know whether browsers do it that way or not, but it should fix the specific page you included in your post.
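For illustration only, here is a minimal sketch of what a stricter tag-start check could look like. It is not the code in the attached patch. The class name, the method name, and the exact rule (a '<' counts only when the very next character is an ASCII letter, '/', '!' or '?') are assumptions made for this example, and the JavaScript line in main() is a made-up stand-in for the kind of comparison described in the thread.

```java
// Illustrative sketch only; NOT the code from the CONNECTORS-1215 patch.
// TagStartSketch and looksLikeTagStart are hypothetical names.
final class TagStartSketch {

    // Assumed rule: '<' opens markup only when the character immediately
    // after it is an ASCII letter, '/', '!' or '?'. Anything else (a space,
    // a digit, '=') leaves the '<' as ordinary text.
    static boolean looksLikeTagStart(String text, int ltIndex) {
        if (ltIndex < 0 || ltIndex >= text.length() || text.charAt(ltIndex) != '<') {
            return false;
        }
        if (ltIndex + 1 >= text.length()) {
            return false; // a trailing '<' cannot start a tag
        }
        char next = text.charAt(ltIndex + 1);
        boolean asciiLetter = (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z');
        return asciiLetter || next == '/' || next == '!' || next == '?';
    }

    public static void main(String[] args) {
        // Hypothetical script line with a comparison operator.
        String js = "for (i = 0; i < arraykeywords.length; i++) {}";
        System.out.println(looksLikeTagStart(js, js.indexOf('<')));
        System.out.println(looksLikeTagStart("</script>", 0));
        System.out.println(looksLikeTagStart("<div class=\"x\">", 0));
    }
}
```

A check like this only helps when something other than a letter follows the '<'. A comparison written without a space, such as i<n, would still look like a tag start under this rule.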
Please let me know if you run into further difficulties with other pages; we can look at them one at a time.

Karl

On Wed, Jun 24, 2015 at 10:49 AM, Brad Dennis <[email protected]> wrote:

> Karl,
>
> Thank you for investigating the issue. My concern is that I expect it's
> fairly common to use '<' in embedded, uncommented, Javascript and this bug
> excludes any content that appears after one and before a second end script
> tag from being crawled with ManifoldCF. Unfortunately, I don't have any
> suggestions other than using a stack to push open tags onto and pop off
> when an end tag is seen. I believe that would satisfy your example, but
> who knows what other problems a stack brings.
>
> Do you have any suggestions for work arounds I could implement locally?
>
> Thanks,
> Brad
>
> -----Original Message-----
> From: Karl Wright [mailto:[email protected]]
> Sent: Wednesday, June 24, 2015 9:33 AM
> To: dev
> Subject: Re: Webconnector: Comparison operator '<' in the body of a script tag
>
> Brad,
>
> The issue is complex because according to spec the code is doing the right
> thing. Typically, <script> blocks look something like this:
>
> <script ...>
> <!--
>
> ...
>
> //-->
> </script>
>
> The reason for the comment area is because without it, tags within the
> script block are supposed to be recognized as such, even if they are
> ignored. Within comments, this does not happen, of course, which is why
> comments are used.
>
> I don't believe it is a real standard, but some browsers try to interpret
> script blocks differently even when no comment is given. We can try to
> emulate that behavior but it is likely that our emulation will not work for
> all web pages, since it's not a standard. Exploring how this works on
> various browsers would be the first step. Specifically, if you do
> something like this:
>
> <script ...>
>
> foo = "<script></script>";
> bar = "hello";
>
> </script>
>
> ... what happens? Does the script end at the first </script>, or the
> second? And, in what browsers?
>
> Until we get more clarity it's going to be hard to do a feature that
> actually helps rather than hurts...
>
> Karl
>
>
> On Wed, Jun 24, 2015 at 10:05 AM, Karl Wright <[email protected]> wrote:
>
> > Hi Brad,
> >
> > I've created a ticket: CONNECTORS-1215. Looking into this now.
> >
> > Karl
> >
> >
> > On Wed, Jun 24, 2015 at 9:45 AM, Brad Dennis <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> There appears to be a bug in the TagParseState when the comparison
> >> operator '<' is encountered in the body of a script tag. It
> >> appears to get flagged as an open tag and then the next '</' closes
> >> it. In my case, the next '</' is the script tag. The
> >> ScriptParseState chomps everything until it encounters a second
> >> </script> tag.
> >>
> >> A live link that demonstrates this bug is here:
> >>
> >> http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20
> >>
> >> The '<' near line 2826 in the script body that begins near line 2759
> >> begins a new tag 'arraykeywords.length' which gets closed by the '</'
> >> in the closing script tag. The ScriptParseState chomps all the html
> >> until it sees the end script tag near line 3385.
> >>
> >> At the moment, I'm not sure of a solution other than pushing the
> >> script tag handling up to the TagParseState and treating it like CDATA is.
> >>
> >>
> >> Thanks,
> >>
> >> Brad Dennis
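
The suggestion at the bottom of the thread, treating the script body like CDATA, can be sketched roughly as follows. This is a simplified illustration and not ManifoldCF code: ScriptBodySketch and findScriptEnd are hypothetical names, and the rule used here (the body is opaque text up to the first "</script" that is followed by '>', '/' or whitespace, matched case-insensitively) is an assumption that ignores any special handling of "<!--" inside scripts.

```java
// Illustrative sketch only; NOT how TagParseState or ScriptParseState work.
// ScriptBodySketch and findScriptEnd are hypothetical names.
final class ScriptBodySketch {

    // Returns the index of the first "</script" end tag at or after 'from',
    // or -1 if there is none. Nothing between 'from' and that index is
    // interpreted as markup, so a bare '<' in the script cannot open a tag.
    static int findScriptEnd(String html, int from) {
        String marker = "</script";
        for (int i = Math.max(from, 0); i + marker.length() <= html.length(); i++) {
            if (!html.regionMatches(true, i, marker, 0, marker.length())) {
                continue; // not an end-tag candidate at this position
            }
            int after = i + marker.length();
            if (after == html.length()) {
                return i; // input ends right after "</script"
            }
            char c = html.charAt(after);
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                return i; // the tag name really is "script", not e.g. "scripted"
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>if (a < b) { run(); }</script><p>kept</p>";
        int bodyStart = html.indexOf('>') + 1;
        int bodyEnd = findScriptEnd(html, bodyStart);
        System.out.println(html.substring(bodyStart, bodyEnd)); // script body
        System.out.println(html.substring(bodyEnd));            // end tag and the rest
    }
}
```

Under this simplified rule, the foo = "<script></script>" example quoted above would end at the first </script>. Whether that matches what browsers do is the open question raised in the thread.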
