Julien Massiera created CONNECTORS-1623:
-------------------------------------------

             Summary: Script tags not ignored
                 Key: CONNECTORS-1623
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
    Affects Versions: ManifoldCF 2.13
            Reporter: Julien Massiera


I discovered problematic behavior in the 
org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when crawling 
web pages. It is a problem in particular for form-based authentication, as 
explained below. 

The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, which 
TagParseState calls through the noteTag() and noteEndTag() methods, uses the 
org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState class to 
detect whether parsing is currently inside or outside a 'script' tag, and then 
decides whether to act on the incoming data. The problem is that the 
TagParseState class is not aware of the type of tag currently being parsed, so 
it keeps analyzing every char it encounters to detect tags, even while it is 
actually inside a script tag. 

So let's imagine you have a script tag built like this in a web page: 
{code:html}
<script>if(myvar <= 9) {.......}</script>
{code}

When TagParseState parses the char '<', it considers that a new tag begins and 
continues until it encounters a '>' char. So in the case above, TagParseState 
never catches the end of the script tag. As a result, the scriptParseState 
variable in the ScriptParseState class remains in the 
SCRIPTPARSESTATE_INSCRIPT state, and the rest of the web page is not handled 
correctly by the other parsers. 
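
To make the failure concrete, here is a minimal, self-contained sketch of a 
scanner that treats every '<' as the start of a tag and every following '>' as 
its end. It is not the ManifoldCF code: the NaiveTagScanner class and its 
scanTags() method are hypothetical names used only for illustration, and the 
sketch assumes only the behavior described above. 

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the ManifoldCF TagParseState.
public class NaiveTagScanner {

    // Collects the raw text found between each '<' and the next '>'.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        StringBuilder current = null; // non-null while inside '<' ... '>'
        for (char c : html.toCharArray()) {
            if (current == null) {
                if (c == '<') {
                    current = new StringBuilder(); // a "tag" starts here
                }
            } else if (c == '>') {
                tags.add(current.toString());      // the "tag" ends here
                current = null;
            } else {
                current.append(c);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {}</script><form></form>";
        System.out.println(scanTags(page));
    }
}
```

On this input the sketch reports the tags script, "= 9) {}</script", form and 
/form. The '<' of the comparison swallows the closing script tag, so no 
"/script" end tag is ever reported, which is the situation that leaves a 
script-tracking state stuck "in script". 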

As a result, if you configure form authentication for your crawl, for example, 
and the form web page contains this kind of script tag before the form tag, 
the form is never handled and the authentication fails. This is the case I 
encountered, and I resolved it by forcing the scriptParseState to be 
SCRIPTPARSESTATE_NORMAL.
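
One possible direction for a more general fix, given here only as a hedged 
sketch and not as a patch against ManifoldCF, is to make the tag scanner aware 
of script content: after an opening script tag, skip ahead to the next literal 
"</script" instead of interpreting '<' characters in between. The 
ScriptAwareTagScanner class and its helper methods are hypothetical names, and 
the sketch ignores details such as comments and attribute quoting. 

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of ManifoldCF.
public class ScriptAwareTagScanner {

    // Collects the raw text of each tag, treating script bodies as opaque.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            String tag = html.substring(open + 1, close);
            tags.add(tag);
            i = close + 1;
            if (isScriptStart(tag)) {
                // Skip the script body without looking for tags inside it.
                int end = indexOfIgnoreCase(html, "</script", i);
                if (end < 0) break; // unterminated script: stop scanning
                i = end;
            }
        }
        return tags;
    }

    // True for "script" or "script ..." (with attributes), case-insensitive.
    static boolean isScriptStart(String tag) {
        String t = tag.trim();
        if (t.equalsIgnoreCase("script")) return true;
        return t.length() > 6
            && t.regionMatches(true, 0, "script", 0, 6)
            && Character.isWhitespace(t.charAt(6));
    }

    // Case-insensitive indexOf that never shifts character positions.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {}</script><form></form>";
        System.out.println(scanTags(page));
    }
}
```

With the same kind of input as the example in this report, the sketch reports 
script, /script, form and /form, so a parser tracking script state would see 
the end of the script and go on to handle the form. 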

Ref: 
[http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)