[
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159945#comment-13159945
]
Karl Wright commented on CONNECTORS-275:
----------------------------------------
Ok, let's start with some basics. First, the goals of all the setup you have
to go through are as follows:
(1) Identify what site, or part of the site, has protected content;
(2) Identify which http/https fetches are not content, but are in fact part of
a "login sequence", which a normal person has to go through to get the
appropriate cookies.
One of the regexps you supply (the first one) basically describes the set of
URLs for which the content is protected, and for which the right cookies have
to be in place for you to get at the "real" content. Once you've specified
this, then for each protection zone (described by its URL regexp), you need to
specify how ManifoldCF should identify whether a given fetch should be
considered part of the login sequence or not. It's not enough to just identify
the URLs of login pages, since (for instance) if your session has expired, a
fetch may well return a redirection instead of the content you want. So
you specify each class of login page as one of three types, using not only the
URL to identify the class (this is where you get the second regexp), but also
something about what is on the page: whether it is a redirection to a URL (yes,
again described by a URL regexp), whether it has a form with a specified name
(described by a regexp), or whether it has a specific link on it (once again,
described by a regexp).
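
To make the matching rule above concrete, here is a minimal Python sketch of
how a fetch might be classified. It is not ManifoldCF's implementation: the
names LoginPageSpec, FetchResult, and classify are invented for illustration,
and the use of substring-style re.search is an assumption, since the matching
mode is not stated here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoginPageSpec:
    """One login-page class (hypothetical structure, not ManifoldCF's)."""
    url_regexp: str    # the second regexp: which URLs this class applies to
    page_type: str     # "redirection", "form", or "link"
    match_regexp: str  # redirect target, form name, or link target


@dataclass
class FetchResult:
    """What a single fetch returned (hypothetical structure)."""
    url: str
    redirect_target: Optional[str] = None
    form_names: List[str] = field(default_factory=list)
    link_targets: List[str] = field(default_factory=list)


def classify(fetch: FetchResult,
             specs: List[LoginPageSpec]) -> Optional[LoginPageSpec]:
    """Return the first spec matching both URL and page content, else None."""
    for spec in specs:
        if not re.search(spec.url_regexp, fetch.url):
            continue
        if spec.page_type == "redirection":
            candidates = [fetch.redirect_target] if fetch.redirect_target else []
        elif spec.page_type == "form":
            candidates = fetch.form_names
        else:  # "link"
            candidates = fetch.link_targets
        if any(re.search(spec.match_regexp, c) for c in candidates):
            return spec
    return None  # not part of the login sequence: treat as real content
```

With a "redirection" spec whose URL regexp is /product and whose target regexp
is /Main\.asp$, a fetch of http://site.com/product?id=1234 that redirects to
http://site.com/Main.asp is classified as a login page, while the same URL
returning ordinary content is not.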
You will note in all three cases above that there is an implicit flow through
the login sequence that you can describe as part of specifying the login
sequence. For example, if upon session timeout you expect to see a redirection
to a link, or family of links (remember, it's regexp, so you can describe that
easily), then as part of identifying the redirection as belonging to the login
sequence, the web connector also now has a new link to fetch. And this is what
it does. The same applies to forms; if the form name that was specified is
found, then the web connector submits that form using values for the form
elements that you specify, and using the submission type actually mentioned on
the form page (GET, POST, or multi-part). Any other elements of the form are
left at whatever values the HTML specified; no Javascript is evaluated. So if you
think a form element's value is being set by Javascript, you have to figure out
what it is being set to and enter this value by hand as part of the
specification for the "form" type of login page. Usually this only amounts to
a user name and password.
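
As a rough illustration of the form handling just described, the Python sketch
below reads the defaults from the HTML and then applies the values you
specify. The names FormDefaults and build_submission are hypothetical, only
<input> elements are handled, and the encoding type (multi-part or not) is
ignored for brevity; the real connector's behavior may differ in detail.

```python
import re
from html.parser import HTMLParser
from typing import Dict, Tuple


class FormDefaults(HTMLParser):
    """Collect the method and <input> defaults of the form matching a name regexp."""

    def __init__(self, form_name_regexp: str):
        super().__init__()
        self.form_name_regexp = form_name_regexp
        self.in_form = False
        self.method = "GET"          # HTML default when no method is given
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            name = a.get("name") or ""
            self.in_form = re.search(self.form_name_regexp, name) is not None
            if self.in_form:
                self.method = (a.get("method") or "GET").upper()
        elif tag == "input" and self.in_form and a.get("name"):
            # Keep whatever value the HTML specified; no Javascript is run.
            self.fields[a["name"]] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_submission(html: str, form_name_regexp: str,
                     overrides: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (method, values): HTML defaults with the specified overrides applied."""
    parser = FormDefaults(form_name_regexp)
    parser.feed(html)
    values = dict(parser.fields)
    values.update(overrides)  # e.g. user name and password entered by hand
    return parser.method, values
```

A hidden field that the page sets statically keeps its HTML value, while any
field whose value would normally be filled in by Javascript has to appear in
the overrides.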
As for your site, which redirects you to a page when the session has expired,
you would need two specifications for login pages to cover the situation - one
for the redirection itself, and one for the page that you get redirected to.
Usually in these situations the target page has at least a link on it that
takes you back to the main login form, and that is what you'd use to identify
it (it would be a 'link' type login page, where you'd specify the target URL of
the link itself using a regexp). As I said before, if there is no way at all
to navigate back from a session expiration to the login form, and the user has
to just type the login page URL into his browser again, then the web connector
will need another type of login page to model this behavior. It's not hard to
add and I'm willing to do it, but first I really want to know if there are
production sites out there that are so user-unfriendly. ;-)
Now, to answer some of your specific questions:
bq. A further complication is that, on many of the sites, some of the
"redirects" and other actions are done with Javascript.
Javascript can only execute after the page is loaded, while a true redirection
(which is done by a return code) precludes any Javascript execution. If you
believe that redirection is handled by Javascript on this site, that implies
that the site's content pages actually load, and then Javascript decides
there's been a session timeout, and redirects you away. But the content is in
fact all there, and there would be no need to log in at all to crawl the site.
That can't be right! You'll want to research exactly what is happening; I
recommend LiveHttpHeaders on Firefox to see the exchange in detail.
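
When reading the captured headers, the thing to check is whether the response
itself is a redirection, which shows up as a 3xx status code plus a Location
header. The Python sketch below expresses that test; the function name
true_redirect_target is made up for this example and is not part of any
ManifoldCF API.

```python
from typing import Mapping, Optional

# Status codes that redirect by way of a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def true_redirect_target(status: int,
                         headers: Mapping[str, str]) -> Optional[str]:
    """Return the Location of a real HTTP redirection, or None.

    A 200 response whose page later navigates away via Javascript is not a
    redirection in this sense: the page body has already been delivered.
    """
    if status not in REDIRECT_CODES:
        return None
    for name, value in headers.items():
        if name.lower() == "location":  # header names are case-insensitive
            return value
    return None
```

If this returns None for the fetch you believed was "redirected", the page is
loading first and something on it is doing the navigation afterwards.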
bq. The previous instance of the app we're rewriting was using WebHarvest,
which seemed to have a single "magic" boolean flag for handling some of the
Javascript appropriately, though I don't know the mechanics of it.
I'm afraid "magic" is above my pay grade at the moment. If you learn what they
are doing we can look into it further.
bq. I think it gives some other warning page, maybe with a link, which might be
a javascript link, not sure.
If everything runs through javascript, that's a problem. For 'link' type login
pages, the web connector only looks for html that it recognizes as a link, e.g.
<a href="...">..</a>. While you could identify the "session expired" page
using a link that executes a javascript function, the web connector could not
actually execute that javascript, so it would not know where to go
next.
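
To check a page by hand, the following Python sketch separates anchors with an
ordinary href from anchors whose href is a javascript: pseudo-URL. It is only
an approximation of what "recognized as a link" means; the class name
LinkCollector is hypothetical and the connector's own parser may treat edge
cases differently.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Split <a href="..."> targets into followable and javascript-only."""

    def __init__(self):
        super().__init__()
        self.followable: List[str] = []
        self.javascript_only: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # an anchor with only onclick gives nothing to fetch
        if href.strip().lower().startswith("javascript:"):
            # Visible in the HTML, but there is no URL here to go to next.
            self.javascript_only.append(href)
        else:
            self.followable.append(href)
```

If the only path back to the login form ends up in javascript_only, a 'link'
type login page has nothing it can fetch next.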
I suggest you do enough research, or point me at the site if it's public, in
order to understand what the site is doing before going further.
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
> Key: CONNECTORS-275
> URL: https://issues.apache.org/jira/browse/CONNECTORS-275
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Documentation, Web connector
> Affects Versions: ManifoldCF 0.4
> Reporter: Karl Wright
>
> A book reader has this comment, which basically implies that we need to
> improve the documentation for the web connector:
> "I was excited to get the full version of the online book, but then
> disappointed when it referred back to the online doc for setting up logins
> for a Web spidering. The online doc is very vague and only gives one example.
> I've used Ultraseek's and Google's spider, but I still find the Session login
> sequences non-obvious.
> I've got a subscription request into the user mailing list, but here's the
> parts that are not clear.
> I generally understand about using regexes to define sites and sorting out
> content pages from login pages.
> But it's not clear why there's TWO Regex's per entry. There's a "Login URL"
> regex, and also a "Form name/link target" regex.
> It's also not clear about the "page type" radio button choices.
> For "rediection", am I saying "look for a redirect event", or am I saying
> "then DO a redirect to this page".
> And for "form name", what if my login page doesn't have a named form? In the
> case of the site I'm trying to spider, when your session expires, you
> manually go back to an https page and supply your username and password as
> CGI parameters. I know this sounds odd, but it's apparently how a number of
> the sites we're trying to spider work, some proprietary software.
> Karl, I really think the book or Wiki or doc needs 3 or 4 different examples
> of login scenarios.
> Here's the scenario I'm trying, if you'd like to use it:
> Try to fetch: http://site.com/product?id=1234
> If you get a redirect to: http://site.com/Main.asp
> Note that there's no login form nor link on this page.
> Then invoke this login URL:
> https://site.com/validate?username=me&password=that&otherArg=something
> Note that you can't just visit this page and fill in a form, that gives an
> error, it has to be passed in (I think as a GET)
> Then record the session cookie and try for /product?id=1234 again.
> I realize this is odd, I didn't design it. "