[
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161116#comment-13161116
]
Mark Bennett commented on CONNECTORS-275:
-----------------------------------------
Hi Karl,
Agreed on pretty much every point; I'll need to do some research this weekend.
The additional explanations about regexes are actually very helpful.
In thinking more about this, I may have misspoken earlier about what happens
when a login is needed. Let's say a session has expired, and the entire site
matches the first regex.
1: Attempted Fetch: http://mysite.com/page1.html, but my session has timed out.
2: Redirected to: http://mysite.com/session-timeout-message.html, which does
NOT have a form, but DOES have a link to a login page, login.cgi
3: Using rules I could tell MCF to Fetch: http://mysite.com/login.cgi
Here's the issue:
On login.cgi there is no form.
In Step 3 above, what I'd really want to do is say:
Fetch: http://mysite.com/login.cgi?username=me&password=hello
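For illustration only, the URL in that fetch can be assembled from the bare
login address plus the two credentials. This is a minimal Python sketch, not
MCF code: build_login_url is a hypothetical helper, and the credentials are
the placeholder values from the example above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def build_login_url(base_url, params):
    """Append params as a query string, the way a GET login request carries them.

    Hypothetical helper for illustration; not part of ManifoldCF.
    """
    parts = urlsplit(base_url)
    query = urlencode(params)
    # Keep any query string already present on the base URL.
    if parts.query:
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

login_url = build_login_url(
    "http://mysite.com/login.cgi",
    {"username": "me", "password": "hello"},  # placeholder credentials
)
print(login_url)
```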
From what you've said, I think I would either need to:
A: Keep step 3, and add a step 4 with the parameters
or
B: Modify step 3 to include arguments
I'm assuming Method A is closer to what you described:
Method A:
* (while on mysite.com/session-timeout-message.html which has a link to
login.cgi)
3: (same as above, matching timeout-msg.html) Tell MCF to Fetch:
http://mysite.com/login.cgi
4: (new, matching login.cgi) Tell MCF that the form name is ^$, and that the
parameters are username=me and password=hello.
The only issue here is that, since there is no form on login.cgi, there's no
"method=GET" to inherit from.
Is this closer to what you were saying? And as I said, I need to verify that
this is exactly what's happening.
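To make Method A concrete, here is a toy model of the two rules in Python. It
is only a sketch of how I picture the matching, not how MCF implements it:
LoginRule, its field names, the page-type strings and matching_rule are all
made up for this example. Each rule pairs a URL regex with a second regex
that is tried against the link targets or form names found on the page.

```python
import re
from dataclasses import dataclass, field

@dataclass
class LoginRule:
    """Hypothetical stand-in for one login-sequence entry (not an MCF class)."""
    url_regex: str      # which fetched URL the rule applies to
    page_type: str      # "link" or "form"; names assumed for this sketch
    target_regex: str   # matched against link targets or form names
    params: dict = field(default_factory=dict)

RULES = [
    # Step 3: on the timeout page, follow the link to login.cgi.
    LoginRule(r"session-timeout-message\.html$", "link", r"login\.cgi$"),
    # Step 4: on login.cgi, match a form whose name is empty and supply parameters.
    LoginRule(r"login\.cgi$", "form", r"^$", {"username": "me", "password": "hello"}),
]

def matching_rule(rules, url, candidates):
    """Return (rule, candidate) for the first rule whose URL regex matches url
    and whose target regex matches one of the candidates, else None."""
    for rule in rules:
        if not re.search(rule.url_regex, url):
            continue
        for candidate in candidates:
            if re.search(rule.target_regex, candidate):
                return rule, candidate
    return None

# Step 3 fires: the timeout page exposes a link to login.cgi.
step3 = matching_rule(RULES, "http://mysite.com/session-timeout-message.html",
                      ["http://mysite.com/login.cgi"])
# Step 4 fires if login.cgi has an unnamed form (name "")...
step4_unnamed = matching_rule(RULES, "http://mysite.com/login.cgi", [""])
# ...but in this toy model it cannot fire when the page has no form at all.
step4_no_form = matching_rule(RULES, "http://mysite.com/login.cgi", [])
```

In this model the last case is exactly my worry: with no form on login.cgi
there is nothing for ^$ to match against, and no method to inherit. Whether
MCF behaves the same way is what I need to verify.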
WRT Coding:
If more code needs to be written, I wasn't necessarily bugging you to write it
(though you'd be faster at it!)
Sadly the sites are under NDA (eCommerce stuff). If I got completely stuck and
couldn't code my way out of it, and you still had time to volunteer, then maybe
we could talk about NDAs, but that's way over the line of what I'd expect from
another volunteer developer.
The good news is that this coding (with Manifold) is on my own time, in
frustration with the legacy code, so although I couldn't share the specific
logins, the resulting code would be unencumbered. WebHarvest is interesting,
but it seems that if you want any threading, persistence or job management you
get to write it yourself, and thus MCF seems way more attractive. ;-)
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
> Key: CONNECTORS-275
> URL: https://issues.apache.org/jira/browse/CONNECTORS-275
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Documentation, Web connector
> Affects Versions: ManifoldCF 0.4
> Reporter: Karl Wright
>
> A book reader has this comment, which basically implies that we need to
> improve the documentation for the web connector:
> "I was excited to get the full version of the online book, but then
> disappointed when it referred back to the online doc for setting up logins
> for a Web spidering. The online doc is very vague and only gives one example.
> I've used Ultraseek's and Google's spider, but I still find the Session login
> sequences non-obvious.
> I've got a subscription request into the user mailing list, but here's the
> parts that are not clear.
> I generally understand about using regexes to define sites and sorting out
> content pages from login pages.
> But it's not clear why there are TWO regexes per entry. There's a "Login URL"
> regex, and also a "Form name/link target" regex.
> It's also not clear about the "page type" radio button choices.
> For "redirection", am I saying "look for a redirect event", or am I saying
> "then DO a redirect to this page".
> And for "form name", what if my login page doesn't have a named form? In the
> case of the site I'm trying to spider, when your session expires, you
> manually go back to an https page and supply your username and password as
> CGI parameters. I know this sounds odd, but it's apparently how a number of
> the sites we're trying to spider work, some proprietary software.
> Karl, I really think the book or Wiki or doc needs 3 or 4 different examples
> of login scenarios.
> Here's the scenario I'm trying, if you'd like to use it:
> Try to fetch: http://site.com/product?id=1234
> If you get a redirect to: http://site.com/Main.asp
> Note that there's no login form nor link on this page.
> Then invoke this login URL:
> https://site.com/validate?username=me&password=that&otherArg=something
> Note that you can't just visit this page and fill in a form; that gives an
> error. It has to be passed in (I think as a GET).
> Then record the session cookie and try for /product?id=1234 again.
> I realize this is odd, I didn't design it. "
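
The scenario quoted in the issue description can be sketched as a small
simulation. This is only an illustration in Python against an in-memory fake
site, not ManifoldCF code and not a real HTTP client: FakeSite,
fetch_with_login and the "session-1" cookie value are invented for the
example, and the assumption that a redirect to Main.asp means "session
expired" comes from the scenario text.

```python
from urllib.parse import urlsplit, parse_qs

class FakeSite:
    """In-memory stand-in for the site in the scenario (hypothetical)."""

    def __init__(self):
        self.valid_sessions = set()

    def get(self, url, cookie=None):
        """Return (status, location_or_body, set_cookie)."""
        parts = urlsplit(url)
        if parts.path == "/validate":
            args = parse_qs(parts.query)
            # Credentials arrive as GET parameters, as described in the scenario.
            if args.get("username") == ["me"] and args.get("password") == ["that"]:
                self.valid_sessions.add("session-1")
                return 200, "logged in", "session-1"
            return 403, "error", None
        if parts.path == "/product":
            if cookie in self.valid_sessions:
                return 200, "product page", None
            # No valid session: redirect to the main page.
            return 302, "http://site.com/Main.asp", None
        return 200, "main page", None

LOGIN_URL = "https://site.com/validate?username=me&password=that&otherArg=something"

def fetch_with_login(site, url):
    """Fetch url; on a redirect to Main.asp, invoke the login URL,
    keep the session cookie, and retry the original fetch once."""
    status, body, _ = site.get(url)
    if status == 302 and body.endswith("/Main.asp"):
        _, _, cookie = site.get(LOGIN_URL)
        status, body, _ = site.get(url, cookie=cookie)
    return status, body

result = fetch_with_login(FakeSite(), "http://site.com/product?id=1234")
```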