[ 
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161170#comment-13161170
 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

> So the link from the timeout page sends you to login.cgi without any 
> parameters at all, and yet login.cgi requires parameters to perform the login?

I believe so, need to verify.  8 different sites = 8 slightly different 
behaviors.

> Or (I've seen this done before) when you go to http://mysite.com/login.cgi, 
> do you get the form at that time...

I wish!  At least on some sites, no.

And worse(!) on some sites, if you just go to login.cgi with no parms, you get 
a nasty error, like maybe a 500.

So that'd be another problem - to tell MCF to ignore even severe errors (so 
that we can have the 2 step rule)

> But how would that new login page actually work? Should it match the URL 
> regexp only, or should there be some other identifying characteristic on the 
> page itself?

Not sure I'm directly answering this....

But this might be where my habits with other spiders are different enough from 
MCF's that there's an implicit "unlearn *that*!" in my near future.

I'd classify MCF as a reactive pattern matcher.  It can do almost anything 
based on what it gets back.

Whereas I was thinking more proactive "IF you see url-A THEN GOTO 
arbitrary-url-B", where the ONLY place literal url-B exists is in the config 
screen.  In that scenario, where I can inject arbitrary new URLs via 
configuration, then to me it looks "easy".

In that scenario (arbitrary config injection) we solve all the problems at 
once.  A URL with ? arg=value & arg=value IS a GET, so no config there.  And I 
get to specify the args inline, in the URL.
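
To make the "IF you see url-A THEN GOTO url-B" idea concrete, here is a rough 
sketch in Python.  It is purely illustrative: RULES and next_fetch are names I 
made up, not anything that exists in MCF, and the URLs are the placeholder ones 
from the issue description.

```python
import re
from typing import Optional

# Hypothetical rule table (an assumption, not MCF configuration):
# each entry pairs a pattern for the URL we landed on with the literal
# URL to fetch next. The target exists only here, in configuration.
RULES = [
    (
        re.compile(r"^http://site\.com/Main\.asp$"),
        "https://site.com/validate?username=me&password=that&otherArg=something",
    ),
]


def next_fetch(seen_url: str, rules=RULES) -> Optional[str]:
    """Return the configured URL to GET after seeing seen_url, or None."""
    for pattern, target in rules:
        if pattern.search(seen_url):
            return target
    return None
```

A crawler built this way would call next_fetch() on whatever URL a redirect 
landed on, fetch the returned URL as a plain GET, keep the session cookie, and 
then retry the original page.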

This is inelegant as a general solution.  I can enumerate a few problems right here: 
What if it needed to be a POST after all?  What if my parameters are long and 
have spaces and need URL encoding - then I'd have to encode them manually.  
Editing 1.5k URLs in a 3 inch HTML web form is UGLY.  And what if I didn't know 
the exact URL, but I could calculate it based on some other state?
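
As a small illustration of the manual-encoding chore, here is roughly what a 
tool would have to do for me (a Python sketch using the standard library; 
build_login_url and the parameter values are invented for the example):

```python
from urllib.parse import urlencode


def build_login_url(base: str, params: dict) -> str:
    """Append params to base as a URL-encoded query string."""
    return base + "?" + urlencode(params)


# A password with spaces and an ampersand, which cannot be pasted
# into a URL as-is without breaking the query string.
login_url = build_login_url(
    "https://site.com/validate",
    {"username": "me", "password": "two words & more", "otherArg": "something"},
)
print(login_url)
```

Typing that encoded form by hand into a small text box is exactly the kind of 
thing that goes wrong.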

MCF's model handles all those other items in a much more general, re-usable 
way.  Whereas the special case of "I just need it to fetch this arbitrary 200 
character URL" almost seems like a degenerate use case which coincidentally has 
an easy fix.  And my only response to that, arguing both sides of the coin 
here, is that this might be a much more common "edge case" than a software 
architect might assume.

Do the last few paragraphs make sense?  And did it answer your question?

BTW Karl, this is probably the most detailed (and to me interesting) 
conversation I've had with anybody about the minutiae of URLs and logins in a 
while.  Normally I'd corral an engineer in front of a whiteboard, but this is 
more like how they used to play chess, via US Mail, kinda fun!



                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>
> A book reader has this comment, which basically implies that we need to 
> improve the documentation for the web connector:
> "I was excited to get the full version of the online book, but then 
> disappointed when it referred back to the online doc for setting up logins 
> for a Web spidering. The online doc is very vague and only gives one example. 
> I've used Ultraseek's and Google's spider, but I still find the Session login 
> sequences non-obvious.
> I've got a subscription request into the user mailing list, but here's the 
> parts that are not clear.
> I generally understand about using regexes to define sites and sorting out 
> content pages from login pages.
> But it's not clear why there's TWO Regex's per entry. There's a "Login URL" 
> regex, and also a "Form name/link target" regex.
> It's also not clear about the "page type" radio button choices.
> For "redirection", am I saying "look for a redirect event", or am I saying 
> "then DO a redirect to this page".
> And for "form name", what if my login page doesn't have a named form? In the 
> case of the site I'm trying to spider, when your session expires, you 
> manually go back to an https page and supply your username and password as 
> CGI parameters. I know this sounds odd, but it's apparently how a number of 
> the sites we're trying to spider work, some proprietary software.
> Karl, I really think the book or Wiki or doc needs 3 or 4 different examples 
> of login scenarios.
> Here's the scenario I'm trying, if you'd like to use it:
> Try to fetch: http://site.com/product?id=1234
> If you get a redirect to: http://site.com/Main.asp
> Note that there's no login form nor link on this page.
> Then invoke this login URL: 
> https://site.com/validate?username=me&password=that&otherArg=something
> Note that you can't just visit this page and fill in a form, that gives an 
> error, it has to be passed in (I think as a GET)
> Then record the session cookie and try for /product?id=1234 again.
> I realize this is odd, I didn't design it. "

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
