[ 
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159906#comment-13159906
 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

Hi Karl, this is Mark from the original book site post.  I appreciate your 
actions and will try to clarify your questions.

It's funny, once somebody understands how Manifold works, I'm sure rereading the 
doc would seem correct and "obvious" - this happens a lot with tech.  But I'm 
not there yet.

For myself, I've shifted from writing thorough doc to providing 
minimal doc and many examples.  Maybe I can contribute more to that when I get 
my sea legs with Manifold.

To your comments:
> It would be great to hear some clarification on why pages that obviously 
> would be needed
> for a user to log into this site using a browser do not exist.

Rereading my post I'm not sure which part this referred to.  But I think it 
might be related to this:

I think Manifold assumes that, when another login is needed, a redirect will be 
issued to a login form.  And, given that, we give it regexes to tell which 
redirects are normal content vs. logins.
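To show what I mean, here is a small Python sketch of my current mental model.  
This is not Manifold's actual code: the function name is made up, and the 
pattern is only a placeholder built from the Main.asp redirect in my original 
post.

```python
import re

# Placeholder pattern (an assumption for illustration, not a real Manifold setting).
LOGIN_URL_REGEX = re.compile(r"^https?://site\.com/Main\.asp$")


def classify_redirect(target_url):
    """Return "login" if the redirect target matches the login regex, else "content"."""
    if LOGIN_URL_REGEX.search(target_url):
        return "login"
    return "content"


print(classify_redirect("http://site.com/Main.asp"))
print(classify_redirect("http://site.com/product?id=1234"))
```

If that model is right, everything hinges on the redirect target itself being 
recognizable as part of a login sequence.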

But in this system, when the session is expired, it doesn't do a redirect to 
that login form.  I think it gives some other warning page, maybe with a link, 
which might be a javascript link, not sure.

So, when we know a session has expired, we need to tell Manifold the literal 
URL to go back to.  I'm thinking now that Manifold just doesn't support that 
function at this time?  So maybe the "how do I configure this" is that "you 
don't!" (currently).  I've downloaded the code and am starting to poke around in 
Eclipse, to maybe extend it.
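
Roughly, the behavior I'm after looks like the Python sketch below.  It is only 
an illustration of the flow, not a proposal for how Manifold should implement 
it.  The URLs and arguments are the placeholders from the scenario in my 
original post (quoted further down), and fetch_with_relogin and its fetch 
argument are hypothetical names.

```python
from urllib.parse import urlencode

# Placeholders taken from the scenario in the original post.
EXPIRED_PAGE = "http://site.com/Main.asp"
LOGIN_URL = "https://site.com/validate"
LOGIN_ARGS = {"username": "me", "password": "that", "otherArg": "something"}


def fetch_with_relogin(url, fetch):
    """Fetch url; if the session has expired, log in via the literal URL and retry once.

    fetch is any callable that takes a URL and returns (final_url, body).  It is
    assumed to keep its own cookie jar, so the session cookie set by the login
    request is sent again on the retry.
    """
    final_url, body = fetch(url)
    if final_url != EXPIRED_PAGE:
        return body
    # Session expired: call the literal login URL, credentials passed as GET arguments.
    fetch(LOGIN_URL + "?" + urlencode(LOGIN_ARGS))
    final_url, body = fetch(url)
    if final_url == EXPIRED_PAGE:
        raise RuntimeError("login did not restore the session")
    return body
```

In a real crawler, fetch could be built on urllib.request with an 
http.cookiejar.CookieJar, returning the response's final URL and body.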

Although this may sound like an odd edge case, it appears to be quite common 
with the class of sites we're dealing with.

The rest of your comments made sense, and I'm incorporating them.

Here are some other specific thoughts on the doc, giving you the newb 
perspective.  I'm being specific not to badger, but to capture particular 
areas of confusion, and hopefully to provide more actionable items than just 
"needs better doc".

There were two "sets" of items that could use a bit more narration.  This is 
already started in the current doc, and you've expanded it above, but I'll 
enumerate it here.

In the UI there are three "Regex" labels, literally:
1: "URL regular expression"
2: "Login URL regular expression"
3: "Form name/link target regular expression"

These are similar enough to confuse us newbs, and again you've addressed some 
of this above.

Then there are three Page types:
1: "Form name"
2: "Link target"
3: "Redirection"

These are already mentioned in the doc, I realize that.  But exactly how the 
items from the two lists interact, and in which combinations, is murky to us newbs.

I had already asked about "what if there's no form", which you've now answered.  
Just to confirm: even if there's no form, it'll still use the name/value pairs?

I'm also not sure how you'd tell the system to use a GET vs. a POST, when 
submitting a form.
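
To be clear about the distinction I mean, here is a small Python sketch using 
only the standard library.  It says nothing about how Manifold decides between 
the two; the URL and values are placeholders from my original post, and no 
request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

pairs = {"username": "me", "password": "that"}  # placeholder values
encoded = urlencode(pairs)

# GET: the name/value pairs travel in the query string; there is no request body.
get_request = Request("https://site.com/validate?" + encoded)

# POST: the same pairs, but sent as the request body.
post_request = Request("https://site.com/validate", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

The site I'm dealing with appears to accept only the first form.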

A further complication is that, on many of the sites, some of the "redirects" 
and other actions are done with Javascript.  I haven't gone far enough into the 
doc to see if/how that's handled.  The previous instance of the app we're 
rewriting was using WebHarvest, which seemed to have a single "magic" boolean 
flag for handling some of the Javascript appropriately, though I don't know the 
mechanics of it.

Look forward to continuing the dialog, thanks again Karl!
Mark
                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>
> A book reader has this comment, which basically implies that we need to 
> improve the documentation for the web connector:
> "I was excited to get the full version of the online book, but then 
> disappointed when it referred back to the online doc for setting up logins 
> for a Web spidering. The online doc is very vague and only gives one example. 
> I've used Ultraseek's and Google's spider, but I still find the Session login 
> sequences non-obvious.
> I've got a subscription request into the user mailing list, but here's the 
> parts that are not clear.
> I generally understand about using regexes to define sites and sorting out 
> content pages from login pages.
> But it's not clear why there's TWO Regex's per entry. There's a "Login URL" 
> regex, and also a "Form name/link target" regex.
> It's also not clear about the "page type" radio button choices.
> For "redirection", am I saying "look for a redirect event", or am I saying 
> "then DO a redirect to this page".
> And for "form name", what if my login page doesn't have a named form? In the 
> case of the site I'm trying to spider, when your session expires, you 
> manually go back to an https page and supply your username and password as 
> CGI parameters. I know this sounds odd, but it's apparently how a number of 
> the sites we're trying to spider work, some proprietary software.
> Karl, I really think the book or Wiki or doc needs 3 or 4 different examples 
> of login scenarios.
> Here's the scenario I'm trying, if you'd like to use it:
> Try to fetch: http://site.com/product?id=1234
> If you get a redirect to: http://site.com/Main.asp
> Note that there's no login form nor link on this page.
> Then invoke this login URL: 
> https://site.com/validate?username=me&password=that&otherArg=something
> Note that you can't just visit this page and fill in a form, that gives an 
> error, it has to be passed in (I think as a GET)
> Then record the session cookie and try for /product?id=1234 again.
> I realize this is odd, I didn't design it. "

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
