roel goovaerts created CONNECTORS-1598:
------------------------------------------

             Summary: session based authentication cannot register 401
                 Key: CONNECTORS-1598
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
    Affects Versions: ManifoldCF 2.12
            Reporter: roel goovaerts


Description:

Access to a specific domain is restricted in two ways: A) it is an intranet 
service, and B) access is based on an employee/customer profile.
For manifold to be authenticated there is a specific '{{domain}}/login' page 
with a form, where manifold was configured to enter its username and password. 
A session cookie is then set so manifold is authenticated to access all 
resources. If a request for a resource is not authenticated, the service 
returns a 401. When the service returns a 401, the actual content of the 
response includes the same form as is present in '{{domain}}/login'.

Problem:

The only way we have been able to configure manifold to be authenticated was by 
specifying session-based credentials AND providing '{{domain}}/login' as a 
seed in the job as well. The only other seed in the job is a sitemap.
This is of course not ideal, since it can easily happen that the seed for the 
sitemap gets processed first, which then results in a 401 on the sitemap and 
the job stops.
Another possible scenario with this configuration is that the cookie expires, 
all other resources return 401, and they get deleted from the index 
(elasticsearch). There is also another job (different language, same domain), 
and we have observed it using the cookie from the previous job.

Current session-based access credentials configuration:

--url regular expression : https://{{domain}}/
--login pages: 
---login url regexp : 'login'
---page type : form
---identification regexp is set to match the form-name
---form parameters are filled with the correct parameters

This is verified to work, but as I understand it this only works because the 
login page is part of the seeds, so it matches the url when manifold comes 
across it while crawling. There is no configuration yet which redirects (for 
example) to this page when manifold receives a 401.
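
The dependence on the seed list can be illustrated with a small sketch. It 
assumes the 'login url regexp' is applied as a search against each fetched 
URL; that matching rule and the host name intranet.example are assumptions 
made for illustration, not taken from the ManifoldCF source.

```python
import re

# 'login' is the login url regexp from the configuration above.
LOGIN_URL_REGEXP = re.compile(r"login")


def is_login_url(url: str) -> bool:
    """Return True if the URL would be treated as a login page.

    Assumes search (substring-style) semantics, which is an assumption.
    """
    return LOGIN_URL_REGEXP.search(url) is not None


# Hypothetical seeds: the login page and the sitemap.
seeds = [
    "https://intranet.example/login",
    "https://intranet.example/sitemap.xml",
]
matches = {url: is_login_url(url) for url in seeds}
```

Under this assumption the sitemap URL never matches the login url regexp, so 
a 401 on the sitemap is never associated with the login page.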

My goal was then to remove the login page from the seeds and configure the job 
so that each time a fetch returns a 401, manifold knows to go to the login 
page. In pseudocode:

--If authenticated
---process 
--else
---redirect to login
---retry resource
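
The pseudocode above could be sketched roughly as follows. This is only an 
illustration of the desired behaviour, not how ManifoldCF works; the names 
fetch_with_login, fake_fetch and fake_login, the cookie name "session" and 
its value are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Cookies = Dict[str, str]
Fetch = Callable[[str, Cookies], Tuple[int, str]]  # returns (status, body)
Login = Callable[[Cookies], bool]                  # True if login succeeded


def fetch_with_login(url: str, fetch: Fetch, login: Login,
                     cookies: Cookies, max_logins: int = 1) -> Tuple[int, str]:
    """Fetch url; on a 401, log in and retry instead of giving up."""
    logins = 0
    while True:
        status, body = fetch(url, cookies)
        if status != 401:
            return status, body          # authenticated (or another status)
        if logins >= max_logins or not login(cookies):
            return status, body          # still unauthenticated: report the 401
        logins += 1                      # logged in, retry the resource


# Hypothetical stand-ins for the real service, so the sketch runs offline.
def fake_fetch(url: str, cookies: Cookies) -> Tuple[int, str]:
    if cookies.get("session") == "abc":
        return 200, "content of " + url
    return 401, "<form name='login'>"


def fake_login(cookies: Cookies) -> bool:
    cookies["session"] = "abc"           # the login form sets the session cookie
    return True
```

With this shape an expired cookie is handled the same way as a missing one: 
the 401 triggers a fresh login and one retry, and only a failed login is 
reported to the caller as a 401.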

 

Based on the documentation here: 
https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
I tried a few different configurations. The first thing to notice is that, in 
the comparison table, 'page based authentication' only mentions 4xx and 
'session based authentication' only mentions 3xx.

At this time my biggest question is: are these response codes bound to the 
difference in settings between page and session based? As far as I have been 
able to see, whenever manifold receives a 401 it logs "ignoring url {{url}} 
because it failed to fetch (status=401, ..."
Am I not able to work with session based authentication when the service 
returns 401s?

 

Configuration attempts (all failed):
- For all attempts the login page was removed from the seeds.
- In general I have kept the above configuration of page type 'form', in case 
I was able to redirect manifold to this page.
- The list of content kinds that a web connection can recognize as a login page 
in the documentation includes the option "A page that has specific 
content on it, as described by a regular expression". As the description of 
this case suggests, I tried the page type 'content' setting, with the 
identification regexp set to '.*' for testing and an override url set to 
'{{domain}}/login'. My hope was that in this test the match-all regexp would 
redirect to the login page for every url it fetches.
- Since the content of a 401 also includes the same form as the login page, I 
tried page type 'form' with the identification regexp and override form 
parameters supplied just like above, only with the "login url regexp" set to 
'.*'. My hope was that the form could then be recognized on any page that is 
returned as a 401.

In both cases the only thing I could see is that manifold fetched the sitemap, 
received a 401 and logged "ignoring url {{url}} because it failed to fetch 
(status=401, ..."

Some questions:
- Is there anything to be done when manifold receives a 401?
- Is 4xx tied to page based authentication and 3xx tied to session based 
authentication?
- Is there some other configuration/logic that I am missing, that I could try 
out?



A minimal-effort solution would be a way to make manifold start at the login 
page and not do any crawling (most importantly no deleting) when it is unable 
to authenticate. Together with this, a way to remove the session cookie when 
the job is done would also be necessary, so that a later run does not fail 
because manifold is still using an old, expired cookie.

Side note: is there any way to make manifold not delete documents when it 
receives a 401?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
