[ 
https://issues.apache.org/jira/browse/ANY23-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224203#comment-13224203
 ] 

Lewis John McGibbney commented on ANY23-55:
-------------------------------------------

Hi Szymon. I don't know which regex pattern is being used here, but over in
Nutch we:
- skip URLs containing certain characters that indicate probable queries,
  e.g. ?*!@= (possibly the source of the problem, since the final URL in your
  redirect chain contains ? and =)
- skip URLs with a slash-delimited segment that repeats 3+ times, to break
  loops, e.g. .*(/[^/]+)/[^/]+\1/[^/]+\1/ (doesn't look like a probable cause
  here)
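
To make the two rules above concrete, here is a minimal Java sketch of such a
filter. The class and method names (UrlFilterSketch, accept) are made up for
illustration and are not Nutch's actual filter API; only the two patterns come
from the rules quoted above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: not the real Nutch URL filter classes.
final class UrlFilterSketch {
    // Rule 1: characters that usually indicate a query or session URL.
    private static final Pattern QUERY_CHARS = Pattern.compile("[?*!@=]");

    // Rule 2: a slash-delimited segment that repeats 3+ times (crawler loop).
    private static final Pattern REPEATED_SEGMENT =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    /** Returns true if the URL passes both rules, false if it would be skipped. */
    static boolean accept(String url) {
        if (QUERY_CHARS.matcher(url).find()) {
            return false;
        }
        return !REPEATED_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        // First and last URLs of the redirect chain reported in this issue.
        System.out.println(accept("http://purl.obolibrary.org/obo/IAO_0000030"));
        System.out.println(accept(
                "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030"));
    }
}
```

Under these rules the first URL of the chain is accepted, while the final
ontobee.org URL is skipped because it contains ? and =. Whether Any23 or the
Sindice crawler applies a comparable filter is exactly what is unknown here.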

Also, Nutch defaults to zero for the maximum number of redirects the fetcher
will follow when trying to fetch a page. If the value is zero or negative, the
fetcher won't immediately follow redirected URLs; instead, it records them for
later fetching.
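
As a rough illustration of that behaviour (in Nutch I believe the setting is
the http.redirect.max property), the sketch below simulates the policy over an
in-memory redirect table. RedirectPolicySketch, resolve and the "give up and
return null" convention are assumptions made for this example, not Nutch code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a "maximum redirects" policy; no network access.
final class RedirectPolicySketch {
    /**
     * Follows redirects starting at 'start'.
     * - maxRedirects <= 0: do not follow; record the target in 'deferred'
     *   for later fetching and return null.
     * - otherwise: follow at most maxRedirects hops; return the URL reached,
     *   or null if the chain is longer than allowed.
     */
    static String resolve(Map<String, String> redirects, String start,
                          int maxRedirects, List<String> deferred) {
        String current = start;
        int hops = 0;
        while (redirects.containsKey(current)) {
            String target = redirects.get(current);
            if (maxRedirects <= 0) {
                deferred.add(target);
                return null;
            }
            if (hops == maxRedirects) {
                return null; // chain too long: give up
            }
            current = target;
            hops++;
        }
        return current;
    }

    public static void main(String[] args) {
        // The redirect chain reported in this issue.
        Map<String, String> chain = new LinkedHashMap<>();
        chain.put("http://purl.obolibrary.org/obo/IAO_0000030",
                  "http://www.berkeleybop.org/ontologies/IAO_0000030");
        chain.put("http://www.berkeleybop.org/ontologies/IAO_0000030",
                  "http://purl.obolibrary.org/obo/IAO/about/IAO_0000030");
        chain.put("http://purl.obolibrary.org/obo/IAO/about/IAO_0000030",
                  "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030");

        List<String> deferred = new ArrayList<>();
        String start = "http://purl.obolibrary.org/obo/IAO_0000030";
        System.out.println(resolve(chain, start, 0, deferred));
        System.out.println(deferred);
        System.out.println(resolve(chain, start, 3, new ArrayList<>()));
    }
}
```

With a limit of 0 only the first redirect target is recorded and nothing is
fetched in that pass; the three-hop chain from this issue needs a limit of at
least 3 to reach the final page in one go.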

Can you confirm what kind of implementation the Sindice crawler is using? Does
it utilise the Any23 basic-crawler, or some other implementation?
                
> any23 is not following the redirection
> --------------------------------------
>
>                 Key: ANY23-55
>                 URL: https://issues.apache.org/jira/browse/ANY23-55
>             Project: Apache Any23
>          Issue Type: Bug
>         Environment: version 0.6.2-SNAPSHOT deployed currently at any23.org
>            Reporter: Szymon Danielczyk
>
> Here is a redirection pattern:
> http://purl.obolibrary.org/obo/IAO_0000030
> -> 302 Location=http://www.berkeleybop.org/ontologies/IAO_0000030
> http://www.berkeleybop.org/ontologies/IAO_0000030
> -> 303 Location=http://purl.obolibrary.org/obo/IAO/about/IAO_0000030
> http://purl.obolibrary.org/obo/IAO/about/IAO_0000030
> -> 302 Location=http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> -> 200 (this is the final, correct page)
> Any23 reports "no matching extractor found"
> for http://purl.obolibrary.org/obo/IAO_0000030,
> probably because it cannot follow a redirection at some stage.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
