[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

Karl Wright (JIRA) Thu, 13 Feb 2014 05:28:44 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900316#comment-13900316
 ]


Karl Wright commented on CONNECTORS-850:
----------------------------------------

So, just to be clear, the Web Connector tries to ignore headers that would 
change on every fetch.  Not ignoring these would make incremental crawling 
essentially meaningless, because every document would look changed on every 
fetch.  Here's the set of headers that are ignored right now:

{code}
  // Reserved headers
  protected static Map<String,String> reservedHeaders;
  static
  {
    reservedHeaders = new HashMap<String,String>();
    reservedHeaders.put("age","age");
    reservedHeaders.put("www-authenticate","www-authenticate");
    reservedHeaders.put("proxy-authenticate","proxy-authenticate");
    reservedHeaders.put("date","date");
    reservedHeaders.put("set-cookie","set-cookie");
    reservedHeaders.put("via","via");
  }
{code}

It may be that other headers need to be added to this list.  If you find that 
to be the case, please recommend additions and we'll add them.
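
To illustrate why volatile headers have to be skipped, here is a minimal 
sketch of a header-based change check.  This is not the Web Connector's actual 
code: the class name HeaderFingerprint and the method fingerprint are 
hypothetical, and only the six header names are taken from the list above.  
The assumption is that a document counts as unchanged when the fingerprint of 
its remaining headers is identical between two fetches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class HeaderFingerprint {
  // Same header names as in the reserved list quoted above.
  static final Set<String> RESERVED = new HashSet<String>(Arrays.asList(
    "age", "www-authenticate", "proxy-authenticate", "date", "set-cookie", "via"));

  // Builds a stable string from all headers that are not reserved.
  static String fingerprint(Map<String,String> headers) {
    // A sorted map makes the result independent of header order.
    SortedMap<String,String> kept = new TreeMap<String,String>();
    for (Map.Entry<String,String> e : headers.entrySet()) {
      // Header names are case-insensitive, so compare in lower case.
      String name = e.getKey().toLowerCase(Locale.ROOT);
      if (!RESERVED.contains(name)) {
        kept.put(name, e.getValue());
      }
    }
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String,String> e : kept.entrySet()) {
      sb.append(e.getKey()).append(':').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }
}
```

With this sketch, two fetches that differ only in Date or Age yield the same 
fingerprint, while a change in a header such as ETag yields a different one.  
Any header that varies per fetch but is missing from the reserved set would 
make every fetch look like a change.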


> Maximum interval in dynamic crawling
> ------------------------------------
>
>                 Key: CONNECTORS-850
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.4.1
>            Reporter: Florian Schmedding
>            Assignee: Karl Wright
>            Priority: Minor
>              Labels: features
>             Fix For: ManifoldCF 1.5
>
>
> Currently, the dynamic crawling method used for a continuous job extends the 
> reseed and recrawl intervals when no changes are found in a checked document. 
> However, it should be possible to restrict this extension to a maximum value 
> in order to make sure that new documents are discovered within a certain 
> interval.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
