[ 
https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089128#comment-13089128
 ] 

Karl Wright commented on CONNECTORS-243:
----------------------------------------

Looking at this further, there are a number of headers that should not be 
included in metadata.  For example, you would not want to include anything 
authentication-related or session-related.  Any transient information should 
also be excluded, since including it would leave ManifoldCF unable to avoid 
refetching the document on each job run.  Here's the list of exclusions I've 
come up with so far:

Age
WWW-Authenticate
Proxy-Authenticate
Date
Set-Cookie
Via

Any I've missed?
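
For illustration, here is a minimal sketch of how the exclusion list above 
could be applied before headers are passed along as metadata.  The class and 
method names (HeaderMetadataFilter, filter) are hypothetical and are not 
existing ManifoldCF code; the sketch also assumes the headers arrive as a map 
from header name to a list of values.  Names are compared in lowercase 
because HTTP header names are case-insensitive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class HeaderMetadataFilter {

    // The exclusions listed above, stored in lowercase so that lookups
    // are case-insensitive.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
        "age",
        "www-authenticate",
        "proxy-authenticate",
        "date",
        "set-cookie",
        "via"));

    private HeaderMetadataFilter() {
    }

    // Returns a copy of the given headers without the excluded ones,
    // preserving the original order and the original spelling of names.
    static Map<String, List<String>> filter(Map<String, List<String>> headers) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // Some APIs (for example java.net.URLConnection.getHeaderFields())
            // store the status line under a null key; skip it.
            if (name == null) {
                continue;
            }
            if (!EXCLUDED.contains(name.toLowerCase(Locale.ROOT))) {
                result.put(name, entry.getValue());
            }
        }
        return result;
    }
}
```

With this sketch, a response carrying Last-Modified, Date and Set-Cookie 
would yield metadata containing only Last-Modified.  Keeping the exclusions 
in one set means further headers can be added without touching the loop.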



> Web crawler must get the "Last-Modified" HTTP header and pass it as metadata 
> to output
> --------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-243
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-243
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Web connector
>    Affects Versions: ManifoldCF 0.2
>            Reporter: Jan Høydahl
>            Assignee: Karl Wright
>              Labels: last-modified
>
> Last-Modified is important in web search, as it may be used for (de)boosting 
> based on date.
> In fact, ManifoldCF should have the ability to parse any (or all) HTTP 
> headers from the source document and pass them on.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

