[
https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089128#comment-13089128
]
Karl Wright commented on CONNECTORS-243:
----------------------------------------
Looking at this further, there are a number of headers that would be bad to
include in metadata. For example, you would not want to include anything
authentication-related or session-related. Any transient information should
also be excluded, since values that change on every request would leave
ManifoldCF unable to avoid refetching the document on each job run. Here's
the list of exclusions I've come up with so far:
Age
WWW-Authenticate
Proxy-Authenticate
Date
Set-Cookie
Via
Any I've missed?
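
To make the proposal concrete, here is a rough sketch of how such an exclusion
list might be applied before headers are passed along as metadata. This is only
an illustration, not the connector's actual code: the class name
HeaderMetadataFilter and the method filter() are hypothetical, and the input is
assumed to be a plain map of header names to values, like the one returned by
java.net.URLConnection.getHeaderFields(). Names are compared case-insensitively
because HTTP header names are case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
public class HeaderMetadataFilter {
    // The exclusions listed in this comment, stored lower-cased so that
    // lookups can ignore the case used by the server.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
        "age", "www-authenticate", "proxy-authenticate",
        "date", "set-cookie", "via"));

    /** Returns a copy of the given headers without the excluded names. */
    public static Map<String, List<String>> filter(Map<String, List<String>> headers) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            if (name == null) {
                // URLConnection.getHeaderFields() stores the status line
                // under a null key; it is not a header, so skip it.
                continue;
            }
            if (!EXCLUDED.contains(name.toLowerCase(Locale.ROOT))) {
                kept.put(name, new ArrayList<>(entry.getValue()));
            }
        }
        return kept;
    }
}
```

With this sketch, a response carrying Last-Modified, Date and Set-Cookie would
yield metadata containing only Last-Modified. Keeping the exclusions in a single
set means additions to the list only touch one place.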
> Web crawler must get the "Last-Modified" HTTP header and pass it as metadata
> to output
> --------------------------------------------------------------------------------------
>
> Key: CONNECTORS-243
> URL: https://issues.apache.org/jira/browse/CONNECTORS-243
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Web connector
> Affects Versions: ManifoldCF 0.2
> Reporter: Jan Høydahl
> Assignee: Karl Wright
> Labels: last-modified
>
> Last-Modified is important in web search, as it may be used for (de)boosting
> based on date.
> In fact, ManifoldCF should have the ability to parse any (or all) HTTP
> headers from the source document and pass them on.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira