[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054343#comment-13054343
 ] 

Karl Wright commented on CONNECTORS-214:
----------------------------------------

The web connector already filters by mime type, but it does so using the mime 
types accepted by the output connection.  This makes some degree of sense, 
because the output system presumably determines which kinds of documents are 
acceptable for indexing.

This makes me wonder whether we'd be better off adding BOTH post-fetch, 
index-time URL filtering and mime-type filtering to the Solr output connector.  
Right now, the Solr output connector reports that it accepts all mime types, 
but we can readily put that under user control.  The downside of that approach 
is that some repository connectors don't even know the mime types of the 
documents they are crawling, so mime-type filtering would be superfluous and 
confusing with those connectors.  URL filtering, though, would always be 
appropriate.
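
As a rough illustration of the idea above, here is a minimal sketch of an 
index-time filter.  Everything in it is hypothetical: the class name 
PostFetchFilter, its methods, and its parameters are not part of the 
ManifoldCF API.  It also assumes that a document whose mime type is unknown 
(null) should pass the mime-type check, and that the accepted mime types are 
supplied in lower case.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical index-time filter; NOT part of the ManifoldCF API. */
final class PostFetchFilter {
    private final Set<String> acceptedMimeTypes; // lower case; empty = accept all
    private final List<Pattern> includeUrls;     // empty = include every URL
    private final List<Pattern> excludeUrls;     // any match rejects the URL

    PostFetchFilter(Set<String> acceptedMimeTypes,
                    List<Pattern> includeUrls,
                    List<Pattern> excludeUrls) {
        this.acceptedMimeTypes = acceptedMimeTypes;
        this.includeUrls = includeUrls;
        this.excludeUrls = excludeUrls;
    }

    /** URL filtering works for every connector, since every document has a URL. */
    boolean urlIndexable(String url) {
        for (Pattern p : excludeUrls) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        if (includeUrls.isEmpty()) {
            return true;
        }
        for (Pattern p : includeUrls) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    /** Assumption: a null mime type (connector does not know it) is accepted. */
    boolean mimeTypeIndexable(String mimeType) {
        if (mimeType == null || acceptedMimeTypes.isEmpty()) {
            return true;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim().toLowerCase(Locale.ROOT);
        return acceptedMimeTypes.contains(base);
    }

    /** A document is indexed only if it passes both checks. */
    boolean shouldIndex(String url, String mimeType) {
        return urlIndexable(url) && mimeTypeIndexable(mimeType);
    }
}
```

Because a check like this would run when a document is handed to the output 
connection rather than when it is fetched, an HTML page could still be fetched 
and have its links followed even if it is not indexed itself.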


> Add post-extraction inclusions and exclusions into the web connector
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-214
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-214
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2
>            Reporter: Erlend Garåsen
>            Assignee: Erlend Garåsen
>             Fix For: ManifoldCF next
>
>
> If HTML files are excluded for a job, links in these files will not be 
> followed. If we add inclusion and exclusion filters that are applied after 
> extraction, it will be possible to fetch only certain types of documents, 
> such as PDFs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

