[ https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054343#comment-13054343 ]
Karl Wright commented on CONNECTORS-214:
----------------------------------------

The web connector already filters by mime types, but it filters using the mime types accepted by the output connection. This makes some degree of sense because presumably the output system is the determinant for what kinds of documents are acceptable for indexing.

This makes me wonder whether we'd be better off adding BOTH post-fetch indexing URL filtering and mime-type filtering to the Solr output connector. Right now, the Solr output connector tells the world it accepts all mime types, but we can readily put that under user control. The downside of that approach is that some repository connectors don't even know the mime types of the documents they are crawling, and thus this feature would be superfluous and confusing with those connectors. URL filtering, though, would always be appropriate.

> Add post-extraction inclusions and exclusions into the web connector
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-214
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-214
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2
>            Reporter: Erlend Garåsen
>            Assignee: Erlend Garåsen
>             Fix For: ManifoldCF next
>
>
> If html files are excluded for a job, links in these files will not be
> followed. If we add inclusion and exclusion filters based on post-extraction,
> it will be possible to fetch only certain types of documents, such as PDFs.
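
For illustration only, here is a rough sketch of how the post-fetch URL and mime-type filtering discussed in the comment above might look if both checks sat in one place. Every name in it (PostFetchFilter, urlAllowed, mimeTypeAllowed, shouldIndex) is hypothetical and is not part of the ManifoldCF API. It also assumes that a document with an unknown mime type should pass the mime-type check, since some repository connectors cannot supply one.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; NOT a ManifoldCF class.
final class PostFetchFilter {
    private final List<Pattern> includes;         // empty list = include everything
    private final List<Pattern> excludes;         // empty list = exclude nothing
    private final Set<String> acceptedMimeTypes;  // empty set = accept all mime types

    PostFetchFilter(List<String> includeRegexes,
                    List<String> excludeRegexes,
                    Set<String> acceptedMimeTypes) {
        this.includes = compile(includeRegexes);
        this.excludes = compile(excludeRegexes);
        // Normalize to lower case so comparisons are case-insensitive.
        this.acceptedMimeTypes = acceptedMimeTypes.stream()
                .map(m -> m.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    private static List<Pattern> compile(List<String> regexes) {
        return regexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    // URL filtering applies to every connector, because every document has a URL.
    boolean urlAllowed(String url) {
        boolean included = includes.isEmpty()
                || includes.stream().anyMatch(p -> p.matcher(url).find());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).find());
        return included && !excluded;
    }

    // Assumption: a null mime type means the repository connector does not
    // know it, so the document is let through rather than rejected.
    boolean mimeTypeAllowed(String mimeType) {
        if (mimeType == null || acceptedMimeTypes.isEmpty()) {
            return true;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return acceptedMimeTypes.contains(base);
    }

    // A fetched document is indexed only if it passes both checks. The crawler
    // can still follow links in documents that fail this test.
    boolean shouldIndex(String url, String mimeType) {
        return urlAllowed(url) && mimeTypeAllowed(mimeType);
    }
}
```

With an include pattern of `\.pdf$` and an accepted mime type of `application/pdf`, an HTML page would still be fetched and its links followed, but only the PDFs it links to would pass shouldIndex and reach the index.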