[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Schuch updated CONNECTORS-1571:
--------------------------------------
    Description:
The Web Crawler Connector extracts the MIME type from the response's Content-Type header.

Then it strips any {{charset=whatever_encoding}} parameter and lets the pipeline check, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.

When sending the actual {{RepositoryDocument}}, it sets the full MIME type (with the charset) in the document. This is not a major bug, but a small inconsistency, since the HttpPoster of the Solr Output Connector performs a "hard" check of the MIME type again, which can have a different outcome than the preceding check activity.

I think this was introduced or (better) revealed with CONNECTORS-1482.

Example:
- In my scenario a crawled webpage has Content-Type {{text/html; charset=utf-8}}
- the {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
- the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}

  was:
The Web Crawler Connector extracts the MIME type from the response's Content-Type header.

Then it strips any {{charset=whatever_encoding}} parameter and lets the pipeline check, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.

When sending the actual {{RepositoryDocument}}, it sets the full MIME type (with the charset) in the document. This is not a major bug, but a small inconsistency, since the HttpPoster of the Solr Output Connector performs a "hard" check of the MIME type again, which can have a different outcome than the preceding check activity.

I think this was introduced or (better) revealed with CONNECTORS-1482.

> Web Crawler Connector checks different MIME type than it is sending down the pipeline
> -------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1571
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Markus Schuch
>            Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response's Content-Type header.
> Then it strips any {{charset=whatever_encoding}} parameter and lets the pipeline check, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.
> When sending the actual {{RepositoryDocument}}, it sets the full MIME type (with the charset) in the document. This is not a major bug, but a small inconsistency, since the HttpPoster of the Solr Output Connector performs a "hard" check of the MIME type again, which can have a different outcome than the preceding check activity.
> I think this was introduced or (better) revealed with CONNECTORS-1482.
> Example:
> - In my scenario a crawled webpage has Content-Type {{text/html; charset=utf-8}}
> - the {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
> - the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}
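
To make the mismatch from the example concrete, here is a minimal standalone sketch. It does not use any ManifoldCF classes: {{MimeTypeCheckSketch}}, {{ALLOWED}}, {{stripParameters}} and {{exactMatch}} are hypothetical names, and the exact-match allow-list is only an assumption about how a "hard" check could behave, not a description of what HttpPoster actually does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MimeTypeCheckSketch {
    // Hypothetical allow-list standing in for the MIME types an output accepts.
    static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Drops parameters such as "; charset=utf-8" and normalizes case and whitespace.
    public static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Assumed "hard" check: exact string match against the allow-list.
    public static boolean exactMatch(String mimeType) {
        return mimeType != null && ALLOWED.contains(mimeType);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";

        // Early check sees the stripped value, the later check sees the full header.
        System.out.println("stripped value accepted: " + exactMatch(stripParameters(header)));
        System.out.println("full value accepted:     " + exactMatch(header));
    }
}
```

Under these assumptions the two checks disagree for the same document. Passing the same normalized value to both the check and the {{RepositoryDocument}} would be one way to keep them consistent, while the charset could be carried separately if it is still needed downstream.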