[
https://issues.apache.org/jira/browse/CONNECTORS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318515#comment-16318515
]
Karl Wright commented on CONNECTORS-1482:
-----------------------------------------
The mime type exclusion is done as follows:
{code}
/** Detect if a mime type is indexable or not. This method is used by
 * participating repository connectors to pre-filter the number of
 * unusable documents that will be passed to this output connector.
 *@param outputDescription is the document's output version.
 *@param mimeType is the mime type of the document.
 *@return true if the mime type is indexable by this connector.
 */
@Override
public boolean checkMimeTypeIndexable(VersionContext outputDescription,
    String mimeType, IOutputCheckActivity activities)
  throws ManifoldCFException, ServiceInterruption
{
  getSession();
  if (useExtractUpdateHandler)
  {
    if (includedMimeTypes != null && includedMimeTypes.get(mimeType) == null)
      return false;
    if (excludedMimeTypes != null && excludedMimeTypes.get(mimeType) != null)
      return false;
    return true;
  }
  return acceptableMimeTypes.contains(mimeType.toLowerCase(Locale.ROOT));
}
{code}
Some things to note about this. First, you can only exclude mime types if you
are using the extracting update handler. With the standard handler only
acceptableMimeTypes is consulted, which explains why exclusions have no effect
there. Second, the include/exclude lookups are case sensitive (unlike the
acceptableMimeTypes check, which lower-cases the mime type first), which is a
problem in my opinion. That's easily fixed though. Third, this is used ONLY to
tell the upstream connector not to send the document, so it can potentially be
ignored if the upstream connector doesn't play along. A hard check really
ought to be added in HttpPoster.
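
As a rough illustration of the case-insensitivity fix, here is a minimal sketch. The class MimeTypeFilter and its method names are hypothetical and not part of ManifoldCF; the sketch assumes the include and exclude lists can be treated as plain sets of strings, and that normalizing both the configured entries and the incoming mime type with Locale.ROOT is acceptable. The same method could in principle be called from both checkMimeTypeIndexable and a hard check in HttpPoster.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not ManifoldCF code): case-insensitive include/exclude check. */
final class MimeTypeFilter {
  // A null set means that no list of that kind was configured.
  private final Set<String> included;
  private final Set<String> excluded;

  MimeTypeFilter(Collection<String> included, Collection<String> excluded) {
    this.included = normalizeAll(included);
    this.excluded = normalizeAll(excluded);
  }

  /** Lower-case a single mime type in a locale-independent way. */
  private static String normalize(String mimeType) {
    return mimeType.trim().toLowerCase(Locale.ROOT);
  }

  /** Lower-case every configured entry once, at construction time. */
  private static Set<String> normalizeAll(Collection<String> mimeTypes) {
    if (mimeTypes == null)
      return null;
    Set<String> result = new HashSet<>();
    for (String mimeType : mimeTypes)
      result.add(normalize(mimeType));
    return result;
  }

  /**
   * Mirrors the include/exclude logic quoted above, but compares
   * normalized values. The mime type is assumed to be non-null.
   */
  boolean isIndexable(String mimeType) {
    String key = normalize(mimeType);
    if (included != null && !included.contains(key))
      return false;
    if (excluded != null && excluded.contains(key))
      return false;
    return true;
  }
}
```

With an exclude list containing only "application/pdf", isIndexable("Application/PDF") returns false here, whereas the lookup quoted above would let that document through.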
> Mime type exclusion and document length exclusion in Solr output connector
> don't apparently work
> ------------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1482
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1482
> Project: ManifoldCF
> Issue Type: Bug
> Components: Lucene/SOLR connector
> Affects Versions: ManifoldCF 2.9
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 2.10
>
> Attachments: problem_documents_connector.png,
> problem_documents_connector_solr.png,
> problem_documents_connector_solr_stream_size.png
>
>
> See attached images. Setting exclusions apparently does not prevent
> documents with that mime type from being included. This may be because of
> regexp characters etc but it needs to be researched and documented at least.
> Also, the length limitation doesn't seem to be working either.