[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright updated CONNECTORS-1547:
------------------------------------
Fix Version/s: ManifoldCF 2.12
> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log,
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by
> the Document Filter transformation connector in the WebCrawler connector.
> To reproduce the issue on MCF out of the box:
> - Null output connector
> - Web repository connector
> - Job: DocumentFilter added which only accepts application/msword (doc/docx)
> documents
> The simple history does not mention the excluded documents (except for HTML
> documents). They have a fetch activity and nothing else (see
> simple_history_web.jpg).
> The excluded documents are visible only in the MCF log (with DEBUG verbosity
> on connectors):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java, line 904:
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
> fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
> activityResultCode = null;{code}
> The activityResultCode is null, which appears to be why no activity record
> is logged.
>
>
> If we configure the same job with a Local File system connector and the
> same Document Filter transformation connector, the simple history lists
> all the excluded documents (see simple_history_files.jpg). In that case the
> code sets a specific error code and an activity record is logged
> (class FileConnector, line 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
> {
>   errorCode = activities.EXCLUDED_MIMETYPE;
>   errorDesc = "Excluded because mime type ('"+mimeType+"')";
>   Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>   activities.noDocument(documentIdentifier,versionString);
>   continue;
> }{code}
>
> So I think the Web Crawler connector should behave the same way as
> FileConnector and explicitly mention all the documents excluded by the
> user.
>
> Best regards,
> Olivier
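
To make the requested behaviour concrete, here is a minimal, self-contained sketch of the difference between the two branches quoted above. It does not use the real ManifoldCF classes: ExcludedDocumentActivitySketch, History, ActivityRecord, processFetched and the reportExclusions flag are hypothetical stand-ins, the EXCLUDED_MIMETYPE literal is an assumed value, and the rule "a null result code records nothing" is only a model of what the report describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: these are stand-in types, not the real ManifoldCF classes.
class ExcludedDocumentActivitySketch {

    // Stand-in for the result code that the FileConnector snippet reads from
    // activities.EXCLUDED_MIMETYPE; the literal value here is an assumption.
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    // One row of a simplified "simple history".
    static final class ActivityRecord {
        final String url;
        final String resultCode;
        final String description;

        ActivityRecord(String url, String resultCode, String description) {
            this.url = url;
            this.resultCode = resultCode;
            this.description = description;
        }
    }

    // Simplified history. Assumption: a null result code means nothing is recorded.
    static final class History {
        final List<ActivityRecord> records = new ArrayList<>();

        void record(String url, String resultCode, String description) {
            if (resultCode == null) {
                return; // models the missing activity record described in the issue
            }
            records.add(new ActivityRecord(url, resultCode, description));
        }
    }

    // Returns true if the document is kept. When reportExclusions is false the
    // result code stays null (reported behaviour); when true, an explicit code
    // is set, mirroring the FileConnector branch.
    static boolean processFetched(String url, String contentType, List<String> acceptedTypes,
                                  History history, boolean reportExclusions) {
        if (!acceptedTypes.contains(contentType)) {
            String contextMessage = "it had the wrong content type ('" + contentType + "')";
            String activityResultCode = reportExclusions ? EXCLUDED_MIMETYPE : null;
            history.record(url, activityResultCode, "Excluded because " + contextMessage);
            return false;
        }
        history.record(url, "OK", "Document accepted");
        return true;
    }

    public static void main(String[] args) {
        List<String> accepted = Arrays.asList("application/msword");
        String url = "https://www.example.com/logo.png"; // hypothetical URL

        History current = new History();
        processFetched(url, "image/png", accepted, current, false);
        System.out.println("null result code: " + current.records.size() + " record(s)");

        History proposed = new History();
        processFetched(url, "image/png", accepted, proposed, true);
        System.out.println("explicit result code: " + proposed.records.size() + " record(s)");
    }
}
```

In this model the only change needed for the excluded document to show up in the history is a non-null result code in the wrong-content-type branch. Whether the actual fix in WebcrawlerConnector takes this form is not stated in the report.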
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)