Olivier Tavard created CONNECTORS-1547:
------------------------------------------
Summary: No activity record for excluded documents in
WebCrawlerConnector
Key: CONNECTORS-1547
URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
Project: ManifoldCF
Issue Type: Bug
Components: Web connector
Reporter: Olivier Tavard
Attachments: manifoldcf_local_files.log, manifoldcf_web.log,
simple_history_files.jpg, simple_history_web.jpg
Hi,
I noticed that there is no activity record logged for documents excluded by the
Document Filter transformation connector in the WebCrawler connector.
To reproduce the issue on MCF out of the box :
Null output connector
Web repository connector
Job :
- DocumentFilter added which only accepts application/msword (doc/docx)
documents
The simple history does not mention the excluded documents (except for HTML
documents). They have a fetch activity and nothing else (see
simple_history_web.jpg).
The excluded documents are only visible in the MCF log (with DEBUG verbosity
enabled for connectors) :
{code:java}
Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
(see manifoldcf_local_files.log)
The related code is in WebcrawlerConnector.java l.904 :
{code:java}
fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
activityResultCode = null;{code}
The activityResultCode is null.
If we configure the same job with a Local File system connector and the
same Document Filter transformation connector, the simple history lists all
the excluded documents (see simple_history_files.jpg),
and the code sets a specific error code and logs an activity record
(class FileConnector l. 415) :
{code:java}
if (!activities.checkMimeTypeIndexable(mimeType))
{
  errorCode = activities.EXCLUDED_MIMETYPE;
  errorDesc = "Excluded because mime type ('"+mimeType+"')";
  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
  activities.noDocument(documentIdentifier,versionString);
  continue;
}{code}
So I think the Web Crawler connector should behave the same way as
FileConnector and explicitly record all the documents excluded by the
user.
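To illustrate the expected behaviour, here is a minimal, self-contained sketch. It is not the actual connector code: the class ExcludedActivitySketch, the methods recordFetch and rejectContentType, the string value of EXCLUDED_MIMETYPE and the list used as a simple history are all hypothetical stand-ins. It also assumes that a null result code means no activity record is written, which is what the simple history suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types only; nothing here is a ManifoldCF class.
class ExcludedActivitySketch {
    // Stand-in for the framework's excluded-mime-type code (string value assumed).
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    // Assumed behaviour: a null result code means nothing reaches the history.
    static void recordFetch(List<String> history, String url,
                            String resultCode, String description) {
        if (resultCode == null) {
            return;
        }
        history.add(url + " | " + resultCode + " | " + description);
    }

    // Handles a rejected content type. With useResultCode == false this mimics
    // the current null code; with true it mimics the proposed behaviour.
    static void rejectContentType(List<String> history, String url,
                                  String contentType, boolean useResultCode) {
        String contextMessage = "it had the wrong content type ('" + contentType + "')";
        String activityResultCode = useResultCode ? EXCLUDED_MIMETYPE : null;
        recordFetch(history, url, activityResultCode, "Excluded because " + contextMessage);
    }

    public static void main(String[] args) {
        List<String> history = new ArrayList<>();
        String url = "https://example.com/logo.png"; // placeholder URL

        rejectContentType(history, url, "image/png", false);
        System.out.println("null result code: " + history.size() + " record(s)");

        rejectContentType(history, url, "image/png", true);
        System.out.println("explicit result code: " + history.size() + " record(s)");
    }
}
```

In the real connector the change would presumably be limited to replacing the null activityResultCode with an explicit code in the wrong-content-type branch; which constant to use there is for the maintainers to decide.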
Best regards,
Olivier