Alternative approaches for jobs aborting on problematic docs

Julien Massiera Wed, 05 Jun 2019 07:41:49 -0700

Hi Karl,

I don't know about other MCF users, but we have many use cases where we need to crawl several million documents from different kinds of repositories. With those, we sometimes have difficulty managing issues when crawl jobs suddenly stop because of problematic files, which can only be filtered out to keep the job from aborting.

From past discussions on the mailing list, I think that from your point of view, it is preferable to stop a job when it encounters an unknown and/or unexpected issue (possibly after several failed retries), so that we are aware of the issue and can fix it.

Although I can understand your point of view, I do not think it covers the full range of behaviors expected from MCF in production. As a matter of fact, we have several times encountered scenarios where customers would prefer an approach where the crawl tries to move on, while still giving us the possibility to investigate any file that may have been skipped. (One of the arguments is that jobs are sometimes started on Friday evenings, and if one aborts during the weekend, we lose at worst 60h of crawling before the admin can check the status of the job.)

Yet as of now, this is not feasible, as jobs end up aborting when they encounter problematic files that are not clearly identified.

We have brainstormed internally, and we have a proposal which we think can satisfy both your view and ours, and which we hope you will find satisfactory:


Whenever a job encounters an error that is not clearly identified:
1. It immediately retries one time;
2. If it succeeds, the crawl moves on as usual;
3. If it fails, the job moves this document to the current end of the processing pipeline, and crawls the remaining documents. It sets the attempt counter for this document to 2.
4. When encountering this document again, the job tries again. If it succeeds, the crawl moves on as usual. If it fails, it moves this document to the current end of the processing pipeline, increments the counter by 1, and doubles the delay between two attempts.
5. We iterate until the maximum number of attempts of the crawl for the problematic document has been reached. If it still fails, abort the crawl.

With this behavior, a job is still eventually aborted on critical errors, but at least we will be able to crawl as many non-problematic documents as possible before the failure.
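To make the retry proposal concrete, here is a minimal Python sketch of it. This is not MCF code: `crawl`, `process`, `wait`, `CrawlAborted`, `max_attempts` and `initial_delay` are hypothetical names chosen for illustration, and the value of the first delay is an assumption, since the proposal only says that the delay doubles.

```python
from collections import deque


class CrawlAborted(Exception):
    """Raised when a document still fails after the maximum number of attempts."""

    def __init__(self, doc_id, attempts, done):
        super().__init__(f"{doc_id} still failing after {attempts} attempts")
        self.doc_id = doc_id
        self.attempts = attempts
        self.done = done  # documents crawled successfully before the abort


def crawl(doc_ids, process, max_attempts=4, initial_delay=1.0,
          wait=lambda seconds: None):
    """Process every document, pushing failing ones to the end of the queue."""
    # Each entry: (document id, attempts made so far, delay before next attempt)
    queue = deque((doc_id, 0, 0.0) for doc_id in doc_ids)
    done = []
    while queue:
        doc_id, attempts, delay = queue.popleft()
        if attempts > 0:
            wait(delay)  # back off before retrying a requeued document
        # First encounter: one try plus one immediate retry (steps 1 and 2).
        # Later encounters: a single try (step 4).
        tries = 2 if attempts == 0 else 1
        succeeded = False
        for _ in range(tries):
            attempts += 1
            try:
                process(doc_id)
                succeeded = True
                break
            except Exception:
                pass  # a real job would log the error here
        if succeeded:
            done.append(doc_id)
        elif attempts >= max_attempts:
            # Step 5: the maximum is reached, so the job aborts.
            raise CrawlAborted(doc_id, attempts, done)
        else:
            # Steps 3 and 4: requeue at the current end and double the delay.
            next_delay = initial_delay if delay == 0.0 else delay * 2
            queue.append((doc_id, attempts, next_delay))
    return done
```

In this sketch the failing document is only retried after everything queued ahead of it has been handled, so the healthy documents are crawled before a possible abort. A real scheduler would presumably record a "not before" time instead of blocking in `wait`.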

Another, more "direct" approach could be to simply have an optional parameter for a job: a "skip errors" checkbox. This parameter would tell a job to skip any encountered error. This assumes we properly log the errors in the log files and/or in the simple history, thus allowing us to debug later on.
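As a rough illustration of this second approach, the sketch below shows a "skip errors" flag in Python. Again, this is not MCF code: `crawl_with_skip`, `process` and `skip_errors` are hypothetical names, and the standard `logging` module stands in for the log files and the simple history.

```python
import logging

logger = logging.getLogger("crawl")


def crawl_with_skip(doc_ids, process, skip_errors=False):
    """Process documents; with skip_errors set, log failures and move on."""
    processed = []
    skipped = []  # (document id, error message) pairs kept for later debugging
    for doc_id in doc_ids:
        try:
            process(doc_id)
            processed.append(doc_id)
        except Exception as exc:
            if not skip_errors:
                raise  # current behavior: the job stops on the error
            logger.error("Skipping document %s: %s", doc_id, exc)
            skipped.append((doc_id, str(exc)))
    return processed, skipped
```

With `skip_errors=False` the first error still stops the job, so the option stays opt-in for each job.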


We would gladly welcome your thoughts on these 2 approaches.

Regards,
Julien
