Generic SMB errors we can deal with differently, yes. Non-existent/unreadable columns in JDBC sound much more fatal to me. Indexing errors in Solr because of non-ASCII characters sound, frankly, like a true three-alarm fire, and we wouldn't want to just ignore those.
Karl

On Thu, Jun 6, 2019 at 5:37 AM Julien Massiera <[email protected]> wrote:

> Hi Karl,
>
> Sure, not all errors are the same, and we cannot deal with OOM errors the
> same way as with a "file no longer exists" error, for example.
>
> The classes of errors that are triggering frequent job abortions are
> generic errors like:
> - SMBException errors for the Windows Share connector
> - problematic/non-existent/unreadable columns/blobs for the JDBC connector
> - more recently, insertion errors with the Solr output connector for
>   documents containing metadata with non-ASCII characters (the errors
>   occurred with Chinese/Japanese characters). The error mentioned a bad
>   HTTP request header, so most probably a 4xx/5xx HTTP error.
>
> Do you think we can work out something to postpone/skip these classes of
> errors? That would be great!
>
> Regards,
> Julien
>
> On 05/06/2019 23:29, Karl Wright wrote:
> > Please let me note that there are *tons* of errors you can get when
> > crawling, from database errors to out-of-memory conditions to the actual
> > ones you care about, namely errors accessing the repository. It is
> > crucial that the connector code separate these errors into those that
> > are fatal, those that can be retried, and those that indicate that the
> > document should be skipped. It is simply not workable to insist that all
> > errors are the same.
> >
> > The difficulty comes in what the default behavior should be for classes
> > of errors that we've never seen before. I'm perfectly fine with trying
> > to establish such a policy, as you suggest in approach 1, for general
> > classes of errors that are seen. But once again, we need to catalog
> > these and enumerate at least what those classes are. That's necessary on
> > a connector-by-connector basis.
> >
> > The "brute force" approach of simply accepting all errors and continuing
> > no matter what will not work, because really it's the same problem, and
> > the same bit of information is needed to properly implement this.
> > There's no shortcut, I'm afraid.
> >
> > Please let me know which errors you are seeing and for which connector,
> > and let's work out how we handle them (or similar ones).
> >
> > Karl
> >
> >
> > On Wed, Jun 5, 2019 at 10:41 AM Julien Massiera <
> > [email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> I don't know about other MCF users, but we have many use cases where we
> >> need to crawl several million documents from different kinds of
> >> repositories. With those, we sometimes have difficulty managing the
> >> situation when crawl jobs suddenly stop because of problematic files
> >> that can only be filtered out to keep the job from aborting.
> >>
> >> From past discussions on the mailing list, I think that from your point
> >> of view it is preferable to stop a job when it encounters an unknown
> >> and/or unexpected issue (or after several failed retries), in order to
> >> be aware of the issue and fix it.
> >>
> >> Although I can understand your point of view, I do not think it covers
> >> the full range of MCF behaviors expected in production.
> >> As a matter of fact, we have encountered several scenarios where
> >> customers would prefer an approach where the crawl keeps moving on,
> >> while still giving us the possibility to investigate any file that may
> >> have been skipped. (One of the arguments is that jobs are sometimes
> >> started on Friday evenings, and if a job aborts during the weekend, we
> >> lose at worst 60 hours of crawling before the admin can check the
> >> status of the job.)
> >>
> >> Yet as of now this is not feasible, as jobs end up aborting when they
> >> encounter problematic files that are not clearly identified.
> >>
> >> We have brainstormed internally, and we have a proposal which we think
> >> can satisfy both your view and ours, and which we hope you will find
> >> acceptable.
> >>
> >> Whenever a job encounters an error that is not clearly identified:
> >> 1. It immediately retries one time.
> >> 2. If the retry succeeds, the crawl moves on as usual.
> >> 3. If it fails, the job moves this document to the current end of the
> >> processing pipeline and crawls the remaining documents. It sets the
> >> attempt counter for this document to 2.
> >> 4. When the job encounters this document again, it tries again. If it
> >> succeeds, the crawl moves on as usual. If it fails, the job moves the
> >> document to the current end of the processing pipeline, increments the
> >> attempt counter by 1, and doubles the delay before the next attempt.
> >> 5. We iterate until the maximum number of attempts for the problematic
> >> document has been reached. If it still fails, the crawl is aborted.
> >>
> >> With this behavior, a job is still aborted on critical errors, but at
> >> least we are able to crawl the maximum number of non-problematic
> >> documents before the failure.
> >>
> >> Another, more "direct", approach could be to simply add an optional
> >> parameter to a job: a "skip errors" checkbox. This parameter would tell
> >> the job to skip any encountered error, assuming we properly log the
> >> errors in the log files and/or the simple history, thus allowing us to
> >> debug later on.
> >>
> >> We would gladly welcome your thoughts on these two approaches.
> >>
> >> Regards,
> >> Julien
> >>
> --
> Julien MASSIERA
> Director of Product Development
> France Labs – The Search experts
> Datafari – Winner of the 2018 Big Data trophy at the Digital Innovation
> Makers Summit
> www.francelabs.com
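
To make the fatal/retry/skip distinction Karl describes more concrete, here is a minimal, library-agnostic sketch of how a connector might encode that decision for the error classes mentioned in this thread. All names in it are illustrative, not ManifoldCF APIs (in ManifoldCF itself, if I remember correctly, retryable conditions are signaled via ServiceInterruption and fatal ones via ManifoldCFException), and the mapping itself is only an assumption based on the judgments expressed above.

// A rough sketch only: hypothetical names, not ManifoldCF classes.
import java.io.FileNotFoundException;
import java.net.SocketTimeoutException;
import java.sql.SQLException;

public class ErrorClassifier {

  // The three outcomes a connector has to choose between for any error.
  public enum Disposition { FATAL, RETRY, SKIP_DOCUMENT }

  // Decide what to do with an exception raised while processing one document.
  // The mapping mirrors the judgments in the thread and is per-connector
  // policy, not something universal.
  public static Disposition classify(Throwable e) {
    if (e instanceof OutOfMemoryError) {
      return Disposition.FATAL;            // environment problem, never a document problem
    }
    if (e instanceof FileNotFoundException) {
      return Disposition.SKIP_DOCUMENT;    // "file no longer exists": retrying will not help
    }
    if (e instanceof SocketTimeoutException) {
      return Disposition.RETRY;            // transient network/share hiccup, worth retrying
    }
    if (e instanceof SQLException) {
      return Disposition.FATAL;            // e.g. missing/unreadable column, per Karl's assessment
    }
    return Disposition.FATAL;              // unknown classes default to fatal until cataloged
  }
}

The last branch is where the policy discussion above lives: today an unknown error class effectively defaults to fatal, and approach 1 would replace that default with the requeue behavior sketched next.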

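And here is a sketch of the requeue-with-backoff behavior Julien proposes as approach 1. Again, every name is invented for illustration (RetryToEndOfQueuePolicy, DocumentProcessor, Entry, the 60-second base delay, and the abort-by-exception at the end are all assumptions, not existing ManifoldCF code).

// Hypothetical illustration of approach 1; not ManifoldCF code.
import java.util.ArrayDeque;
import java.util.Deque;

public class RetryToEndOfQueuePolicy {

  // Whatever actually fetches and indexes one document.
  public interface DocumentProcessor {
    void processOne(String documentId) throws Exception;
  }

  // A queued document plus its retry bookkeeping.
  private static final class Entry {
    final String documentId;
    int attempts = 0;            // failed attempts so far
    long delayMs = 60_000L;      // assumed base delay; doubled on every requeue
    long notBeforeMs = 0L;       // earliest time the next attempt is allowed
    Entry(String documentId) { this.documentId = documentId; }
  }

  private final Deque<Entry> queue = new ArrayDeque<>();
  private final int maxAttempts;

  public RetryToEndOfQueuePolicy(int maxAttempts) { this.maxAttempts = maxAttempts; }

  public void add(String documentId) { queue.addLast(new Entry(documentId)); }

  // Drain the queue, applying steps 1-5 from the proposal to every failure.
  public void run(DocumentProcessor processor) throws InterruptedException {
    while (!queue.isEmpty()) {
      Entry entry = queue.pollFirst();
      long wait = entry.notBeforeMs - System.currentTimeMillis();
      if (wait > 0) Thread.sleep(wait);                // honor the current backoff delay

      boolean ok = tryOnce(processor, entry);          // normal attempt
      if (!ok && entry.attempts == 1) {
        ok = tryOnce(processor, entry);                // step 1: one immediate retry
      }
      if (ok) continue;                                // steps 2/4: success, move on as usual

      if (entry.attempts >= maxAttempts) {
        // step 5: attempt budget exhausted -> surface the failure so the job aborts
        throw new IllegalStateException("Document " + entry.documentId
            + " still failing after " + entry.attempts + " attempts");
      }
      // steps 3/4: move the document to the back of the queue and double the delay
      entry.delayMs *= 2;
      entry.notBeforeMs = System.currentTimeMillis() + entry.delayMs;
      queue.addLast(entry);
    }
  }

  private static boolean tryOnce(DocumentProcessor processor, Entry entry) {
    try {
      processor.processOne(entry.documentId);
      return true;
    } catch (Exception e) {
      entry.attempts++;          // every failed attempt counts against the budget
      return false;              // the error itself should also be logged / recorded in the history
    }
  }
}

Whether each per-document failure is also written to the log files and the simple history so it can be investigated later, as approach 2 assumes, is an orthogonal logging concern.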