Generic SMB errors we can deal with differently, yes.
Non-existent or unreadable columns in JDBC sound much more fatal to me.
Indexing errors in Solr because of non-ASCII characters frankly sound like a
true three-alarm fire, and we wouldn't want to just ignore those.
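
To make the distinction concrete, here is a rough standalone sketch of the
kind of per-connector triage I mean. The class, enum, and method names are
purely illustrative (this is not ManifoldCF's actual connector API); the
point is simply that each connector has to decide, per error class, whether
to retry, skip the document, or abort:

import java.io.FileNotFoundException;
import java.io.IOException;
import java.sql.SQLException;

public class ErrorTriageSketch {

  // Hypothetical dispositions a connector could map error classes onto.
  enum Disposition { RETRY_LATER, SKIP_DOCUMENT, ABORT_JOB }

  static Disposition classify(Throwable error) {
    if (error instanceof FileNotFoundException) {
      // The document disappeared between discovery and fetch:
      // skip it and move on.
      return Disposition.SKIP_DOCUMENT;
    }
    if (error instanceof IOException) {
      // Transient repository trouble (e.g. a flaky SMB share):
      // worth retrying after a delay.
      return Disposition.RETRY_LATER;
    }
    if (error instanceof SQLException) {
      // Missing/unreadable column or blob: usually points at a problem
      // with the job configuration itself, so fail loudly.
      return Disposition.ABORT_JOB;
    }
    // Anything unknown defaults to aborting so it gets investigated
    // rather than silently ignored.
    return Disposition.ABORT_JOB;
  }
}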

Karl


On Thu, Jun 6, 2019 at 5:37 AM Julien Massiera <
[email protected]> wrote:

> Hi Karl,
>
> sure, all errors are not the same and we cannot deal with OOM errors the
> same way as with a "file no longer exists" error, for example.
>
> The classes of errors that are triggering frequent job abortions are
> generic errors like:
> - SMBException errors for the win share connector
> - problematic/non-existent/unreadable columns or blobs for the JDBC
> connector
> - more recently, we noticed insertion errors with the Solr output
> connector for documents containing metadata with non-ASCII characters
> (the errors occurred with Chinese/Japanese chars). The error mentioned an
> HTTP bad request header, so most probably a 4xx/5xx HTTP error.
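>
> To illustrate the last case, the kind of pre-check we would like to be able
> to apply (or at least log) before such a document reaches the output
> connector could look like the rough sketch below; this is plain Java with
> hypothetical names, not actual MCF connector code:
>
> import java.util.Map;
>
> public class MetadataCheckSketch {
>
>   // Returns true if any metadata value contains a character outside the
>   // ASCII range (e.g. Chinese/Japanese characters), so the document can
>   // be logged or skipped instead of triggering an indexing error.
>   static boolean hasNonAsciiMetadata(Map<String, String> metadata) {
>     for (String value : metadata.values()) {
>       if (value != null && !value.chars().allMatch(c -> c < 128)) {
>         return true;
>       }
>     }
>     return false;
>   }
> }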
>
> Do you think we can work out something to postpone/skip these classes of
> errors? That would be great!
>
> Regards,
> Julien
>
> On 05/06/2019 23:29, Karl Wright wrote:
> > Please let me note that there are *tons* of errors you can get when
> > crawling, from database errors to out-of-memory conditions to the actual
> > ones you care about, namely errors accessing the repository.  It is
> > crucial that the connector code separate these errors into those that
> > are fatal, those that can be retried, and those that indicate that the
> > document should be skipped.  It is simply not workable to try to insist
> > that all errors are the same.
> >
> > The difficulty comes in what the default behavior is for certain classes
> > of errors that we've never seen before.  I'm perfectly fine with trying
> > to establish such a policy as you suggest in approach 1 for general
> > classes of errors that are seen.  But once again we need to catalog
> > these and enumerate at least what classes these are.  That's necessary
> > on a connector-by-connector basis.
> >
> > The "brute force" approach of simply accepting all errors and continuing
> no
> > matter what will not work, because really it's the same problem and the
> > same bit of information you'd need to properly implement this.  There's
> no
> > shortcut I'm afraid.
> >
> > Please let me know which errors you are seeing and for which connector,
> > and let's work out how we handle them (or similar ones).
> >
> > Karl
> >
> >
> > On Wed, Jun 5, 2019 at 10:41 AM Julien Massiera <
> > [email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> I don't know about other MCF users, but we have many use cases where we
> >> need to crawl several million documents from different kinds of
> >> repositories. With those, we sometimes have difficulty managing issues
> >> when crawl jobs suddenly stop because of problematic files that can only
> >> be filtered out to prevent the job from aborting.
> >>
> >> From past discussions on the mailing list, I think that from your point
> >> of view, it is preferable to stop a job when it encounters an unknown
> >> and/or unexpected issue (or after several failed retries), in order to
> >> be aware of this issue and fix it.
> >>
> >> Although I can understand your point of view, I do not think it
> >> represents the full range of expected MCF behaviors in production. As
> >> a matter of fact, we have encountered several scenarios where customers
> >> would prefer an approach where the crawl keeps moving on, while still
> >> giving us the possibility to investigate any file that may have been
> >> skipped (one of the arguments is that jobs are sometimes started on
> >> Friday evenings, and if a job aborts during the weekend, we lose at
> >> worst 60 hours of crawling before the admin can check the status of
> >> the job).
> >>
> >> Yet as of now, this is not feasible, as jobs end up aborting when they
> >> encounter problematic files that are not clearly identified.
> >>
> >> We have brainstormed internally, and we have a proposal which we think
> >> can reconcile your view and ours, and which we hope you will find
> >> acceptable:
> >>
> >> Whenever a job encounters an error that is not clearly identified:
> >> 1. It immediately retries one time;
> >> 2. If the retry succeeds, the crawl moves on as usual;
> >> 3. If it fails, the job moves this document to the current end of the
> >> processing pipeline and crawls the remaining documents. It sets the
> >> attempt counter for this document to 2.
> >> 4. When it encounters this document again, the job tries again. If it
> >> succeeds, the crawl moves on as usual. If it fails, the job moves the
> >> document to the current end of the processing pipeline again, increments
> >> the attempt counter by 1, and doubles the delay between two attempts.
> >> 5. We iterate until the maximum number of attempts for the problematic
> >> document has been reached. If it still fails, the crawl is aborted.
> >> With this behavior, a job is still ultimately aborted on critical errors,
> >> but at least we will be able to crawl as many non-problematic documents
> >> as possible before the failure.
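> >>
> >> To make the bookkeeping behind steps 3 to 5 concrete, here is a rough,
> >> standalone sketch (plain Java with hypothetical names and values, nothing
> >> MCF-specific) of the requeue-with-doubling-delay logic we have in mind:
> >>
> >> import java.util.Deque;
> >>
> >> public class RetryQueueSketch {
> >>
> >>   static final int MAX_ATTEMPTS = 5;  // hypothetical job setting
> >>
> >>   static class PendingDocument {
> >>     final String id;
> >>     int attempts = 2;          // initial try + the immediate retry
> >>     long delayMs = 60_000L;    // hypothetical initial delay
> >>     PendingDocument(String id) { this.id = id; }
> >>   }
> >>
> >>   // Called when a requeued document fails again.  Returns false when
> >>   // the maximum number of attempts is exceeded and the job must abort.
> >>   static boolean requeue(Deque<PendingDocument> queue, PendingDocument doc) {
> >>     doc.attempts++;                 // one more attempt has been used
> >>     if (doc.attempts > MAX_ATTEMPTS) {
> >>       return false;                 // critical: abort the crawl
> >>     }
> >>     doc.delayMs *= 2;               // double the delay before the next attempt
> >>     queue.addLast(doc);             // move it to the current end of the queue
> >>     return true;
> >>   }
> >> }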
> >>
> >> Another, more "direct" approach could be to simply have an optional
> >> parameter for a job: a "skip errors" checkbox. This parameter would tell
> >> a job to skip any encountered error. This is assuming we properly log
> >> the errors in the log files and/or in the simple history, thus allowing
> >> us to debug later on.
> >>
> >> We would gladly welcome your thoughts on these 2 approaches.
> >>
> >> Regards,
> >> Julien
> >>
> --
> Julien MASSIERA
> Director of Product Development
> France Labs – The Search experts
> Datafari – Winner of the 2018 Big Data trophy at the Digital Innovation
> Makers Summit
> www.francelabs.com
>
>
