This is very different from the design you originally told me you were going to do.

Generally, using hopcounts for managing your documents is a bad practice; it is expensive and almost always yields unexpected results. You could have one job per seed, which means that all you have to do to make a seed go away is delete the job corresponding to it. If you have way too many seeds for that, you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, its documents could always be detected as deletions. Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE: under that scheme the deletion would have to be *detected*, and the repository will never tell you that somebody changed the job configuration.

So, two choices: (1) exactly one seed per job, or (2) don't use MODEL_ADD_CHANGE_DELETE.

Karl

On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> Thanks for your help Karl. So I think I'm converging on a design. First of all, per your recommendation, I've switched to scheduled crawl, and it executes as expected every minute with the "schedule window anytime" setting.
>
> My next problem is dealing with seed deletion. My upstream source actually has multiple "roots", i.e. each root has its own set of documents, and the delta API must be called once for each root. To deal with this, I'm specifying each root as a "seed document", and each such root/seed creates "contained_in" documents. It is also possible for a root to be deleted by a user of the upstream system.
>
> My job is defined with an accurate hopcount as follows:
>
>     "job": {
>       ... snip naming, scheduling, output connectors, doc spec ...
>       "hopcount_mode" to "accurate",
>       "hopcount" to json {
>         "link_type" to "contained_in",
>         "count" to 1
>       },
>       ...
>     }
>
> For each seed, in processDocuments I am doing:
>
>     activities.addDocumentReference("... doc identifier ...", seedDocumentIdentifier, "contained_in");
>
> and then this triggers processDocuments for each of those documents, as expected.
>
> How do I code the connector so that I can remove the documents that are now unreachable due to the deleted seed? I don't see any calls to `processDocuments` via the framework that would allow me to do this.
>
> Regards,
> Raman
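
A side note on the hopcount setup above: the "contained_in" link type named in the hopcount filter and in the addDocumentReference() call is normally a type the connector itself declares. A minimal sketch, assuming the connector extends BaseRepositoryConnector and using getRelationshipTypes(), the usual IRepositoryConnector method for this:

    // Declare the link type so the framework (and the job's hopcount filter)
    // knows about the "contained_in" relationship this connector emits.
    @Override
    public String[] getRelationshipTypes() {
      return new String[] { "contained_in" };
    }
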
> On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > Hi Raman,
> >
> > (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing.
> > (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers the week, 7x24. You can do this with one record that triggers on every day of the week and has a schedule window of 24 hours.
> >
> > Karl
> >
> > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given that we have a delta API, we thought this made sense, as the delta API is efficient and we don't need to wait an entire day for a scheduled job to run. I see that if I change the recrawl interval and max recrawl interval to 1 minute as well, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
> > >
> > > I just want to process the particular documents I specify on each iteration every 60 seconds -- no more, no less -- and yet I seem unable to build a connector that does this.
> > >
> > > If I move to a non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to schedule every minute with one schedule record. I would have expected schedules to just use cron expressions.
> > >
> > > If I move to design #2 in my OP and have one "virtual document" to avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?
> > >
> > > Regards,
> > > Raman
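
On the question of where to store the delta token state: with the first design, the seeding version string can carry it. Whatever addSeedDocuments() returns is handed back on the next run as lastSeedVersion, which is exactly how the original post describes using the token, so no extra table is needed. A minimal sketch, where deltaClient and DeltaResult are hypothetical stand-ins for the upstream delta API:

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption {
      // deltaClient and DeltaResult are hypothetical; the ManifoldCF calls are the
      // ones already shown in this thread.
      DeltaResult delta;
      if (lastSeedVersion == null || lastSeedVersion.length() == 0) {
        // First run: no token yet, so seed every document in the source system.
        delta = deltaClient.listEverything();
      } else {
        // Later runs: seed only what changed since the stored token.
        delta = deltaClient.changesSince(lastSeedVersion);
      }
      for (String id : delta.getAddedChangedAndDeletedIds()) {
        activities.addSeedDocument(id);
      }
      // Returning the new token persists it; the framework passes it back on the
      // next run as lastSeedVersion.
      return delta.getToken();
    }

The first run (empty lastSeedVersion) seeds everything; subsequent runs seed only what the delta API reports, including deletions, which is what the ADD_CHANGE_DELETE contract described later in the thread asks for.
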
> > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > >
> > > > You were saying that every minute addSeedDocuments is being called, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
> > > >
> > > > The reason I ask is that continuous crawling has its own very particular way of dealing with documents it has already crawled. It uses "exponential backoff" to schedule the next document crawl, and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
> > > >
> > > > Karl
> > > >
> > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> > > > >
> > > > >     @Override
> > > > >     public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > >                                    String lastSeedVersion, long seedTime, int jobMode)
> > > > >         throws ManifoldCFException, ServiceInterruption
> > > > >     {
> > > > >       // return the same 3 docs every time, simulating an initial load, and then
> > > > >       // these 3 docs changing constantly
> > > > >       System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > >       activities.addSeedDocument("100");
> > > > >       activities.addSeedDocument("110");
> > > > >       activities.addSeedDocument("120");
> > > > >       System.out.println("SEEDING DONE");
> > > > >       return null;
> > > > >     }
> > > > >
> > > > >     @Override
> > > > >     public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > >                                  Specification spec, IProcessActivity activities,
> > > > >                                  int jobMode, boolean usesDefaultAuthority)
> > > > >         throws ManifoldCFException, ServiceInterruption {
> > > > >       System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> > > > >       // for (String documentIdentifier : documentIdentifiers) {
> > > > >       //   activities.deleteDocument(documentIdentifier);
> > > > >       // }
> > > > >
> > > > >       // I've commented out all subsequent logic here, but adding the call to
> > > > >       //   activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> > > > >       // does not change anything
> > > > >     }
> > > > >
> > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output of this is:
> > > > >
> > > > >     -=-= SeedTime=1558733436082
> > > > >     -=--=-= PROCESS DOCUMENTS: [200]
> > > > >     -=--=-= PROCESS DOCUMENTS: [220]
> > > > >     -=--=-= PROCESS DOCUMENTS: [210]
> > > > >     -=-= SeedTime=1558733549367
> > > > >     -=-= SeedTime=1558733609384
> > > > >     -=-= SeedTime=1558733436082
> > > > >     etc.
> > > > >
> > > > > "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then never again, even though "SEEDING DONE" is printed every minute. If and only if I uncomment the for loop which deletes the documents does processDocuments get called again for those seed document ids.
> > > > >
> > > > > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as their status is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> > > > >
> > > > > Regards,
> > > > > Raman
> > > > >
> > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > For any given job run, all documents that are added via addSeedDocuments() should be processed. There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > > > > >
> > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > >
> > > > > > Karl
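
For reference, the connector model Karl suggests switching is declared by overriding getConnectorModel() in the connector class; the MODEL_* constants come from the IRepositoryConnector interface. The temporary change he proposes is a one-liner:

    // Temporarily declare MODEL_ADD_CHANGE instead of MODEL_ADD_CHANGE_DELETE,
    // as suggested above, to see whether seeding and processing behave with that model.
    @Override
    public int getConnectorModel() {
      return MODEL_ADD_CHANGE;   // was: MODEL_ADD_CHANGE_DELETE
    }
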
> > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > > > > >
> > > > > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > > > > >
> > > > > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > > > > >
> > > > > > > The version string is calculated by `processDocuments`. But after `addSeedDocuments` is called once for document A version 1, `processDocuments` is never called again for that document, even though it has been modified to version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > > > > >
> > > > > > > > Does this sound like what your code is doing?
> > > > > > >
> > > > > > > Yes, as far as we can tell, given the fact that `processDocuments` is only called once for any particular document identifier.
> > > > > > >
> > > > > > > > Karl
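
Putting the two points above together -- the version string is what signals a change, and deletions must be reported explicitly from processDocuments() -- a processDocuments() along these lines would satisfy that contract. This is only a sketch: fetchClient and RemoteDoc are hypothetical stand-ins for the upstream source, and it assumes the checkDocumentNeedsReindexing() helper on IProcessActivity; the other calls are the ones already quoted in this thread.

    @Override
    public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                                 Specification spec, IProcessActivity activities,
                                 int jobMode, boolean usesDefaultAuthority)
        throws ManifoldCFException, ServiceInterruption {
      for (String id : documentIdentifiers) {
        RemoteDoc doc = fetchClient.fetch(id);   // hypothetical upstream call
        if (doc == null) {
          // Gone upstream: tell the framework so the indexed copy is removed.
          activities.deleteDocument(id);
          continue;
        }
        // The version string must change whenever the document content changes.
        String versionString = doc.getModificationStamp();
        if (!activities.checkDocumentNeedsReindexing(id, versionString)) {
          continue;  // unchanged since the last successful ingestion
        }
        try (InputStream in = doc.openStream()) {   // java.io.InputStream
          RepositoryDocument rd = new RepositoryDocument();
          rd.setBinary(in, doc.getLength());
          activities.ingestDocumentWithException(id, versionString, doc.getUri(), rd);
        } catch (IOException e) {
          throw new ManifoldCFException("Error reading " + id, e);
        }
      }
    }
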
> > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API. Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get changes since that token was generated/returned.
> > > > > > > > >
> > > > > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > > > > >
> > > > > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > >
> > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > > > > >
> > > > > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > > > > >
> > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > > > > >
> > > > > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > > > > >
> > > > > > > > > Then, in "processDocuments" the new "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as well.
> > > > > > > > >
> > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Raman Gupta
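
For completeness, the second design described above might be sketched roughly as follows. Everything other than the ManifoldCF calls already quoted in this thread (addSeedDocument, addDocumentReference, deleteDocument) is a hypothetical name, and where the delta token is stored remains the open question raised higher up in the thread.

    // Sketch of the "virtual seed" design: stage 1 only triggers stage 2.
    private static final String VIRTUAL_ROOT = "virtual-root";   // hypothetical identifier

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption {
      activities.addSeedDocument(VIRTUAL_ROOT);   // the single "virtual" document
      return null;
    }

    @Override
    public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                                 Specification spec, IProcessActivity activities,
                                 int jobMode, boolean usesDefaultAuthority)
        throws ManifoldCFException, ServiceInterruption {
      for (String id : documentIdentifiers) {
        if (VIRTUAL_ROOT.equals(id)) {
          // Expand the virtual seed into the documents named by the delta call
          // (deltaClient and its result type are hypothetical).
          for (String childId : deltaClient.changesSinceStoredToken().getIds()) {
            activities.addDocumentReference(childId, id, "contained_in");
          }
          // As described above: delete the virtual document so the next seeding
          // pass causes it to be processed again.
          activities.deleteDocument(id);
        } else {
          // fetch and ingest the real document here
        }
      }
    }

Whether this behaves well in practice is exactly what the rest of the thread goes on to discuss.
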