One seed per job is an interesting approach, but in the interests of fully understanding the alternatives, let me consider choice #2.
> you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion.

There are good reasons for me to prefer a single job, so how would I accomplish this? Should my connector create its own tables and manage this state there? Or is there another, more lightweight approach?

> Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.

That is fine and I understand completely -- I forgot to mention in my previous message that I've already switched to MODEL_ALL, and am detecting and providing the list of currently active seeds on every call to addSeedDocuments.

Regards,
Raman

On Mon, May 27, 2019 at 4:55 PM Karl Wright <daddy...@gmail.com> wrote:
>
> This is very different from the design you originally told me you were going to do.
>
> Generally, using hopcounts for managing your documents is a bad practice; this is expensive to do and almost always yields unexpected results. You could have one job per seed, which means all you have to do to make the seed go away is delete the job corresponding to it. If you have way too many seeds for that, you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion. Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
>
> So two choices: (1) Exactly one seed per job, or (2) don't use MODEL_ADD_CHANGE_DELETE.
>
> Karl
>
> On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > Thanks for your help Karl. So I think I'm converging on a design. First of all, per your recommendation, I've switched to a scheduled crawl and it executes as expected every minute with the "schedule window anytime" setting.
> >
> > My next problem is dealing with seed deletion. My upstream source actually has multiple "roots", i.e. each root has its own set of documents, and the delta API must be called once for each "root". To deal with this, I'm specifying each "root" as a "seed document", and each such root/seed creates "contained_in" documents. It is also possible for a "root" to be deleted by a user of the upstream system.
> >
> > My job is defined with an accurate hopcount as follows:
> >
> > "job": {
> >   ... snip naming, scheduling, output connectors, doc spec ...
> >   "hopcount_mode" to "accurate"
> >   "hopcount" to json {
> >     "link_type" to "contained_in"
> >     "count" to 1
> >   },
> >
> > For each seed, in processDocuments I am doing:
> >
> >   activities.addDocumentReference("... doc identifier ...", seedDocumentIdentifier, "contained_in");
> >
> > and then this triggers processDocuments for each of those documents, as expected.
> >
> > How do I code the connector such that I can remove the documents that are now unreachable due to the deleted seed? I don't see any calls to `processDocuments` via the framework that would allow me to do this.
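(To make choice #2 concrete, here is a minimal sketch of what I have in mind: under MODEL_ALL every run seeds every currently active root, and every child document identifier embeds the root it came from, so nothing beyond the identifier itself is needed to link a document back to its seed. The deltaClient helper and the "root:" / "rootId/docId" identifier scheme below are placeholders for illustration, not code from our actual connector.)

@Override
public int getConnectorModel() {
  // MODEL_ALL: every addSeedDocuments() call must list the complete set of live seeds.
  return MODEL_ALL;
}

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  // One seed per *currently active* root; roots deleted upstream are simply absent,
  // so everything underneath them becomes unreachable in this run.
  for (String rootId : deltaClient.listActiveRoots()) {
    activities.addSeedDocument("root:" + rootId);
  }
  return lastSeedVersion; // delta-token handling omitted here
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    if (id.startsWith("root:")) {
      String rootId = id.substring("root:".length());
      // Reference every child with an identifier that carries its root.
      for (String docId : deltaClient.listDocuments(rootId)) {
        activities.addDocumentReference(rootId + "/" + docId, id, "contained_in");
      }
    } else {
      // id is "<rootId>/<docId>", so the originating seed is always recoverable from it.
      // ... fetch, version, and ingest as usual ...
    }
  }
}

(If the framework sees the complete picture of reachable documents on every run, documents whose root has disappeared should fall out as deletions without the connector tracking any extra state of its own.)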
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > Hi Raman,
> > >
> > > (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing.
> > > (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers 7x24 times. So you can do this with one record, which triggers on every day of the week, that has a schedule window of 24 hours.
> > >
> > > Karl
> > >
> > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given we have a delta API, we thought this is what makes sense, as the delta API is efficient and we don't need to wait an entire day for a scheduled job to run. I see that if I change the recrawl interval and max recrawl interval also to 1 minute, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
> > > >
> > > > I just want to process the particular documents I specify on each iteration every 60 seconds -- no more, no less -- and yet I seem unable to build a connector that does this.
> > > >
> > > > If I move to a non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to just schedule every minute with one schedule. I would have expected schedules to just use cron expressions.
> > > >
> > > > If I move to design #2 in my OP and have one "virtual document" to just avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > >
> > > > > You were saying that every minute addSeedDocuments is being called, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
> > > > >
> > > > > The reason I ask is because continuous crawling has very unique kinds of ways of dealing with documents it has crawled. It uses "exponential backoff" to schedule the next document crawl, and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
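(On my question above about where to keep the delta token: as I understand it, the framework itself persists whatever String addSeedDocuments() returns and hands it back as lastSeedVersion on the next seeding pass, so no connector-owned table should be needed. A rough sketch -- deltaClient and DeltaResult are hypothetical wrappers around our delta API:)

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  // First run: no token yet, so list everything; later runs: only the changes.
  DeltaResult delta = (lastSeedVersion == null || lastSeedVersion.length() == 0)
      ? deltaClient.fullListing()
      : deltaClient.changesSince(lastSeedVersion);
  for (String id : delta.affectedDocumentIds()) {
    activities.addSeedDocument(id);
  }
  // The framework stores this and passes it back as lastSeedVersion next time.
  return delta.nextToken();
}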
> > > > >
> > > > > Karl
> > > > >
> > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> > > > > >
> > > > > > @Override
> > > > > > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > > >                                String lastSeedVersion, long seedTime, int jobMode)
> > > > > >     throws ManifoldCFException, ServiceInterruption
> > > > > > {
> > > > > >   // return the same 3 docs every time, simulating an initial load, and then
> > > > > >   // these 3 docs changing constantly
> > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > >   activities.addSeedDocument("100");
> > > > > >   activities.addSeedDocument("110");
> > > > > >   activities.addSeedDocument("120");
> > > > > >   System.out.println("SEEDING DONE");
> > > > > >   return null;
> > > > > > }
> > > > > >
> > > > > > @Override
> > > > > > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > > >                              Specification spec, IProcessActivity activities,
> > > > > >                              int jobMode, boolean usesDefaultAuthority)
> > > > > >     throws ManifoldCFException, ServiceInterruption {
> > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > >   //   activities.deleteDocument(documentIdentifier);
> > > > > >   // }
> > > > > >
> > > > > >   // I've commented out all subsequent logic here, but adding the call to
> > > > > >   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> > > > > >   // does not change anything
> > > > > > }
> > > > > >
> > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output of this is:
> > > > > >
> > > > > > -=-= SeedTime=1558733436082
> > > > > > -=--=-= PROCESS DOCUMENTS: [100]
> > > > > > -=--=-= PROCESS DOCUMENTS: [120]
> > > > > > -=--=-= PROCESS DOCUMENTS: [110]
> > > > > > -=-= SeedTime=1558733549367
> > > > > > -=-= SeedTime=1558733609384
> > > > > > -=-= SeedTime=1558733436082
> > > > > > etc.
> > > > > >
> > > > > > The "PROCESS DOCUMENTS" output is shown once for each seed, and then never again, even though "SEEDING DONE" is printed every minute. If and only if I uncomment the for loop which deletes the documents does "processDocuments" get called again for those seed document ids.
> > > > > >
> > > > > > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as their status is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >
> > > > > > > For any given job run, all documents that are added via addSeedDocuments() should be processed.
> > > > > > > There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > > > > > >
> > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > > > > > >
> > > > > > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > > > > > >
> > > > > > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > > > > > >
> > > > > > > > The version String is calculated by `processDocuments`. But after calling `addSeedDocuments` once for document A version 1, `processDocuments` is never called again for that document, even though it has been modified to document A version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > > > > > >
> > > > > > > > > Does this sound like what your code is doing?
> > > > > > > >
> > > > > > > > Yes, as far as we can tell, given the fact that `processDocuments` is only called once for any particular document identifier.
> > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API.
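(A minimal sketch of the version-string contract Karl describes above: deleted documents get deleteDocument(), unchanged documents are skipped by comparing against the previously indexed version string, and new or changed documents are ingested. The fetchDocument() helper and its Doc type are made up for illustration; the IProcessActivity/IExistingVersions calls are the ones I believe the framework provides.)

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    Doc doc = fetchDocument(id);                       // hypothetical repository lookup
    if (doc == null || doc.isDeleted()) {
      activities.deleteDocument(id);                   // tell the framework it is gone
      continue;
    }
    String versionString = doc.changeStamp();          // anything that changes when the doc changes
    String indexedVersion = statuses.getIndexedVersionString(id);
    if (versionString.equals(indexedVersion)) {
      continue;                                        // already indexed at this version; skip
    }
    RepositoryDocument rd = new RepositoryDocument();
    rd.setBinary(doc.contentStream(), doc.contentLength());
    activities.ingestDocumentWithException(id, versionString, doc.uri(), rd);
  }
}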
> > > > > > > > > > Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get changes since that token was generated/returned.
> > > > > > > > > >
> > > > > > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > > > > > >
> > > > > > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > >
> > > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > > > > > >
> > > > > > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > > > > > >
> > > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > > > > > >
> > > > > > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > > > > > >
> > > > > > > > > > Then, in "processDocuments" the new "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as well.
> > > > > > > > > >
> > > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Raman Gupta
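(For completeness, a rough sketch of how that second, "virtual root" design could look. The deltaClient/DeltaResult helpers and the ROOT_ID value are hypothetical; noDocument() is used here on the assumption that a reasonably recent ManifoldCF is available, since it records a version string without sending anything to the index. Whether the framework re-queues the virtual root on every run depends on the connector model and the root's version string -- the deleteDocument(virtualDocId) trick described above is one way to force it.)

private static final String ROOT_ID = "__delta_root__"; // hypothetical virtual seed id

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  activities.addSeedDocument(ROOT_ID); // the only seed, on every run
  return lastSeedVersion;
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    if (ROOT_ID.equals(id)) {
      // The previously recorded delta token doubles as the root's version string.
      String previousToken = statuses.getIndexedVersionString(id);
      DeltaResult delta = (previousToken == null)
          ? deltaClient.fullListing()
          : deltaClient.changesSince(previousToken);
      // Queue every affected document (including deleted ones, so their own
      // processDocuments call can issue the deleteDocument).
      for (String childId : delta.affectedDocumentIds()) {
        activities.addDocumentReference(childId, id, "contained_in");
      }
      // Record the new token against the virtual root without indexing it.
      activities.noDocument(id, delta.nextToken());
    } else {
      // ... fetch the real document; deleteDocument() if it is gone, ingest otherwise ...
    }
  }
}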