Yes, we are indeed running it in continuous crawl mode. Scheduled mode
works, but since we have a delta API, continuous mode seemed to make the
most sense: the delta API is efficient, and we don't want to wait an
entire day for a scheduled job to run. I see that if I also change the
recrawl interval and max recrawl interval to 1 minute, my documents do
get processed each time. However, now we have the opposite problem: the
documents are reprocessed every minute, regardless of whether they were
reseeded, which makes no sense to me. If I am using
MODEL_ADD_CHANGE_DELETE and not returning anything from my seed method,
why are the same documents being reprocessed over and over? I have sent
the output to the NullOutput using `ingestDocumentWithException`, and
the status shows OK, yet the same documents are repeatedly sent to
processDocuments.
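
For concreteness, here is roughly the shape of what our processDocuments does
once a document reaches it (a minimal sketch only: computeVersion and
buildDocument are hypothetical stand-ins for our real logic, the document URI
is a placeholder, and the skip-if-unchanged check via getIndexedVersionString
reflects my understanding of how the version comparison is supposed to work):

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                             Specification spec, IProcessActivity activities,
                             int jobMode, boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption {
  for (String documentIdentifier : documentIdentifiers) {
    // Hypothetical helper: derive a version string from the source system's
    // modification stamp for this document.
    String version = computeVersion(documentIdentifier);
    // Skip documents whose previously indexed version string is unchanged.
    if (version.equals(statuses.getIndexedVersionString(documentIdentifier))) {
      continue;
    }
    String documentUri = "https://example.invalid/docs/" + documentIdentifier; // placeholder
    RepositoryDocument rd = buildDocument(documentIdentifier); // hypothetical helper
    activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
  }
}

Even with a flow like this, and the ingestion status showing OK, the same
identifiers keep arriving at processDocuments every minute.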

I just want to process, every 60 seconds, exactly the documents I
specify on each iteration -- no more, no less -- and yet I seem unable
to build a connector that does this.

If I move to non-continuous mode, do I really have to create 1440
schedule objects, one for each minute of the day? The way the schedule
seems to be put together, I don't see a way to schedule a run every
minute with a single schedule object. I would have expected schedules
to just use cron expressions (e.g. `* * * * *` for every minute).

If I move to design #2 from my OP and use one "virtual document" just
to avoid the seeding stage altogether, is there some place where I can
store the delta token state? Or does my connector have to create its
own DB table to store this?
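
For context, in design #1 the token already round-trips through the framework
via the seeding version string: whatever addSeedDocuments returns is handed
back as lastSeedVersion on the next call. A minimal sketch of that, where
fetchDelta and DeltaResult are hypothetical stand-ins for our delta API client:

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // lastSeedVersion is whatever we returned last time, so it can carry the
  // delta token; null means this is the initial (full) crawl.
  DeltaResult delta = fetchDelta(lastSeedVersion); // hypothetical delta API client
  for (String id : delta.changedAddedOrDeletedIds()) {
    activities.addSeedDocument(id);
  }
  // This return value comes back to us as lastSeedVersion on the next call.
  return delta.token();
}

Design #2 bypasses seeding entirely, so there is no equivalent return value --
hence the question about where the token state would live.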

Regards,
Raman

On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
>
> So MODEL_ADD_CHANGE does not work for you, eh?
>
> You were saying that every minute a addSeedDocuments is being called,
> correct?  It sounds to me like you are running this job in continuous crawl
> mode.  Can you try running the job in non-continuous mode, and just
> repeating the job run once it completes?
>
> The reason I ask is that continuous crawling handles documents it has already
> crawled in a very particular way.  It uses "exponential backoff" to schedule
> the next crawl of each document, and that is probably why you see the
> documents in the queue but not being processed; you simply haven't waited
> long enough.
>
> Karl
>
>
> On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
>
> > Here are my addSeedDocuments and processDocuments methods, simplified down
> > to the minimum necessary to show what is happening:
> >
> > @Override
> > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> >                                String lastSeedVersion, long seedTime, int jobMode)
> >   throws ManifoldCFException, ServiceInterruption
> > {
> >   // Return the same 3 docs every time, simulating an initial load and then
> >   // these 3 docs changing constantly.
> >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> >   activities.addSeedDocument("100");
> >   activities.addSeedDocument("110");
> >   activities.addSeedDocument("120");
> >   System.out.println("SEEDING DONE");
> >   return null;
> > }
> >
> > @Override
> > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> >                              Specification spec, IProcessActivity activities,
> >                              int jobMode, boolean usesDefaultAuthority)
> >   throws ManifoldCFException, ServiceInterruption {
> >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> >   // for (String documentIdentifier : documentIdentifiers) {
> >   //   activities.deleteDocument(documentIdentifier);
> >   // }
> >
> >   // I've commented out all subsequent logic here, but adding a call to
> >   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> >   // does not change anything.
> > }
> >
> > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > MODEL_ADD_CHANGE, the output of this is:
> >
> > -=-= SeedTime=1558733436082
> > -=--=-= PROCESS DOCUMENTS: [200]
> > -=--=-= PROCESS DOCUMENTS: [220]
> > -=--=-= PROCESS DOCUMENTS: [210]
> > -=-= SeedTime=1558733549367
> > -=-= SeedTime=1558733609384
> > -=-= SeedTime=1558733436082
> > etc.
> >
> > The "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> > never again, even though "SEEDING DONE" is printed every minute. If and
> > only if I uncomment the for loop that deletes the documents does
> > "processDocuments" get called again for those seed document ids.
> >
> > I do note that the queue shows documents 100, 110, and 120 in state
> > "Waiting for processing", and nothing I do seems to affect that. The
> > database update in JobQueue.updateExistingRecordInitial is a no-op for
> > these docs, as their status is STATUS_PENDINGPURGATORY and the update
> > does not actually change anything in the db.
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > For any given job run, all documents that are added via addSeedDocuments()
> > > should be processed.  There is no magic in the framework that somehow knows
> > > whether a document has been created vs. modified vs. deleted until
> > > processDocuments() is called.  If your claim is that this contract is not
> > > being honored, could you try changing your connector model to
> > > MODEL_ADD_CHANGE, just temporarily, to see if everything works using that
> > > model?  If it does *not*, then clearly you've got some kind of
> > > implementation problem at the addSeedDocuments() level, because most of the
> > > Manifold connectors use that model.
> > >
> > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > >
> > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > > > > that you have to include *at least* the documents that were changed, added,
> > > > > or deleted since the previous stamp, and if no stamp is provided, it should
> > > > > return ALL specified documents.  Are you doing that?
> > > >
> > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > documents, and those are exactly the ones that we are including.
> > > >
> > > > > If you are, the next thing to look at is the computation of the version
> > > > > string.  The version string is what is used to figure out if a change took
> > > > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the right
> > > > > thing.  For deleted documents, obviously the processDocuments() should call
> > > > > the activities.deleteDocument() method.
> > > >
> > > > The version string is calculated by `processDocuments`. But after
> > > > `addSeedDocuments` is called once for document A version 1,
> > > > `processDocuments` is never called again for that document, even
> > > > though it has since been modified to version 2. Therefore, our
> > > > connector never gets a chance to return the "version 2" string.
> > > >
> > > > > Does this sound like what your code is doing?
> > > >
> > > > Yes, as far as we can go, given that `processDocuments` is only called
> > > > once for any particular document identifier.
> > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > >
> > > > > > My team is creating a new repository connector. The source system has
> > > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > > individual folders and documents since the last call to the API. Each
> > > > > > call to the delta API provides the changes, as well as a token which
> > > > > > can be provided on subsequent calls to get changes since that token
> > > > > > was generated/returned.
> > > > > >
> > > > > > What is the best approach to building a repo connector to a system
> > > > > > that has this type of delta API?
> > > > > >
> > > > > > Our first design was an implementation that specifies
> > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > >
> > > > > > * In addSeedDocuments, on the initial call we seed every document in
> > > > > > the source system. On subsequent calls, we use the delta API to seed
> > > > > > every added, modified, or deleted file. We return the delta API token
> > > > > > as the version value of addSeedDocuments, so that it can be used on
> > > > > > subsequent calls.
> > > > > >
> > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > >
> > > > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > > > never triggered for modified and deleted docs.
> > > > > >
> > > > > > A second design we are considering is to use
> > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > > > > one "virtual" document, which represents the root of the remote repo.
> > > > > >
> > > > > > Then, in "processDocuments" the new "document" is used to determine
> > > > > > all the child documents of that delta call, which are then added to
> > > > > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > > > > well.
> > > > > >
> > > > > > With this alternative design, the stage 1 seed effectively becomes a
> > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Regards,
> > > > > > Raman Gupta
> > > > > >
> > > >
> >
