Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but since we have a delta API, continuous crawling seemed to make the most sense: the delta API is efficient, and we don't want to wait an entire day for a scheduled job to run. I see that if I also change the recrawl interval and max recrawl interval to 1 minute, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and my seed method is not seeding anything, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
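For reference, here is roughly what our real seeding looks like, simplified. DeltaClient, DeltaResult, and DeltaEntry are stand-ins for our own wrapper around the source system's delta API, not ManifoldCF classes; only the ISeedingActivity calls are the framework's:

import org.apache.manifoldcf.core.interfaces.*;
import org.apache.manifoldcf.agents.interfaces.*;
import org.apache.manifoldcf.crawler.interfaces.*;

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption
{
  // deltaClient is a field on our connector wrapping the source system's delta API.
  // First run: lastSeedVersion is null, so enumerate everything.
  // Later runs: ask only for what changed since the stored token.
  DeltaResult delta = (lastSeedVersion == null)
      ? deltaClient.listAll()
      : deltaClient.changesSince(lastSeedVersion);

  // Seed only the identifiers the delta API reports as added, changed, or deleted.
  // On a quiet interval this loop seeds nothing at all.
  for (DeltaEntry entry : delta.getEntries()) {
    activities.addSeedDocument(entry.getDocumentIdentifier());
  }

  // Whatever we return here is handed back as lastSeedVersion on the next call,
  // so the delta token is the only seeding state we keep.
  return delta.getToken();
}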
I just want to process the particular documents I specify on each iteration, every 60 seconds -- no more, no less -- and yet I seem unable to build a connector that does this. If I move to non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to schedule every minute with a single schedule object. I would have expected schedules to just use cron expressions.

If I move to design #2 in my OP and use one "virtual document" to avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?

Regards,
Raman

On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
>
> So MODEL_ADD_CHANGE does not work for you, eh?
>
> You were saying that addSeedDocuments is being called every minute, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
>
> The reason I ask is that continuous crawling deals with documents it has already crawled in its own particular way. It uses "exponential backoff" to schedule the next crawl of a document, and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
>
> Karl
>
> On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
>
> > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> >
> > @Override
> > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> >                                String lastSeedVersion, long seedTime, int jobMode)
> >     throws ManifoldCFException, ServiceInterruption
> > {
> >   // return the same 3 docs every time, simulating an initial load, and then
> >   // these 3 docs changing constantly
> >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> >   activities.addSeedDocument("100");
> >   activities.addSeedDocument("110");
> >   activities.addSeedDocument("120");
> >   System.out.println("SEEDING DONE");
> >   return null;
> > }
> >
> > @Override
> > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> >                              Specification spec, IProcessActivity activities,
> >                              int jobMode, boolean usesDefaultAuthority)
> >     throws ManifoldCFException, ServiceInterruption
> > {
> >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> >   // for (String documentIdentifier : documentIdentifiers) {
> >   //   activities.deleteDocument(documentIdentifier);
> >   // }
> >
> >   // I've commented out all subsequent logic here, but adding the call to
> >   //   activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> >   // does not change anything
> > }
> >
> > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output is:
> >
> > -=-= SeedTime=1558733436082
> > -=--=-= PROCESS DOCUMENTS: [100]
> > -=--=-= PROCESS DOCUMENTS: [120]
> > -=--=-= PROCESS DOCUMENTS: [110]
> > -=-= SeedTime=1558733549367
> > -=-= SeedTime=1558733609384
> > etc.
> >
> > The "PROCESS DOCUMENTS" output for 100, 110, and 120 is shown once, and then never again, even though "SEEDING DONE" is printed every minute.
> > If and only if I uncomment the for loop that deletes the documents does processDocuments get called again for those seed document IDs.
> >
> > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as their status is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > For any given job run, all documents that are added via addSeedDocuments() should be processed. There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > >
> > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > >
> > > Karl
> > >
> > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > >
> > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > >
> > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > >
> > > > The version string is calculated by `processDocuments`. However, after `addSeedDocuments` is called once for document A version 1, `processDocuments` is never called again for that document, even though it has since been modified to version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > >
> > > > > Does this sound like what your code is doing?
> > > >
> > > > Yes, as far as we are able to, given that `processDocuments` is only called once for any particular document identifier.
> > > >
> > > > > Karl
> > > > >
> > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API.
> > > > > > Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get the changes since that token was generated/returned.
> > > > > >
> > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > >
> > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > >
> > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > >
> > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > >
> > > > > > When prototyping this, it works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > >
> > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > >
> > > > > > Then, in "processDocuments", the virtual "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we also do `activities.deleteDocument(virtualDocId)`.
> > > > > >
> > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Regards,
> > > > > > Raman Gupta
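To make design #1 above concrete, here is roughly the shape our processDocuments takes -- a simplified sketch, not the real code. SourceDoc and fetchDocument are stand-ins for our own wrapper around the source system's content API, and the version string format is just an example; only the IProcessActivity and RepositoryDocument calls are ManifoldCF's:

import java.io.ByteArrayInputStream;
import org.apache.manifoldcf.core.interfaces.*;
import org.apache.manifoldcf.agents.interfaces.*;
import org.apache.manifoldcf.crawler.interfaces.*;

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                             Specification spec, IProcessActivity activities,
                             int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption
{
  for (String documentIdentifier : documentIdentifiers) {
    // Our own lookup against the source system (placeholder helper).
    SourceDoc doc = fetchDocument(documentIdentifier);

    // Deletions: tell the framework the document is gone so the index gets cleaned up.
    if (doc == null || doc.isDeleted()) {
      activities.deleteDocument(documentIdentifier);
      continue;
    }

    // The version string is what the framework compares to decide whether
    // the document changed since it was last indexed.
    String versionString = doc.getModifiedDate() + ":" + doc.getSize();
    if (!activities.checkDocumentNeedsReindexing(documentIdentifier, versionString)) {
      continue; // unchanged since the last ingestion
    }

    // Hand the content off to the output connector.
    byte[] content = doc.getContentBytes();
    RepositoryDocument rd = new RepositoryDocument();
    rd.setBinary(new ByteArrayInputStream(content), content.length);
    try {
      activities.ingestDocumentWithException(documentIdentifier, versionString,
                                             doc.getUri(), rd);
    } catch (java.io.IOException e) {
      // In the real connector we would distinguish retryable errors; here we just fail the document.
      throw new ManifoldCFException("Error ingesting " + documentIdentifier, e);
    }
  }
}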