Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  // return the same 3 docs every time, simulating an initial load, and then
  // these 3 docs changing constantly
  System.out.println(String.format("-=-= SeedTime=%s", seedTime));
  activities.addSeedDocument("100");
  activities.addSeedDocument("110");
  activities.addSeedDocument("120");
  System.out.println("SEEDING DONE");
  return null;
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
    Specification spec, IProcessActivity activities, int jobMode,
    boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
  // for (String documentIdentifier : documentIdentifiers) {
  //   activities.deleteDocument(documentIdentifier);
  // }

  // I've commented out all subsequent logic here, but adding the call to
  // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
  // does not change anything
}

When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output is:

-=-= SeedTime=1558733436082
-=--=-= PROCESS DOCUMENTS: [200]
-=--=-= PROCESS DOCUMENTS: [220]
-=--=-= PROCESS DOCUMENTS: [210]
-=-= SeedTime=1558733549367
-=-= SeedTime=1558733609384
-=-= SeedTime=1558733436082
etc.

The "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then never again, even though "SEEDING DONE" is printed every minute. If and only if I uncomment the for loop that deletes the documents does "processDocuments" get called again for those seed document IDs.

I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that.
The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as the status of them is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.

Regards,
Raman

On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
>
> For any given job run, all documents that are added via addSeedDocuments()
> should be processed. There is no magic in the framework that somehow knows
> that a document has been created vs. modified vs. deleted until
> processDocuments() is called. If your claim is that this contract is not
> being honored, could you try changing your connector model to
> MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> using that model. If it does *not* then clearly you've got some kind of
> implementation problem at the addSeedDocuments() level because most of the
> Manifold connectors use that model.
>
> If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> out why MODEL_ADD_CHANGE_DELETE is failing.
>
> Karl
>
>
> On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
>
> > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > > that you have to include *at least* the documents that were changed, added,
> > > or deleted since the previous stamp, and if no stamp is provided, it should
> > > return ALL specified documents. Are you doing that?
> >
> > Yes, the delta API gives us all the changed, added, and deleted
> > documents, and those are exactly the ones that we are including.
> >
> > > If you are, the next thing to look at is the computation of the version
> > > string. The version string is what is used to figure out if a change took
> > > place. You need this IN ADDITION TO the addSeedDocuments() doing the right
> > > thing.
> > > For deleted documents, obviously the processDocuments() should call
> > > the activities.deleteDocument() method.
> >
> > The version string is calculated by `processDocuments`. However, after
> > `addSeedDocuments` is called once for document A version 1,
> > `processDocuments` is never called again for that document, even
> > though it has been modified to document A version 2. Therefore, our
> > connector never gets a chance to return the "version 2" string.
> >
> > > Does this sound like what your code is doing?
> >
> > Yes, as far as we can go given the fact that `processDocuments` is
> > only called once for any particular document identifier.
> >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > >
> > > > My team is creating a new repository connector. The source system has
> > > > a delta API that lets us know of all new, modified, and deleted
> > > > individual folders and documents since the last call to the API. Each
> > > > call to the delta API provides the changes, as well as a token which
> > > > can be provided on subsequent calls to get changes since that token
> > > > was generated/returned.
> > > >
> > > > What is the best approach to building a repo connector to a system
> > > > that has this type of delta API?
> > > >
> > > > Our first design was an implementation that specifies
> > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > >
> > > > * In addSeedDocuments, on the initial call we seed every document in
> > > > the source system. On subsequent calls, we use the delta API to seed
> > > > every added, modified, or deleted file. We return the delta API token
> > > > as the version value of addSeedDocuments, so that it can be used on
> > > > subsequent calls.
> > > >
> > > > * In processDocuments, we do the usual thing for each document identifier.
> > > >
> > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > never triggered for modified and deleted docs.
> > > >
> > > > A second design we are considering is to use
> > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > > one "virtual" document, which represents the root of the remote repo.
> > > >
> > > > Then, in "processDocuments" the new "document" is used to determine
> > > > all the child documents of that delta call, which are then added to
> > > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > > seed" to trigger processDocuments again on the next call to
> > > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > > well.
> > > >
> > > > With this alternative design, the stage 1 seed effectively becomes a
> > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Raman Gupta
> > > >
> > >
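
For reference, the contract discussed in this thread (seed everything when there is no previous seed version, seed at least the changed documents otherwise, and let the version string decide what happens at processing time) can be illustrated with a small toy model. This is only a sketch of the intended behavior, not ManifoldCF code: the class `DeltaSeedingSketch`, its methods, and the integer delta token are all hypothetical, and none of it reproduces the framework's queueing logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of delta-token seeding plus version-string change detection.
 * Every name here is hypothetical; this is NOT the ManifoldCF API.
 */
public class DeltaSeedingSketch {
    /** Hypothetical source system: document id -> version; a missing id means deleted. */
    private final Map<String, String> source = new LinkedHashMap<>();
    /** Hypothetical change log; the delta token is simply an index into this list. */
    private final List<String> changeLog = new ArrayList<>();
    /** Version strings recorded the last time each document was processed. */
    private final Map<String, String> indexed = new LinkedHashMap<>();

    /** Adds or modifies a document in the source system. */
    public void put(String id, String version) {
        source.put(id, version);
        changeLog.add(id);
    }

    /** Deletes a document from the source system. */
    public void remove(String id) {
        source.remove(id);
        changeLog.add(id);
    }

    /**
     * Seeding step: with no previous token, seed every document; otherwise
     * seed only ids that changed since the token. Returns the new token.
     */
    public String seed(String lastToken, List<String> seeds) {
        if (lastToken == null) {
            seeds.addAll(source.keySet());
        } else {
            int from = Integer.parseInt(lastToken);
            for (String id : changeLog.subList(from, changeLog.size())) {
                if (!seeds.contains(id)) {
                    seeds.add(id);
                }
            }
        }
        return String.valueOf(changeLog.size());
    }

    /** Processing step: compare version strings to pick ingest, skip, or delete. */
    public List<String> process(List<String> ids) {
        List<String> actions = new ArrayList<>();
        for (String id : ids) {
            String current = source.get(id);
            if (current == null) {
                indexed.remove(id);
                actions.add("delete " + id);
            } else if (current.equals(indexed.get(id))) {
                actions.add("skip " + id);
            } else {
                indexed.put(id, current);
                actions.add("ingest " + id + "@" + current);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        DeltaSeedingSketch s = new DeltaSeedingSketch();
        s.put("100", "v1");
        s.put("110", "v1");
        s.put("120", "v1");

        List<String> seeds = new ArrayList<>();
        String token = s.seed(null, seeds);      // initial crawl: all documents
        System.out.println(s.process(seeds));

        s.put("100", "v2");                      // modify one document
        s.remove("120");                         // delete another
        seeds = new ArrayList<>();
        token = s.seed(token, seeds);            // incremental crawl: delta only
        System.out.println(s.process(seeds));
    }
}
```

In this model the second run seeds only documents 100 and 120, and the processing step ingests the new version of 100 and deletes 120. The behavior reported at the top of this thread corresponds to the processing step never being reached for ids that were seeded again.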