One seed per job is an interesting approach, but in the interests of fully understanding the alternatives, let me consider choice #2.
> you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion.

There are good reasons for me to prefer a single job, so how would I accomplish this? Should my connector create its own tables and manage this state there? Or is there another, more lightweight approach?

> Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.

That is fine and I understand completely -- I forgot to mention in my previous message that I've already switched to MODEL_ALL, and am detecting and providing the list of currently active seeds on every call to addSeedDocuments.

Regards,
Raman

On Mon, May 27, 2019 at 4:55 PM Karl Wright <daddy...@gmail.com> wrote:
>
> This is very different from the design you originally told me you were going to do.
>
> Generally, using hopcounts for managing your documents is a bad practice; this is expensive to do and almost always yields unexpected results. You could have one job per seed, which means all you have to do to make the seed go away is delete the job corresponding to it. If you have way too many seeds for that, you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion. Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
>
> So two choices: (1) Exactly one seed per job, or (2) don't use MODEL_ADD_CHANGE_DELETE.
>
> Karl
>
> On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > Thanks for your help Karl. So I think I'm converging on a design. First of all, per your recommendation, I've switched to a scheduled crawl and it executes as expected every minute with the "schedule window anytime" setting.
> >
> > My next problem is dealing with seed deletion. My upstream source actually has multiple "roots", i.e. each root has its own set of documents, and the delta API must be called once for each "root". To deal with this, I'm specifying each "root" as a "seed document", and each such root/seed creates "contained_in" documents. It is also possible for a "root" to be deleted by a user of the upstream system.
> >
> > My job is defined with an accurate hopcount as follows:
> >
> > "job": {
> >   ... snip naming, scheduling, output connectors, doc spec ...
> >   "hopcount_mode" to "accurate"
> >   "hopcount" to json {
> >     "link_type" to "contained_in"
> >     "count" to 1
> >   },
> >
> > For each seed, in processDocuments I am doing:
> >
> >   activities.addDocumentReference("... doc identifier ...", seedDocumentIdentifier, "contained_in");
> >
> > and then this triggers processDocuments for each of those documents, as expected.
> >
> > How do I code the connector such that I can remove the documents that are now unreachable due to the deleted seed? I don't see any calls to `processDocuments` via the framework that would allow me to do this.
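(To make choice #2 concrete, here is a minimal sketch of what I have in mind: under MODEL_ALL every run seeds every currently active root, and every child document identifier embeds the root it came from, so nothing beyond the identifier itself is needed to link a document back to its seed. The deltaClient helper and the "root:" / "rootId/docId" identifier scheme below are placeholders for illustration, not code from our actual connector.)

@Override
public int getConnectorModel() {
  // MODEL_ALL: every addSeedDocuments() call must list the complete set of live seeds.
  return MODEL_ALL;
}

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  // One seed per *currently active* root; roots deleted upstream are simply absent,
  // so everything underneath them becomes unreachable in this run.
  for (String rootId : deltaClient.listActiveRoots()) {
    activities.addSeedDocument("root:" + rootId);
  }
  return lastSeedVersion; // delta-token handling omitted here
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    if (id.startsWith("root:")) {
      String rootId = id.substring("root:".length());
      // Reference every child with an identifier that carries its root.
      for (String docId : deltaClient.listDocuments(rootId)) {
        activities.addDocumentReference(rootId + "/" + docId, id, "contained_in");
      }
    } else {
      // id is "<rootId>/<docId>", so the originating seed is always recoverable from it.
      // ... fetch, version, and ingest as usual ...
    }
  }
}

(If the framework sees the complete picture of reachable documents on every run, documents whose root has disappeared should fall out as deletions without the connector tracking any extra state of its own.)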
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > Hi Raman,
> > >
> > > (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing.
> > > (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers 7x24 times. So you can do this with one record, which triggers on every day of the week, that has a schedule window of 24 hours.
> > >
> > > Karl
> > >
> > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given we have a delta API, we thought this is what makes sense, as the delta API is efficient and we don't need to wait an entire day for a scheduled job to run. I see that if I change the recrawl interval and max recrawl interval also to 1 minute, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
> > > >
> > > > I just want to process the particular documents I specify on each iteration every 60 seconds -- no more, no less -- and yet I seem unable to build a connector that does this.
> > > >
> > > > If I move to a non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to just schedule every minute with one schedule. I would have expected schedules to just use cron expressions.
> > > >
> > > > If I move to design #2 in my OP and have one "virtual document" to just avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > >
> > > > > You were saying that every minute addSeedDocuments is being called, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
> > > > >
> > > > > The reason I ask is because continuous crawling has very unique kinds of ways of dealing with documents it has crawled. It uses "exponential backoff" to schedule the next document crawl, and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
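(On my question above about where to keep the delta token: as I understand it, the framework itself persists whatever String addSeedDocuments() returns and hands it back as lastSeedVersion on the next seeding pass, so no connector-owned table should be needed. A rough sketch -- deltaClient and DeltaResult are hypothetical wrappers around our delta API:)

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  // First run: no token yet, so list everything; later runs: only the changes.
  DeltaResult delta = (lastSeedVersion == null || lastSeedVersion.length() == 0)
      ? deltaClient.fullListing()
      : deltaClient.changesSince(lastSeedVersion);
  for (String id : delta.affectedDocumentIds()) {
    activities.addSeedDocument(id);
  }
  // The framework stores this and passes it back as lastSeedVersion next time.
  return delta.nextToken();
}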
> > > > >
> > > > > Karl
> > > > >
> > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> > > > > >
> > > > > > @Override
> > > > > > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > > >                                String lastSeedVersion, long seedTime, int jobMode)
> > > > > >     throws ManifoldCFException, ServiceInterruption
> > > > > > {
> > > > > >   // return the same 3 docs every time, simulating an initial load, and then
> > > > > >   // these 3 docs changing constantly
> > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > >   activities.addSeedDocument("100");
> > > > > >   activities.addSeedDocument("110");
> > > > > >   activities.addSeedDocument("120");
> > > > > >   System.out.println("SEEDING DONE");
> > > > > >   return null;
> > > > > > }
> > > > > >
> > > > > > @Override
> > > > > > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > > >                              Specification spec, IProcessActivity activities,
> > > > > >                              int jobMode, boolean usesDefaultAuthority)
> > > > > >     throws ManifoldCFException, ServiceInterruption {
> > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > >   //   activities.deleteDocument(documentIdentifier);
> > > > > >   // }
> > > > > >
> > > > > >   // I've commented out all subsequent logic here, but adding the call to
> > > > > >   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> > > > > >   // does not change anything
> > > > > > }
> > > > > >
> > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output of this is:
> > > > > >
> > > > > > -=-= SeedTime=1558733436082
> > > > > > -=--=-= PROCESS DOCUMENTS: [100]
> > > > > > -=--=-= PROCESS DOCUMENTS: [120]
> > > > > > -=--=-= PROCESS DOCUMENTS: [110]
> > > > > > -=-= SeedTime=1558733549367
> > > > > > -=-= SeedTime=1558733609384
> > > > > > -=-= SeedTime=1558733436082
> > > > > > etc.
> > > > > >
> > > > > > The "PROCESS DOCUMENTS" output is shown once for each seed, and then never again, even though "SEEDING DONE" is printed every minute. If and only if I uncomment the for loop which deletes the documents does "processDocuments" get called again for those seed document ids.
> > > > > >
> > > > > > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as their status is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >
> > > > > > > For any given job run, all documents that are added via addSeedDocuments() should be processed.
> > > > > > > There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > > > > > >
> > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > > > > > >
> > > > > > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > > > > > >
> > > > > > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > > > > > >
> > > > > > > > The version String is calculated by `processDocuments`. But after calling `addSeedDocuments` once for document A version 1, `processDocuments` is never called again for that document, even though it has been modified to document A version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > > > > > >
> > > > > > > > > Does this sound like what your code is doing?
> > > > > > > >
> > > > > > > > Yes, as far as we can tell, given the fact that `processDocuments` is only called once for any particular document identifier.
> > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API.
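(A minimal sketch of the version-string contract Karl describes above: deleted documents get deleteDocument(), unchanged documents are skipped by comparing against the previously indexed version string, and new or changed documents are ingested. The fetchDocument() helper and its Doc type are made up for illustration; the IProcessActivity/IExistingVersions calls are the ones I believe the framework provides.)

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    Doc doc = fetchDocument(id);                       // hypothetical repository lookup
    if (doc == null || doc.isDeleted()) {
      activities.deleteDocument(id);                   // tell the framework it is gone
      continue;
    }
    String versionString = doc.changeStamp();          // anything that changes when the doc changes
    String indexedVersion = statuses.getIndexedVersionString(id);
    if (versionString.equals(indexedVersion)) {
      continue;                                        // already indexed at this version; skip
    }
    RepositoryDocument rd = new RepositoryDocument();
    rd.setBinary(doc.contentStream(), doc.contentLength());
    activities.ingestDocumentWithException(id, versionString, doc.uri(), rd);
  }
}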
> > > > > > > > > > Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get changes since that token was generated/returned.
> > > > > > > > > >
> > > > > > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > > > > > >
> > > > > > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > >
> > > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > > > > > >
> > > > > > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > > > > > >
> > > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > > > > > >
> > > > > > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > > > > > >
> > > > > > > > > > Then, in "processDocuments" the new "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as well.
> > > > > > > > > >
> > > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Raman Gupta
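(For completeness, a rough sketch of how that second, "virtual root" design could look. The deltaClient/DeltaResult helpers and the ROOT_ID value are hypothetical; noDocument() is used here on the assumption that a reasonably recent ManifoldCF is available, since it records a version string without sending anything to the index. Whether the framework re-queues the virtual root on every run depends on the connector model and the root's version string -- the deleteDocument(virtualDocId) trick described above is one way to force it.)

private static final String ROOT_ID = "__delta_root__"; // hypothetical virtual seed id

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption {
  activities.addSeedDocument(ROOT_ID); // the only seed, on every run
  return lastSeedVersion;
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
    throws ManifoldCFException, ServiceInterruption {
  for (String id : documentIdentifiers) {
    if (ROOT_ID.equals(id)) {
      // The previously recorded delta token doubles as the root's version string.
      String previousToken = statuses.getIndexedVersionString(id);
      DeltaResult delta = (previousToken == null)
          ? deltaClient.fullListing()
          : deltaClient.changesSince(previousToken);
      // Queue every affected document (including deleted ones, so their own
      // processDocuments call can issue the deleteDocument).
      for (String childId : delta.affectedDocumentIds()) {
        activities.addDocumentReference(childId, id, "contained_in");
      }
      // Record the new token against the virtual root without indexing it.
      activities.noDocument(id, delta.nextToken());
    } else {
      // ... fetch the real document; deleteDocument() if it is gone, ingest otherwise ...
    }
  }
}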