On Mon, May 27, 2019 at 5:58 PM Karl Wright <daddy...@gmail.com> wrote:
>
> (1) There should be no new tables needed for any of this. Your seed list
> can be stored in the job specification information. See the rss connector
> for a simple example of how this might be done.
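
If I understand point (1), a sketch for a static seed list kept in the job specification might look roughly like the following -- the "seedroot" node type and "value" attribute are made-up names for illustration, not what the rss connector actually uses:

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption
    {
      // Walk the job specification and seed one document per configured root.
      // "seedroot" and "value" are hypothetical names used only for this sketch.
      for (int i = 0; i < spec.getChildCount(); i++)
      {
        SpecificationNode node = spec.getChild(i);
        if (node.getType().equals("seedroot"))
          activities.addSeedDocument(node.getAttributeValue("value"));
      }
      return null;
    }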
Are you assuming the seed list is static? The RSS connector only changes the job specification in the `processSpecificationPost` method, and I assume that the job spec is read-only in `addSeedDocuments`. As I thought I made clear in previous messages, my seed list is dynamic, which is why I switched to MODEL_ALL -- on each call to addSeedDocuments, I can dynamically determine which seeds are relevant, and only provide that list.

To provide additional context, the dynamic seed list is based on regular expression matches, where the underlying seeds/roots can come and go based on which ones match the regexes, and the regexes are present in the document spec.

> (2) If you have switched to MODEL_ALL then all you need to do is provide a mechanism for any given document for determining which seed it comes from, and simply look for that in the job specification. If not there, call activities.removeDocument().

See above.

Regards,
Raman

> Karl
>
> On Mon, May 27, 2019 at 5:16 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > One seed per job is an interesting approach but in the interests of fully understanding the alternatives let me consider choice #2.
> >
> > > you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion.
> >
> > There are good reasons for me to prefer a single job, so how would I accomplish this? Should my connector create its own tables and manage this state there? Or is there another more light-weight approach?
> >
> > > Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
> >
> > That is fine and I understand completely -- I forgot to mention in my previous message that I've already switched to MODEL_ALL, and am detecting and providing the list of currently active seeds on every call to addSeedDocuments.
> >
> > Regards,
> > Raman
> >
> > On Mon, May 27, 2019 at 4:55 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > This is very different from the design you originally told me you were going to do.
> > >
> > > Generally, using hopcounts for managing your documents is a bad practice; this is expensive to do and almost always yields unexpected results.
> > > You could have one job per seed, which means all you have to do to make the seed go away is delete the job corresponding to it. If you have way too many seeds for that, you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion. Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
> > >
> > > So two choices: (1) Exactly one seed per job, or (2) don't use MODEL_ADD_CHANGE_DELETE.
> > >
> > > Karl
> > >
> > > On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > Thanks for your help Karl. So I think I'm converging on a design.
> > > > First of all, per your recommendation, I've switched to scheduled crawl and it executes as expected every minute with the "schedule window anytime" setting.
> > > >
> > > > My next problem is dealing with seed deletion. My upstream source actually has multiple "roots", i.e. each root has its own set of documents, and the delta API must be called once for each "root". To deal with this, I'm specifying each "root" as a "seed document", and each such root/seed creates "contained_in" documents. It is also possible for a "root" to be deleted by a user of the upstream system.
> > > >
> > > > My job is defined with an accurate hopcount as follows:
> > > >
> > > > "job": {
> > > >   ... snip naming, scheduling, output connectors, doc spec....
> > > >   "hopcount_mode" to "accurate"
> > > >   "hopcount" to json {
> > > >     "link_type" to "contained_in"
> > > >     "count" to 1
> > > >   },
> > > >
> > > > For each seed, in processDocuments I am doing:
> > > >
> > > > activities.addDocumentReference("... doc identifier ...", seedDocumentIdentifier, "contained_in");
> > > >
> > > > and then this triggers processDocuments for each of those documents, as expected.
> > > >
> > > > How do I code the connector such that I can now remove the documents that are now unreachable due to the deleted seed? I don't see any calls to `processDocuments` via the framework that would allow me to do this.
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > Hi Raman,
> > > > >
> > > > > (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing.
> > > > > (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers 7x24 times. So you can do this with one record, which triggers on every day of the week, that has a schedule window of 24 hours.
> > > > >
> > > > > Karl
> > > > >
> > > > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given we have a delta API, we thought this is what makes sense, as the delta API is efficient and we don't need to wait an entire day for a scheduled job to run. I see that if I change recrawl interval and max recrawl interval also to 1 minute, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
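
For context, the direction I'm heading for processDocuments under MODEL_ALL is roughly the sketch below. rootOf(), rootIsStillConfigured(), currentVersionOf(), and readContent() are hypothetical helpers; the idea is that a document whose root no longer matches the job specification gets removed, and everything else is re-fetched only when its version string changes:

    @Override
    public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                                 Specification spec, IProcessActivity activities,
                                 int jobMode, boolean usesDefaultAuthority)
        throws ManifoldCFException, ServiceInterruption
    {
      for (String documentIdentifier : documentIdentifiers)
      {
        // rootOf() maps a document back to the root/seed it came from;
        // rootIsStillConfigured() checks that root against the regexes in the spec.
        // Both are hypothetical helpers for this sketch.
        String root = rootOf(documentIdentifier);
        if (!rootIsStillConfigured(root, spec))
        {
          // The seed this document came from is gone, so drop the document.
          activities.removeDocument(documentIdentifier);
          continue;
        }

        // currentVersionOf() is a hypothetical helper returning a version stamp
        // from the repository for this document.
        String newVersion = currentVersionOf(documentIdentifier);
        if (!activities.checkDocumentNeedsReindexing(documentIdentifier, newVersion))
          continue; // unchanged since the last crawl; skip the fetch/ingest

        RepositoryDocument rd = readContent(documentIdentifier); // hypothetical helper
        activities.ingestDocumentWithException(documentIdentifier, newVersion,
            "myrepo://" + documentIdentifier, rd); // URI scheme is illustrative only
      }
    }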
> > > > > > I just want to process the particular documents I specify on each iteration every 60 seconds -- no more, no less, and yet I seem unable to build a connector that does this.
> > > > > >
> > > > > > If I move to a non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to just schedule every minute with one schedule. I would have expected schedules to just use cron expressions.
> > > > > >
> > > > > > If I move to design #2 in my OP and have one "virtual document" to just avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >
> > > > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > > > >
> > > > > > > You were saying that every minute addSeedDocuments is being called, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
> > > > > > >
> > > > > > > The reason I ask is because continuous crawling has its own unique ways of dealing with documents it has crawled. It uses "exponential backoff" to schedule the next document crawl and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
> > > > > > > Karl
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > > > > >                                String lastSeedVersion, long seedTime, int jobMode)
> > > > > > > >     throws ManifoldCFException, ServiceInterruption
> > > > > > > > {
> > > > > > > >   // return the same 3 docs every time, simulating an initial load, and then
> > > > > > > >   // these 3 docs changing constantly
> > > > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > > > >   activities.addSeedDocument("100");
> > > > > > > >   activities.addSeedDocument("110");
> > > > > > > >   activities.addSeedDocument("120");
> > > > > > > >   System.out.println("SEEDING DONE");
> > > > > > > >   return null;
> > > > > > > > }
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > > > > >                              Specification spec, IProcessActivity activities,
> > > > > > > >                              int jobMode, boolean usesDefaultAuthority)
> > > > > > > >     throws ManifoldCFException, ServiceInterruption {
> > > > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> > > > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > > > >   //   activities.deleteDocument(documentIdentifier);
> > > > > > > >   // }
> > > > > > > >
> > > > > > > >   // I've commented out all subsequent logic here, but adding the call to
> > > > > > > >   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> > > > > > > >   // does not change anything
> > > > > > > > }
> > > > > > > >
> > > > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output of this is:
> > > > > > > >
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [100]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [120]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [110]
> > > > > > > > -=-= SeedTime=1558733549367
> > > > > > > > -=-= SeedTime=1558733609384
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > etc.
> > > > > > > >
> > > > > > > > The "PROCESS DOCUMENTS" output is shown once for each of 100, 110, and 120, and then never again, even though "SEEDING DONE" is printing every minute. If and only if I uncomment the for loop which deletes the documents does "processDocuments" get called again for those seed document ids.
> > > > > > > >
> > > > > > > > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as the status of them is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> > > > > > > > Regards,
> > > > > > > > Raman
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > For any given job run, all documents that are added via addSeedDocuments() should be processed. There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > > > > > > > >
> > > > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > > > > > > > >
> > > > > > > > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > > > > > > > >
> > > > > > > > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > > > > > > > >
> > > > > > > > > > The version String is calculated by `processDocuments`. However, after `addSeedDocuments` is called once for document A version 1, `processDocuments` is never called again for that document, even though it has been modified to document A version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > > > > > > > >
> > > > > > > > > > > Does this sound like what your code is doing?
> > > > > > > > > > Yes, as far as we can, given that `processDocuments` is only called once for any particular document identifier.
> > > > > > > > > >
> > > > > > > > > > > Karl
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API. Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get changes since that token was generated/returned.
> > > > > > > > > > > >
> > > > > > > > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > > > > > > > >
> > > > > > > > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > > > >
> > > > > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > > > > > > > >
> > > > > > > > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > > > > > > > >
> > > > > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > > > > > > > >
> > > > > > > > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > > > > > > > >
> > > > > > > > > > > > Then, in "processDocuments" the new "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as well.
> > > > > > > > > > > >
> > > > > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Raman Gupta
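
To make design #1 above a bit more concrete, here is a rough sketch of the seeding side, with the delta token riding along as the seeding version string: lastSeedVersion carries the token from the previous run (empty on the first run), and the new token is returned at the end. callDeltaApi(), fullScan(), and the DeltaResult type are hypothetical stand-ins for the real source-system client:

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption
    {
      if (lastSeedVersion == null || lastSeedVersion.length() == 0)
      {
        // First run: no token yet, so seed every document in the source system.
        // fullScan() and DeltaResult are hypothetical.
        DeltaResult initial = fullScan();
        for (String id : initial.documentIds)
          activities.addSeedDocument(id);
        return initial.nextToken;
      }

      // Subsequent runs: ask the delta API for everything added, modified, or
      // deleted since the token returned last time. callDeltaApi() is hypothetical.
      DeltaResult delta = callDeltaApi(lastSeedVersion);
      for (String id : delta.changedOrDeletedDocumentIds)
        activities.addSeedDocument(id);

      // The returned string becomes lastSeedVersion on the next seeding pass.
      // Deletions reported by the delta API would then be handled in
      // processDocuments() via activities.deleteDocument().
      return delta.nextToken;
    }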