On Mon, May 27, 2019 at 5:58 PM Karl Wright <daddy...@gmail.com> wrote:
>
> (1) There should be no new tables needed for any of this. Your seed list
> can be stored in the job specification information. See the rss connector
> for a simple example of how this might be done.
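
If I understand point (1), a sketch for a static seed list kept in the job specification might look roughly like the following -- the "seedroot" node type and "value" attribute are made-up names for illustration, not what the rss connector actually uses:

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption
    {
      // Walk the job specification and seed one document per configured root.
      // "seedroot" and "value" are hypothetical names used only for this sketch.
      for (int i = 0; i < spec.getChildCount(); i++)
      {
        SpecificationNode node = spec.getChild(i);
        if (node.getType().equals("seedroot"))
          activities.addSeedDocument(node.getAttributeValue("value"));
      }
      return null;
    }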
Are you assuming the seed list is static? The RSS connector only changes the job specification in the `processSpecificationPost` method, and I assume that the job spec is read-only in `addSeedDocuments`. As I thought I made clear in previous messages, my seed list is dynamic, which is why I switched to MODEL_ALL -- on each call to addSeedDocuments, I can dynamically determine which seeds are relevant, and only provide that list.

To provide additional context, the dynamic seed list is based on regular expression matches, where the underlying seeds/roots can come and go based on which ones match the regexes, and the regexes are present in the document spec.

> (2) If you have switched to MODEL_ALL then all you need to do is provide a mechanism for any given document for determining which seed it comes from, and simply look for that in the job specification. If not there, call activities.removeDocument().

See above.

Regards,
Raman

> Karl
>
> On Mon, May 27, 2019 at 5:16 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > One seed per job is an interesting approach but in the interests of fully understanding the alternatives let me consider choice #2.
> >
> > > you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion.
> >
> > There are good reasons for me to prefer a single job, so how would I accomplish this? Should my connector create its own tables and manage this state there? Or is there another more light-weight approach?
> >
> > > Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
> >
> > That is fine and I understand completely -- I forgot to mention in my previous message that I've already switched to MODEL_ALL, and am detecting and providing the list of currently active seeds on every call to addSeedDocuments.
> >
> > Regards,
> > Raman
> >
> > On Mon, May 27, 2019 at 4:55 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > This is very different from the design you originally told me you were going to do.
> > >
> > > Generally, using hopcounts for managing your documents is a bad practice; this is expensive to do and almost always yields unexpected results.
> > > You could have one job per seed, which means all you have to do to make the seed go away is delete the job corresponding to it. If you have way too many seeds for that, you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion. Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.
> > >
> > > So two choices: (1) Exactly one seed per job, or (2) don't use MODEL_ADD_CHANGE_DELETE.
> > >
> > > Karl
> > >
> > > On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > Thanks for your help Karl. So I think I'm converging on a design.
> > > > First of all, per your recommendation, I've switched to scheduled crawl and it executes as expected every minute with the "schedule window anytime" setting.
> > > >
> > > > My next problem is dealing with seed deletion. My upstream source actually has multiple "roots", i.e. each root has its own set of documents, and the delta API must be called once for each "root". To deal with this, I'm specifying each "root" as a "seed document", and each such root/seed creates "contained_in" documents. It is also possible for a "root" to be deleted by a user of the upstream system.
> > > >
> > > > My job is defined with an accurate hopcount as follows:
> > > >
> > > > "job": {
> > > >   ... snip naming, scheduling, output connectors, doc spec....
> > > >   "hopcount_mode" to "accurate"
> > > >   "hopcount" to json {
> > > >     "link_type" to "contained_in"
> > > >     "count" to 1
> > > >   },
> > > >
> > > > For each seed, in processDocuments I am doing:
> > > >
> > > > activities.addDocumentReference("... doc identifier ...", seedDocumentIdentifier, "contained_in");
> > > >
> > > > and then this triggers processDocuments for each of those documents, as expected.
> > > >
> > > > How do I code the connector such that I can now remove the documents that are now unreachable due to the deleted seed? I don't see any calls to `processDocuments` via the framework that would allow me to do this.
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > Hi Raman,
> > > > >
> > > > > (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing.
> > > > > (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers 7x24 times. So you can do this with one record, which triggers on every day of the week, that has a schedule window of 24 hours.
> > > > >
> > > > > Karl
> > > > >
> > > > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given we have a delta API, we thought this is what makes sense, as the delta API is efficient and we don't need to wait an entire day for a scheduled job to run. I see that if I change recrawl interval and max recrawl interval also to 1 minute, then my documents do get processed each time. However, now we have the opposite problem: the documents are reprocessed every minute, regardless of whether they were reseeded or not, which makes no sense to me. If I am using MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method, then why are the same documents being reprocessed over and over? I have sent the output to the NullOutput using `ingestDocumentWithException` and the status shows OK, and yet the same documents are repeatedly sent to processDocuments.
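
For context, the direction I'm heading for processDocuments under MODEL_ALL is roughly the sketch below. rootOf(), rootIsStillConfigured(), currentVersionOf(), and readContent() are hypothetical helpers; the idea is that a document whose root no longer matches the job specification gets removed, and everything else is re-fetched only when its version string changes:

    @Override
    public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                                 Specification spec, IProcessActivity activities,
                                 int jobMode, boolean usesDefaultAuthority)
        throws ManifoldCFException, ServiceInterruption
    {
      for (String documentIdentifier : documentIdentifiers)
      {
        // rootOf() maps a document back to the root/seed it came from;
        // rootIsStillConfigured() checks that root against the regexes in the spec.
        // Both are hypothetical helpers for this sketch.
        String root = rootOf(documentIdentifier);
        if (!rootIsStillConfigured(root, spec))
        {
          // The seed this document came from is gone, so drop the document.
          activities.removeDocument(documentIdentifier);
          continue;
        }

        // currentVersionOf() is a hypothetical helper returning a version stamp
        // from the repository for this document.
        String newVersion = currentVersionOf(documentIdentifier);
        if (!activities.checkDocumentNeedsReindexing(documentIdentifier, newVersion))
          continue; // unchanged since the last crawl; skip the fetch/ingest

        RepositoryDocument rd = readContent(documentIdentifier); // hypothetical helper
        activities.ingestDocumentWithException(documentIdentifier, newVersion,
            "myrepo://" + documentIdentifier, rd); // URI scheme is illustrative only
      }
    }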
> > > > > > I just want to process the particular documents I specify on each iteration every 60 seconds -- no more, no less, and yet I seem unable to build a connector that does this.
> > > > > >
> > > > > > If I move to a non-continuous mode, do I really have to create 1440 schedule objects, one for each minute of each day? The way the schedule seems to be put together, I don't see a way to just schedule every minute with one schedule. I would have expected schedules to just use cron expressions.
> > > > > >
> > > > > > If I move to design #2 in my OP and have one "virtual document" to just avoid the seeding stage altogether, then is there some place where I can store the delta token state? Or does my connector have to create its own db table to store this?
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >
> > > > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > > > >
> > > > > > > You were saying that every minute addSeedDocuments is being called, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it completes?
> > > > > > >
> > > > > > > The reason I ask is because continuous crawling has its own unique ways of dealing with documents it has crawled. It uses "exponential backoff" to schedule the next document crawl and that is probably why you see the documents in the queue but not being processed; you simply haven't waited long enough.
> > > > > > > Karl
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Here are my addSeedDocuments and processDocuments methods, simplified down to the minimum necessary to show what is happening:
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > > > > >                                String lastSeedVersion, long seedTime, int jobMode)
> > > > > > > >     throws ManifoldCFException, ServiceInterruption
> > > > > > > > {
> > > > > > > >   // return the same 3 docs every time, simulating an initial load, and then
> > > > > > > >   // these 3 docs changing constantly
> > > > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > > > >   activities.addSeedDocument("100");
> > > > > > > >   activities.addSeedDocument("110");
> > > > > > > >   activities.addSeedDocument("120");
> > > > > > > >   System.out.println("SEEDING DONE");
> > > > > > > >   return null;
> > > > > > > > }
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > > > > >                              Specification spec, IProcessActivity activities,
> > > > > > > >                              int jobMode, boolean usesDefaultAuthority)
> > > > > > > >     throws ManifoldCFException, ServiceInterruption {
> > > > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
> > > > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > > > >   //   activities.deleteDocument(documentIdentifier);
> > > > > > > >   // }
> > > > > > > >
> > > > > > > >   // I've commented out all subsequent logic here, but adding the call to
> > > > > > > >   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
> > > > > > > >   // does not change anything
> > > > > > > > }
> > > > > > > >
> > > > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with MODEL_ADD_CHANGE, the output of this is:
> > > > > > > >
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [100]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [120]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [110]
> > > > > > > > -=-= SeedTime=1558733549367
> > > > > > > > -=-= SeedTime=1558733609384
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > etc.
> > > > > > > >
> > > > > > > > The "PROCESS DOCUMENTS" output is shown once for each of 100, 110, and 120, and then never again, even though "SEEDING DONE" is printing every minute. If and only if I uncomment the for loop which deletes the documents does "processDocuments" get called again for those seed document ids.
> > > > > > > >
> > > > > > > > I do note that the queue shows documents 100, 110, and 120 in state "Waiting for processing", and nothing I do seems to affect that. The database update in JobQueue.updateExistingRecordInitial is a no-op for these docs, as the status of them is STATUS_PENDINGPURGATORY and the update does not actually change anything in the db.
> > > > > > > > Regards,
> > > > > > > > Raman
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > For any given job run, all documents that are added via addSeedDocuments() should be processed. There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being honored, could you try changing your connector model to MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work using that model? If it does *not*, then clearly you've got some kind of implementation problem at the addSeedDocuments() level, because most of the Manifold connectors use that model.
> > > > > > > > >
> > > > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that?
> > > > > > > > > >
> > > > > > > > > > Yes, the delta API gives us all the changed, added, and deleted documents, and those are exactly the ones that we are including.
> > > > > > > > > >
> > > > > > > > > > > If you are, the next thing to look at is the computation of the version string. The version string is what is used to figure out if a change took place. You need this IN ADDITION TO the addSeedDocuments() doing the right thing. For deleted documents, obviously the processDocuments() should call the activities.deleteDocument() method.
> > > > > > > > > >
> > > > > > > > > > The version String is calculated by `processDocuments`. However, after `addSeedDocuments` is called once for document A version 1, `processDocuments` is never called again for that document, even though it has been modified to document A version 2. Therefore, our connector never gets a chance to return the "version 2" string.
> > > > > > > > > >
> > > > > > > > > > > Does this sound like what your code is doing?
> > > > > > > > > > Yes, as far as we can, given that `processDocuments` is only called once for any particular document identifier.
> > > > > > > > > >
> > > > > > > > > > > Karl
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API. Each call to the delta API provides the changes, as well as a token which can be provided on subsequent calls to get changes since that token was generated/returned.
> > > > > > > > > > > >
> > > > > > > > > > > > What is the best approach to building a repo connector to a system that has this type of delta API?
> > > > > > > > > > > >
> > > > > > > > > > > > Our first design was an implementation that specifies `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > > > >
> > > > > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in the source system. On subsequent calls, we use the delta API to seed every added, modified, or deleted file. We return the delta API token as the version value of addSeedDocuments, so that it can be used on subsequent calls.
> > > > > > > > > > > >
> > > > > > > > > > > > * In processDocuments, we do the usual thing for each document identifier.
> > > > > > > > > > > >
> > > > > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is never triggered for modified and deleted docs.
> > > > > > > > > > > >
> > > > > > > > > > > > A second design we are considering is to use MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only one "virtual" document, which represents the root of the remote repo.
> > > > > > > > > > > >
> > > > > > > > > > > > Then, in "processDocuments" the new "document" is used to determine all the child documents of that delta call, which are then added to the queue via `activities.addDocumentReference`. To force the "virtual seed" to trigger processDocuments again on the next call to `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as well.
> > > > > > > > > > > >
> > > > > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Raman Gupta
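
To make design #1 above a bit more concrete, here is a rough sketch of the seeding side, with the delta token riding along as the seeding version string: lastSeedVersion carries the token from the previous run (empty on the first run), and the new token is returned at the end. callDeltaApi(), fullScan(), and the DeltaResult type are hypothetical stand-ins for the real source-system client:

    @Override
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                                   String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption
    {
      if (lastSeedVersion == null || lastSeedVersion.length() == 0)
      {
        // First run: no token yet, so seed every document in the source system.
        // fullScan() and DeltaResult are hypothetical.
        DeltaResult initial = fullScan();
        for (String id : initial.documentIds)
          activities.addSeedDocument(id);
        return initial.nextToken;
      }

      // Subsequent runs: ask the delta API for everything added, modified, or
      // deleted since the token returned last time. callDeltaApi() is hypothetical.
      DeltaResult delta = callDeltaApi(lastSeedVersion);
      for (String id : delta.changedOrDeletedDocumentIds)
        activities.addSeedDocument(id);

      // The returned string becomes lastSeedVersion on the next seeding pass.
      // Deletions reported by the delta API would then be handled in
      // processDocuments() via activities.deleteDocument().
      return delta.nextToken;
    }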