(1) There should be no new tables needed for any of this.  Your seed list
can be stored in the job specification information.  See the rss connector
for a simple example of how this might be done.

(2) If you have switched to MODEL_ALL, then all you need to do is provide a
mechanism for determining, for any given document, which seed it came from,
and simply look for that seed in the job specification.  If it's not there, call
activities.removeDocument().
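
For example, something along these lines (a rough sketch only; the "seed"
node type, "url" attribute, and seedOf() mapping are made-up names -- use
whatever your connector actually writes into the job specification, and
derive the seed from the document identifier however makes sense for your
repository):

// (imports needed at the top of the connector source file)
import java.util.HashSet;
import java.util.Set;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.core.interfaces.SpecificationNode;

// Collect the seeds currently present in the job specification.
private static Set<String> seedsInSpec(Specification spec)
{
  Set<String> seeds = new HashSet<>();
  for (int i = 0; i < spec.getChildCount(); i++)
  {
    SpecificationNode sn = spec.getChild(i);
    if ("seed".equals(sn.getType()))
      seeds.add(sn.getAttributeValue("url"));
  }
  return seeds;
}

// Then, inside processDocuments(), for each documentIdentifier:
//   String seed = seedOf(documentIdentifier);   // your own document-to-seed mapping
//   if (!seedsInSpec(spec).contains(seed))
//   {
//     activities.removeDocument(documentIdentifier);
//     continue;
//   }
//   ... otherwise process the document normally ...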

Karl


On Mon, May 27, 2019 at 5:16 PM Raman Gupta <rocketra...@gmail.com> wrote:

> One seed per job is an interesting approach, but in the interest of fully
> understanding the alternatives, let me consider choice #2.
>
> > you might want to combine this all into one job, but then you would
> > need to link your documents somehow to the seed they came from, so that if
> > the seed was no longer part of the job specification, it could always be
> > detected as a deletion.
>
> There are good reasons for me to prefer a single job, so how would I
> accomplish this? Should my connector create its own tables and manage
> this state there? Or is there another, more lightweight approach?
>
> > Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE,
> > because under that scheme you'd need to *detect* the deletion, because you
> > wouldn't be told by the repository that somebody had changed the
> > configuration.
>
> That is fine and I understand completely -- I forgot to mention in my
> previous message that I've already switched to MODEL_ALL, and am
> detecting and providing the list of currently active seeds on every
> call to addSeedDocuments.
>
> Regards,
> Raman
>
> On Mon, May 27, 2019 at 4:55 PM Karl Wright <daddy...@gmail.com> wrote:
> >
> > This is very different from the design you originally told me you were
> > going to do.
> >
> > Generally, using hopcounts for managing your documents is a bad practice;
> > this is expensive to do and almost always yields unexpected results.
> > You could have one job per seed, which means all you have to do to make the
> > seed go away is delete the job corresponding to it.  If you have way too
> > many seeds for that, you might want to combine this all into one job, but
> > then you would need to link your documents somehow to the seed they came
> > from, so that if the seed was no longer part of the job specification, it
> > could always be detected as a deletion.  Unfortunately, this is
> > inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd
> > need to *detect* the deletion, because you wouldn't be told by the
> > repository that somebody had changed the configuration.
> >
> > So two choices: (1) Exactly one seed per job, or (2) don't use
> > MODEL_ADD_CHANGE_DELETE.
> >
> > Karl
> >
> >
> > On Mon, May 27, 2019 at 4:38 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > > Thanks for your help, Karl. So I think I'm converging on a design.
> > > First of all, per your recommendation, I've switched to scheduled
> > > crawl and it executes as expected every minute with the "schedule
> > > window anytime" setting.
> > >
> > > My next problem is dealing with seed deletion. My upstream source
> > > actually has multiple "roots", i.e. each root has its own set of
> > > documents, and the delta API must be called once for each "root". To
> > > deal with this, I'm specifying each "root" as a "seed document", and
> > > each such root/seed creates "contained_in" documents. It is also
> > > possible for a "root" to be deleted by a user of the upstream system.
> > >
> > > My job is defined with an accurate hopcount as follows:
> > >
> > > "job": {
> > >   ... snip naming, scheduling, output connectors, doc spec....
> > >   "hopcount_mode": "accurate",
> > >   "hopcount": {
> > >     "link_type": "contained_in",
> > >     "count": 1
> > >   }
> > > }
> > >
> > > For each seed, in processDocuments I am doing:
> > >
> > > activities.addDocumentReference("... doc identifier ...",
> > > seedDocumentIdentifier, "contained_in");
> > >
> > > and then this triggers processDocuments for each of those documents,
> > > as expected.
> > >
> > > How do I code the connector so that I can remove the documents
> > > that are now unreachable due to the deleted seed? I don't see any
> > > calls to `processDocuments` via the framework that would allow me to
> > > do this.
> > >
> > > Regards,
> > > Raman
> > >
> > >
> > > On Fri, May 24, 2019 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
> > > >
> > > > Hi Raman,
> > > >
> > > > (1) Continuous crawl is not a good model for you.  It's meant for crawling
> > > > large web domains, not the kind of task you are doing.
> > > > (2) Scheduled crawl will work fine for you if you simply tell it "start
> > > > within schedule window" and make sure your schedule completely covers 7x24
> > > > times.  So you can do this with one record, which triggers on every day of
> > > > the week, that has a schedule window of 24 hours.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode
> > > > > works, but given we have a delta API, we thought this is what makes
> > > > > sense, as the delta API is efficient and we don't need to wait an
> > > > > entire day for a scheduled job to run. I see that if I change recrawl
> > > > > interval and max recrawl interval also to 1 minute, then my documents
> > > > > do get processed each time. However, now we have the opposite problem:
> > > > > now the documents are reprocessed every minute, regardless of whether
> > > > > they were reseeded or not, which makes no sense to me. If I am using
> > > > > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
> > > > > then why are the same documents being reprocessed over and over? I
> > > > > have sent the output to the NullOutput using
> > > > > `ingestDocumentWithException` and the status shows OK, and yet the
> > > > > same documents are repeatedly sent to processDocuments.
> > > > >
> > > > > I just want to process the particular documents I specify on each
> > > > > iteration every 60 seconds -- no more, no less, and yet I seem unable
> > > > > to build a connector that does this.
> > > > >
> > > > > If I move to a non-continuous mode, do I really have to create 1440
> > > > > schedule objects, one for each minute of each day? The way the
> > > > > schedule seems to be put together, I don't see a way to just schedule
> > > > > every minute with one schedule. I would have expected schedules to
> > > > > just use cron expressions.
> > > > >
> > > > > If I move to design #2 in my OP and have one "virtual document" to
> > > > > just avoid the seeding stage altogether, then is there some place
> > > > > where I can store the delta token state? Or does my connector have to
> > > > > create its own db table to store this?
> > > > >
> > > > > Regards,
> > > > > Raman
> > > > >
> > > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > >
> > > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > > >
> > > > > > You were saying that every minute addSeedDocuments is being called,
> > > > > > correct?  It sounds to me like you are running this job in continuous
> > > > > > crawl mode.  Can you try running the job in non-continuous mode, and
> > > > > > just repeating the job run once it completes?
> > > > > >
> > > > > > The reason I ask is because continuous crawling has very unique kinds of
> > > > > > ways of dealing with documents it has crawled.  It uses "exponential
> > > > > > backoff" to schedule the next document crawl and that is probably why you
> > > > > > see the documents in the queue but not being processed; you simply haven't
> > > > > > waited long enough.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > >
> > > > > > > Here are my addSeedDocuments and processDocuments methods, simplified
> > > > > > > down to the minimum necessary to show what is happening:
> > > > > > >
> > > > > > > @Override
> > > > > > > public String addSeedDocuments(ISeedingActivity activities,
> > > > > > >     Specification spec, String lastSeedVersion, long seedTime, int jobMode)
> > > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > > {
> > > > > > >   // Return the same 3 docs every time, simulating an initial load, and
> > > > > > >   // then these 3 docs changing constantly.
> > > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > > >   activities.addSeedDocument("100");
> > > > > > >   activities.addSeedDocument("110");
> > > > > > >   activities.addSeedDocument("120");
> > > > > > >   System.out.println("SEEDING DONE");
> > > > > > >   return null;
> > > > > > > }
> > > > > > >
> > > > > > > @Override
> > > > > > > public void processDocuments(String[] documentIdentifiers,
> > > > > > >     IExistingVersions statuses, Specification spec,
> > > > > > >     IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
> > > > > > >   throws ManifoldCFException, ServiceInterruption {
> > > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > > > >       Arrays.deepToString(documentIdentifiers));
> > > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > > >   //   activities.deleteDocument(documentIdentifier);
> > > > > > >   // }
> > > > > > >
> > > > > > >   // I've commented out all subsequent logic here, but adding the call to
> > > > > > >   //   activities.ingestDocumentWithException(documentIdentifier,
> > > > > > >   //     version, documentUri, rd);
> > > > > > >   // does not change anything.
> > > > > > > }
> > > > > > >
> > > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > > > > MODEL_ADD_CHANGE, the output of this is:
> > > > > > >
> > > > > > > -=-= SeedTime=1558733436082
> > > > > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > > > > -=-= SeedTime=1558733549367
> > > > > > > -=-= SeedTime=1558733609384
> > > > > > > -=-= SeedTime=1558733436082
> > > > > > > etc.
> > > > > > >
> > > > > > > "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> > > > > > > never again, even though "SEEDING DONE" is printing every minute.  If
> > > > > > > and only if I uncomment the for loop which deletes the documents does
> > > > > > > "processDocuments" get called again for those seed document ids.
> > > > > > >
> > > > > > > I do note that the queue shows documents 100, 110, and 120 in state
> > > > > > > "Waiting for processing", and nothing I do seems to affect that.  The
> > > > > > > database update in JobQueue.updateExistingRecordInitial is a no-op for
> > > > > > > these docs, as the status of them is STATUS_PENDINGPURGATORY and the
> > > > > > > update does not actually change anything in the db.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Raman
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > For any given job run, all documents that are added via addSeedDocuments()
> > > > > > > > should be processed.  There is no magic in the framework that somehow knows
> > > > > > > > that a document has been created vs. modified vs. deleted until
> > > > > > > > processDocuments() is called.  If your claim is that this contract is not
> > > > > > > > being honored, could you try changing your connector model to
> > > > > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> > > > > > > > using that model.  If it does *not*, then clearly you've got some kind of
> > > > > > > > implementation problem at the addSeedDocuments() level, because most of the
> > > > > > > > Manifold connectors use that model.
> > > > > > > >
> > > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> > > > > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > > >
> > > > > > > > Karl
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > > > > > > > > > that you have to include *at least* the documents that were changed, added,
> > > > > > > > > > or deleted since the previous stamp, and if no stamp is provided, it should
> > > > > > > > > > return ALL specified documents.  Are you doing that?
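> > > > > > > > > >
> > > > > > > > > > In code, the shape is roughly this (just a sketch --
> > > > > > > > > > listAllDocumentIds(), listChangedDocumentIds() and currentDeltaToken()
> > > > > > > > > > are stand-ins for whatever your delta API client actually provides):
> > > > > > > > > >
> > > > > > > > > > @Override
> > > > > > > > > > public String addSeedDocuments(ISeedingActivity activities, Specification spec,
> > > > > > > > > >     String lastSeedVersion, long seedTime, int jobMode)
> > > > > > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > > > > > {
> > > > > > > > > >   if (lastSeedVersion == null)
> > > > > > > > > >   {
> > > > > > > > > >     // No previous stamp: seed ALL documents covered by this job.
> > > > > > > > > >     for (String id : listAllDocumentIds(spec))            // hypothetical client call
> > > > > > > > > >       activities.addSeedDocument(id);
> > > > > > > > > >   }
> > > > > > > > > >   else
> > > > > > > > > >   {
> > > > > > > > > >     // Previous stamp: seed at least every document added, changed, or
> > > > > > > > > >     // deleted since that stamp; processDocuments() sorts out which is which.
> > > > > > > > > >     for (String id : listChangedDocumentIds(spec, lastSeedVersion))  // hypothetical
> > > > > > > > > >       activities.addSeedDocument(id);
> > > > > > > > > >   }
> > > > > > > > > >   // Whatever is returned here is handed back as lastSeedVersion next time.
> > > > > > > > > >   return currentDeltaToken(spec);                         // hypothetical client call
> > > > > > > > > > }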
> > > > > > > > >
> > > > > > > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > > > > > > documents, and those are exactly the ones that we are including.
> > > > > > > > >
> > > > > > > > > > If you are, the next thing to look at is the computation of the version
> > > > > > > > > > string.  The version string is what is used to figure out if a change took
> > > > > > > > > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the right
> > > > > > > > > > thing.  For deleted documents, obviously the processDocuments() should call
> > > > > > > > > > the activities.deleteDocument() method.
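> > > > > > > > > >
> > > > > > > > > > Roughly like this (again just a sketch -- DocInfo, fetchDocumentInfo()
> > > > > > > > > > and the getters on it are stand-ins for your repository client; the
> > > > > > > > > > activities.* calls are the usual IProcessActivity methods):
> > > > > > > > > >
> > > > > > > > > > @Override
> > > > > > > > > > public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
> > > > > > > > > >     Specification spec, IProcessActivity activities, int jobMode,
> > > > > > > > > >     boolean usesDefaultAuthority)
> > > > > > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > > > > > {
> > > > > > > > > >   for (String documentIdentifier : documentIdentifiers)
> > > > > > > > > >   {
> > > > > > > > > >     DocInfo info = fetchDocumentInfo(documentIdentifier);  // hypothetical client call
> > > > > > > > > >     if (info == null || info.isDeleted())
> > > > > > > > > >     {
> > > > > > > > > >       // Gone upstream: remove it from the index and the queue.
> > > > > > > > > >       activities.deleteDocument(documentIdentifier);
> > > > > > > > > >       continue;
> > > > > > > > > >     }
> > > > > > > > > >     // Build the version string from something that changes whenever the
> > > > > > > > > >     // content changes, e.g. a modification timestamp or content hash.
> > > > > > > > > >     String versionString = String.valueOf(info.getModifiedTime());
> > > > > > > > > >     if (!activities.checkDocumentNeedsReindexing(documentIdentifier, versionString))
> > > > > > > > > >       continue;
> > > > > > > > > >     RepositoryDocument rd = new RepositoryDocument();
> > > > > > > > > >     // ... populate rd (binary stream, metadata, ACLs) from info ...
> > > > > > > > > >     activities.ingestDocumentWithException(documentIdentifier, versionString,
> > > > > > > > > >         info.getUri(), rd);
> > > > > > > > > >   }
> > > > > > > > > > }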
> > > > > > > > >
> > > > > > > > > The version String is calculated by `processDocuments`. But after
> > > > > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > > > > `processDocuments` is never called again for that document, even
> > > > > > > > > though it has been modified to document A version 2. Therefore, our
> > > > > > > > > connector never gets a chance to return the "version 2" string.
> > > > > > > > >
> > > > > > > > > > Does this sound like what your code is doing?
> > > > > > > > >
> > > > > > > > > Yes, as far as we can go given the fact that `processDocuments` is
> > > > > > > > > only called once for any particular document identifier.
> > > > > > > > >
> > > > > > > > > > Karl
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > My team is creating a new repository connector. The source system has
> > > > > > > > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > > > > > > > individual folders and documents since the last call to the API.  Each
> > > > > > > > > > > call to the delta API provides the changes, as well as a token which
> > > > > > > > > > > can be provided on subsequent calls to get changes since that token
> > > > > > > > > > > was generated/returned.
> > > > > > > > > > >
> > > > > > > > > > > What is the best approach to building a repo connector to a system
> > > > > > > > > > > that has this type of delta API?
> > > > > > > > > > >
> > > > > > > > > > > Our first design was an implementation that specifies
> > > > > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > > >
> > > > > > > > > > > * In addSeedDocuments, on the initial call we seed every document in
> > > > > > > > > > > the source system. On subsequent calls, we use the delta API to seed
> > > > > > > > > > > every added, modified, or deleted file. We return the delta API token
> > > > > > > > > > > as the version value of addSeedDocuments, so that it can be used on
> > > > > > > > > > > subsequent calls.
> > > > > > > > > > >
> > > > > > > > > > > * In processDocuments, we do the usual thing for each document
> > > > > > > > > > > identifier.
> > > > > > > > > > >
> > > > > > > > > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > > > > > > > > never triggered for modified and deleted docs.
> > > > > > > > > > >
> > > > > > > > > > > A second design we are considering is to use
> > > > > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > > > > > > > > > one "virtual" document, which represents the root of the remote repo.
> > > > > > > > > > >
> > > > > > > > > > > Then, in "processDocuments" the new "document" is used to determine
> > > > > > > > > > > all the child documents of that delta call, which are then added to
> > > > > > > > > > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > > > > > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > > > > > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > > > > > > > > > well.
> > > > > > > > > > >
> > > > > > > > > > > With this alternative design, the stage 1 seed effectively becomes a
> > > > > > > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Raman Gupta
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>
