Hi Robert,

I've found that it's not that simple. There is still some element of
randomness attached to this issue. After running the bisect several more
times, I've found that commit 0a13d3467aa78b46ec33ae5687418685f90a9e12 seems
to work *most* of the time. There are still times when I get the error, but
it recovers on the next run.
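
One way to cope with this kind of randomness in a bisect is to have
`git bisect run` call a wrapper that reports a commit as good only if the
check passes several times in a row. The following is only a sketch under
assumptions: the script name `flaky_bisect.py`, the check command
`./check-startup.sh`, and the attempt count are hypothetical and not taken
from this thread. It relies on the `git bisect run` convention that exit
code 0 means good and 1 means bad.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Exit codes understood by `git bisect run`: 0 = good, 1 = bad.
GOOD, BAD = 0, 1


def bisect_verdict(check: Callable[[], bool], attempts: int = 5) -> int:
    """Return GOOD only if `check` passes on every attempt, otherwise BAD."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts):
        if not check():
            # A single failure is enough to mark this commit as bad.
            return BAD
    return GOOD


def command_check(cmd: Sequence[str]) -> Callable[[], bool]:
    """Wrap a command so that exit status 0 counts as a pass."""
    return lambda: subprocess.run(list(cmd)).returncode == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   git bisect run python3 flaky_bisect.py ./check-startup.sh
    sys.exit(bisect_verdict(command_check(sys.argv[1:])))
```

Repeating the check makes each bisect step slower, but a commit that fails
only some of the time is less likely to be marked good by accident.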

Carlos

On Thu, Mar 19, 2020 at 6:21 AM Robert Munteanu <romb...@apache.org> wrote:

> That's good info, thank you! I've added some details to the Jira issue.
> I tried reverting the commits I suspect are at fault:
>
> - https://github.com/apache/sling-org-apache-sling-jcr-base/commit/6f5771a
> - https://github.com/apache/sling-org-apache-sling-jcr-base/commit/3de2b9f
>
> But that failed due to conflicts. I will try and manually remove the
> changes and see what that does.
> Robert
>
> On Wed, 2020-03-18 at 21:24 -0400, Carlos Munoz wrote:
> > I went through the bisect process and I got the first bad commit:
> >
> > commit bb1e10d97f3c163fb87917ea782afff674050891
> > Author: Eric Norman <enor...@apache.org>
> > Date:   Sun Dec 16 12:33:08 2018 -0800
> >
> >     switch to released JCR Base 3.0.6
> >
> > (I tried it a couple of times just to be sure)
> >
> > I tried running our app with the commit before that and it runs
> > (there are other, unrelated problems).
> >
> >
> > On Mon, Mar 16, 2020 at 6:12 PM Robert Munteanu <romb...@apache.org>
> > wrote:
> >
> > > Hi Carlos,
> > >
> > > Apologies for the delay ...
> > >
> > > What I was thinking of doing myself, but did not have the time for, is
> > > the following:
> > >
> > > 1. Find a version of Sling for which the scenario in SLING-9118 works.
> > >    Perhaps Sling Starter 11 is a good start.
> > > 2. Run a `git bisect` check between Sling Starter 11 and the current
> > >    master branch.
> > >
> > > Assuming my guess is correct, git would say
> > >
> > > Bisecting: 36 revisions left to test after this (roughly 5 steps)
> > > [c1aedf7b292f7835ceb4e2f56fedcb3294c60756] Update to Tika 1.21
> > >
> > > So not that many steps to test.
> > >
> > > If you manage to isolate the change to the starter that broke this,
> > > it would make it much easier to understand where the problem is
> > > coming from.
> > >
> > > Thanks!
> > > Robert
> > >
> > > On Mon, 2020-03-16 at 16:27 -0400, Carlos Munoz wrote:
> > > > Hi Robert,
> > > >
> > > > Just a friendly ping about this issue :)
> > > >
> > > > We could try to submit a fix with some potential guidance from you.
> > > > For example, which of the many Sling bundles should we start looking
> > > > at?
> > > >
> > > > Regards,
> > > >
> > > > Carlos
> > > >
> > > >
> > > > On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <camu...@redhat.com>
> > > > wrote:
> > > >
> > > > > Thanks Robert. As always your help is appreciated.
> > > > >
> > > > > > On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <romb...@apache.org>
> > > > > > wrote:
> > > > >
> > > > > > Thanks, Ben,
> > > > > >
> > > > > > I added a bit more detail, based on our mailing list
> > > > > > conversations.
> > > > > > I'll have limited access in the next two weeks, but if no one
> > > > > > picks it
> > > > > > up I'll look into it when I get back.
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
> > > > > > > > I went ahead and created
> > > > > > > > https://issues.apache.org/jira/browse/SLING-9118
> > > > > > > > for this. Although the ultimate goal here is containerization,
> > > > > > > > I neglected to include any details to that effect in the
> > > > > > > > ticket, since the behavior is reproducible without that being
> > > > > > > > a complicating factor.
> > > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <
> > > > > > > romb...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> > > > > > > > > I am following along conceptually - I want to make sure
> > > > > > > > > I
> > > > > > > > > understand
> > > > > > > > > what's
> > > > > > > > > being described.
> > > > > > > > >
> > > > > > > > > Let's say Sling Instance A starts successfully the
> > > > > > > > > first
> > > > > > > > > time. If
> > > > > > > > > we
> > > > > > > > > restart Sling Instance A, we expect subsequent restarts
> > > > > > > > > to
> > > > > > > > > also
> > > > > > > > > succeed,
> > > > > > > > > without removing the sling directory.
> > > > > > > > > Now let's say Sling Instance B does NOT start
> > > > > > > > > successfully
> > > > > > > > > the
> > > > > > > > > first
> > > > > > > > > time.
> > > > > > > > > Despite that, we expect subsequent restarts to succeed
> > > > > > > > > without
> > > > > > > > > removing the
> > > > > > > > > sling directory.
> > > > > > > > >
> > > > > > > > > Correct so far?
> > > > > > > >
> > > > > > > > Yes, correct.
> > > > > > > >
> > > > > > > > > Assuming yes... what if this is running in k8s, and k8s
> > > > > > > > > sees that
> > > > > > > > > Sling
> > > > > > > > > Instance B did not start successfully, and kills the
> > > > > > > > > pod
> > > > > > > > > (removing
> > > > > > > > > all pod
> > > > > > > > > resources, including that pod's sling directory) in
> > > > > > > > > response?
> > > > > > > > > Presumably,
> > > > > > > > > k8s would then start Sling Instance C, which is a fresh
> > > > > > > > > instance
> > > > > > > > > with
> > > > > > > > > no
> > > > > > > > > sling directory. Are we saying we expect C to have a
> > > > > > > > > 50/50
> > > > > > > > > chance
> > > > > > > > > of
> > > > > > > > > starting successfully? Or have we observed different
> > > > > > > > > behavior?
> > > > > > > >
> > > > > > > > I think that only the first instance starts successfully.
> > > > > > > > Additional
> > > > > > > > instances will not start unless they have a Sling
> > > > > > > > directory
> > > > > > > > set up.
> > > > > > > >
> > > > > > > > I've tested with a third instance, once two instances are
> > > > > > > > up,
> > > > > > > > and
> > > > > > > > it
> > > > > > > > has the exact same behaviour.
> > > > > > > >
> > > > > > > > One workaround that I can suggest for a containerized
> > > > > > > > environment
> > > > > > > > is to
> > > > > > > > use a supervisor script that detects the abnormal startup
> > > > > > > > problem
> > > > > > > > and
> > > > > > > > restarts Sling, so that it starts up successfully.
> > > > > > > >
> > > > > > > > > Another would be to persist the 'sling' directory as a
> > > > > > > > > per-container volume. Not sure how easy that is with k8s, but
> > > > > > > > > maybe you can use a single ReadWriteMany volume at /sling, and
> > > > > > > > > each pod gets its own ${sling.home} at /sling/${containerId}
> > > > > > > > > (assuming that is exposed through the downward API).
> > > > > > > >
> > > > > > > > > As these are workarounds, I would still very much like
> > > > > > > > to
> > > > > > > > see this
> > > > > > > > fixed properly, so please file a bug to track this.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Robert
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Ben
> > > > > > > > >
> > > > > > > > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <camu...@redhat.com>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the information Robert.
> > > > > > > > > >
> > > > > > > > > > To replicate the issue all I needed was a mongodb (I
> > > > > > > > > > used
> > > > > > > > > > a
> > > > > > > > > > full
> > > > > > > > > > replica
> > > > > > > > > > set, see my instructions in a previous email about
> > > > > > > > > > how to
> > > > > > > > > > get
> > > > > > > > > > one
> > > > > > > > > > going
> > > > > > > > > > using podman) and a single process running sling.
> > > > > > > > > >
> > > > > > > > > > The problem does happen when I do the following:
> > > > > > > > > >
> > > > > > > > > > 2. Start Sling instance A, wait for it to start
> > > > > > > > > > 3. Stop Sling instance A, wait for it to stop
> > > > > > > > > > 4. Start Sling instance B - Error
> > > > > > > > > >
> > > > > > > > > > but let me add more
> > > > > > > > > >
> > > > > > > > > > 5. Start Sling Instance A again - Success (note I
> > > > > > > > > > didn't
> > > > > > > > > > remove
> > > > > > > > > > the
> > > > > > > > > > sling
> > > > > > > > > > dir)
> > > > > > > > > > 6. Start Sling instance B again - Success (note I
> > > > > > > > > > didn't
> > > > > > > > > > remove
> > > > > > > > > > the
> > > > > > > > > > sling
> > > > > > > > > > dir)
> > > > > > > > > >
> > > > > > > > > > this means that even if Sling recreates the sling
> > > > > > > > > > directory and
> > > > > > > > > > fails the
> > > > > > > > > > startup, next time it will succeed. Unfortunately we
> > > > > > > > > > don't have
> > > > > > > > > > that luxury
> > > > > > > > > > in containers because the sling directory is not
> > > > > > > > > > persisted.
> > > > > > > > > >
> > > > > > > > > > I think this is a bug, but I'll keep playing with it
> > > > > > > > > > a
> > > > > > > > > > bit to
> > > > > > > > > > see
> > > > > > > > > > if I can
> > > > > > > > > > find out more.
> > > > > > > > > >
> > > > > > > > > > Carlos
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <romb...@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Robert I managed to replicate the issue in a
> > > > > > > > > > > > local,
> > > > > > > > > > > > non-
> > > > > > > > > > > > containerized
> > > > > > > > > > > > environment (!!!).
> > > > > > > > > > > >
> > > > > > > > > > > > The problem seems to be when the database is kept
> > > > > > > > > > > > but
> > > > > > > > > > > > the
> > > > > > > > > > > > 'sling'
> > > > > > > > > > > > directory
> > > > > > > > > > > > is cleared out across restarts (as it is for us
> > > > > > > > > > > > when
> > > > > > > > > > > > the
> > > > > > > > > > > > container
> > > > > > > > > > > > goes
> > > > > > > > > > > > away). As I said before this doesn't seem to be a
> > > > > > > > > > > > problem
> > > > > > > > > > > > with
> > > > > > > > > > > > the
> > > > > > > > > > > > Sling 11
> > > > > > > > > > > > bundles.
> > > > > > > > > > > >
> > > > > > > > > > > > The first basic solution will be to persist the
> > > > > > > > > > > > 'sling'
> > > > > > > > > > > > directory
> > > > > > > > > > > > across
> > > > > > > > > > > > restarts, and I was wondering if this is a bug,
> > > > > > > > > > > > or as
> > > > > > > > > > > > designed.
> > > > > > > > > > >
> > > > > > > > > > > I think this should work.
> > > > > > > > > > >
> > > > > > > > > > > > I also wonder if once persisted, multiple
> > > > > > > > > > > > containers
> > > > > > > > > > > > could
> > > > > > > > > > > > share this
> > > > > > > > > > > > directory.
> > > > > > > > > > >
> > > > > > > > > > > This directory can't be shared, as it holds runtime
> > > > > > > > > > > data
> > > > > > > > > > > related
> > > > > > > > > > > to
> > > > > > > > > > > Sling. For instance, a bundle that is started in
> > > > > > > > > > > instance A
> > > > > > > > > > > could
> > > > > > > > > > > be
> > > > > > > > > > > starting on instance B.
> > > > > > > > > > >
> > > > > > > > > > > There is at least one file ( sling.id ) that holds
> > > > > > > > > > > data
> > > > > > > > > > > that
> > > > > > > > > > > must
> > > > > > > > > > > not
> > > > > > > > > > > be the same between instances.
> > > > > > > > > > >
> > > > > > > > > > > > So I would advise marking the directory as
> > > > > > > > > > > > container-private as a first step.
> > > > > > > > > > >
> > > > > > > > > > > Robert
> > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Carlos
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <
> > > > > > > > > > > > camu...@redhat.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Robert (and once again I can't stress
> > > > > > > > > > > > > enough
> > > > > > > > > > > > > how
> > > > > > > > > > > > > grateful I
> > > > > > > > > > > > > am for
> > > > > > > > > > > > > all your help).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Right now we deploy our container with the
> > > > > > > > > > > > > expectation
> > > > > > > > > > > > > that
> > > > > > > > > > > > > the
> > > > > > > > > > > > > mongo db
> > > > > > > > > > > > > is the only necessary state we need to keep;
> > > > > > > > > > > > > everything
> > > > > > > > > > > > > else
> > > > > > > > > > > > > is
> > > > > > > > > > > > > throwaway.
> > > > > > > > > > > > > This means that a totally new container
> > > > > > > > > > > > > connected
> > > > > > > > > > > > > to the
> > > > > > > > > > > > > mongodb
> > > > > > > > > > > > > should
> > > > > > > > > > > > > pick up the state and run the same as the first
> > > > > > > > > > > > > time it
> > > > > > > > > > > > > was
> > > > > > > > > > > > > fired
> > > > > > > > > > > > > up. Do
> > > > > > > > > > > > > you think this is an incorrect assumption? If
> > > > > > > > > > > > > so,
> > > > > > > > > > > > > what
> > > > > > > > > > > > > are
> > > > > > > > > > > > > other
> > > > > > > > > > > > > pieces of
> > > > > > > > > > > > > state we should be keeping for subsequent
> > > > > > > > > > > > > restarts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This assumption has worked well for us with the
> > > > > > > > > > > > > current
> > > > > > > > > > > > > sling
> > > > > > > > > > > > > 11
> > > > > > > > > > > > > release,
> > > > > > > > > > > > > but it seems to break with the more up-to-date
> > > > > > > > > > > > > bundles.
> > > > > > > > > > > > > Perhaps
> > > > > > > > > > > > > running
> > > > > > > > > > > > > Sling in a container is just not meant to be.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carlos
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <romb...@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Carlos,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos
> > > > > > > > > > > > > > Munoz
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > Thanks Bertrand. How can I run Sling with
> > > > > > > > > > > > > > > DEBUG-level
> > > > > > > > > > > > > > > logs for
> > > > > > > > > > > > > > > every
> > > > > > > > > > > > > > > bundle? I tried passing a few configuration
> > > > > > > > > > > > > > > arguments
> > > > > > > > > > > > > > > from the
> > > > > > > > > > > > > > > command line
> > > > > > > > > > > > > > > but nothing seemed to work.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Try configuring the LogManager to debug at
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > >
> https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Robert
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Carlos
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand
> > > > > > > > > > > > > > > Delacretaz <
> > > > > > > > > > > > > > > bdelacre...@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos
> > > > > > > > > > > > > > > > Munoz
> > > > > > > > > > > > > > > > <
> > > > > > > > > > > > > > > > camu...@redhat.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > ...Is there a reason why the Jcr
> > > > > > > > > > > > > > > > > repository
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > restarting?
> > > > > > > > > > > > > > > > > And what
> > > > > > > > > > > > > > > > > class could we start looking into to
> > > > > > > > > > > > > > > > > debug
> > > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > this is
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > case?...
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It's not uncommon to see extra restarts
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > OSGi
> > > > > > > > > > > > > > > > components at
> > > > > > > > > > > > > > > > startup,
> > > > > > > > > > > > > > > > for various reasons.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The simplest way to detect and log
> > > > > > > > > > > > > > > > multiple
> > > > > > > > > > > > > > > > repository
> > > > > > > > > > > > > > > > startups
> > > > > > > > > > > > > > > > might
> > > > > > > > > > > > > > > > be to implement a
> > > > > > > > > > > > > > > > SlingRepositoryInitializer
> > > > > > > > > > > > > > > > service
> > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > that's
> > > > > > > > > > > > > > > > called
> > > > > > > > > > > > > > > > at every startup, or use the logs of an
> > > > > > > > > > > > > > > > existing
> > > > > > > > > > > > > > > > one
> > > > > > > > > > > > > > > > like the
> > > > > > > > > > > > > > > > JCR
> > > > > > > > > > > > > > > > RepositoryInitializer [2] if that has
> > > > > > > > > > > > > > > > anything to
> > > > > > > > > > > > > > > > process in
> > > > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > system.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Bertrand
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > >
> > >
> https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > > > > > > > > > [2]
> > > > > > > > > > > > > > > >
> > >
> https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
>
>
