Hi Robert,

Just a friendly ping about this issue :)

We could try to submit a fix with some potential guidance from you. For
example, which of the many Sling bundles should we start looking at?
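
In the meantime, here is a rough sketch of the supervisor workaround
suggested in the quoted thread below: start the instance, check whether
startup went well, and restart it if not, without removing the sling
directory. All names here (supervise, start, started_ok, stop) are
hypothetical, and how to detect the abnormal startup is not settled in
this thread, so the check is left as a callable supplied by the caller.

```python
import time
from typing import Callable


def supervise(start: Callable[[], None],
              started_ok: Callable[[], bool],
              stop: Callable[[], None],
              max_attempts: int = 3,
              settle_seconds: float = 0.0) -> int:
    """Start a service, restarting it until the startup check passes.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. The callables are placeholders:
    this sketch does not know how Sling is launched or checked.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        # Give the instance some time to come up before checking it.
        time.sleep(settle_seconds)
        if started_ok():
            return attempt
        # Bad startup: stop the instance and try again. The sling
        # directory is deliberately left in place between attempts.
        stop()
    raise RuntimeError(f"startup failed after {max_attempts} attempts")
```

In a container, start might launch the Sling process and started_ok
might poll an HTTP endpoint or scan the log, but both of those are
assumptions on my part and not something we have verified.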

Regards,

Carlos


On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <camu...@redhat.com> wrote:

> Thanks Robert. As always your help is appreciated.
>
> On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <romb...@apache.org>
> wrote:
>
>> Thanks, Ben,
>>
>> I added a bit more detail, based on our mailing list conversations.
>> I'll have limited access in the next two weeks, but if no one picks it
>> up I'll look into it when I get back.
>>
>> Thanks,
>> Robert
>>
>> On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
>> > I went ahead and created
>> > https://issues.apache.org/jira/browse/SLING-9118
>> > for this. Although the ultimate goal here is containerization, I
>> > neglected to include any details to that effect in the ticket, since
>> > the behavior is reproducible without that being a complicating factor.
>> >
>> > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <romb...@apache.org>
>> > wrote:
>> >
>> > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
>> > > > I am following along conceptually - I want to make sure I
>> > > > understand what's being described.
>> > > >
>> > > > Let's say Sling Instance A starts successfully the first time. If
>> > > > we restart Sling Instance A, we expect subsequent restarts to also
>> > > > succeed, without removing the sling directory.
>> > > > Now let's say Sling Instance B does NOT start successfully the
>> > > > first time. Despite that, we expect subsequent restarts to succeed
>> > > > without removing the sling directory.
>> > > >
>> > > > Correct so far?
>> > >
>> > > Yes, correct.
>> > >
>> > > > Assuming yes... what if this is running in k8s, and k8s sees that
>> > > > Sling Instance B did not start successfully, and kills the pod
>> > > > (removing all pod resources, including that pod's sling directory)
>> > > > in response? Presumably, k8s would then start Sling Instance C,
>> > > > which is a fresh instance with no sling directory. Are we saying
>> > > > we expect C to have a 50/50 chance of starting successfully? Or
>> > > > have we observed different behavior?
>> > >
>> > > I think that only the first instance starts successfully.
>> > > Additional instances will not start unless they have a Sling
>> > > directory set up.
>> > >
>> > > I've tested with a third instance, once two instances are up, and
>> > > it has the exact same behaviour.
>> > >
>> > > One workaround that I can suggest for a containerized environment
>> > > is to use a supervisor script that detects the abnormal startup
>> > > problem and restarts Sling, so that it starts up successfully.
>> > >
>> > > Another would be to persist the 'sling' directory as a
>> > > per-container volume. Not sure how easy that is with k8s, but maybe
>> > > you can use a single ReadWriteMany volume at /sling, and each pod
>> > > gets their own ${sling.home} at /sling/${containerId} (assuming
>> > > that is exposed through the downward API).
>> > >
>> > > As these are workarounds, I would still very much like to see this
>> > > fixed properly, so please file a bug to track this.
>> > >
>> > > Thanks,
>> > > Robert
>> > >
>> > > > Thanks,
>> > > > Ben
>> > > >
>> > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz
>> > > > <camu...@redhat.com> wrote:
>> > > >
>> > > > > Thanks for the information Robert.
>> > > > >
>> > > > > To replicate the issue all I needed was a mongodb (I used a
>> > > > > full replica set, see my instructions in a previous email about
>> > > > > how to get one going using podman) and a single process running
>> > > > > sling.
>> > > > >
>> > > > > The problem does happen when I do the following:
>> > > > >
>> > > > > 2. Start Sling instance A, wait for it to start
>> > > > > 3. Stop Sling instance A, wait for it to stop
>> > > > > 4. Start Sling instance B - Error
>> > > > >
>> > > > > but let me add more
>> > > > >
>> > > > > 5. Start Sling Instance A again - Success (note I didn't remove
>> > > > >    the sling dir)
>> > > > > 6. Start Sling instance B again - Success (note I didn't remove
>> > > > >    the sling dir)
>> > > > >
>> > > > > This means that even if Sling recreates the sling directory and
>> > > > > fails the startup, next time it will succeed. Unfortunately we
>> > > > > don't have that luxury in containers because the sling directory
>> > > > > is not persisted.
>> > > > >
>> > > > > I think this is a bug, but I'll keep playing with it a bit to
>> > > > > see if I can find out more.
>> > > > >
>> > > > > Carlos
>> > > > >
>> > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu
>> > > > > <romb...@apache.org> wrote:
>> > > > >
>> > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
>> > > > > > > Robert I managed to replicate the issue in a local,
>> > > > > > > non-containerized environment (!!!).
>> > > > > > >
>> > > > > > > The problem seems to be when the database is kept but the
>> > > > > > > 'sling' directory is cleared out across restarts (as it is
>> > > > > > > for us when the container goes away). As I said before this
>> > > > > > > doesn't seem to be a problem with the Sling 11 bundles.
>> > > > > > >
>> > > > > > > The first basic solution will be to persist the 'sling'
>> > > > > > > directory across restarts, and I was wondering if this is a
>> > > > > > > bug, or as designed.
>> > > > > >
>> > > > > > I think this should work.
>> > > > > >
>> > > > > > > I also wonder if once persisted, multiple containers could
>> > > > > > > share this directory.
>> > > > > >
>> > > > > > This directory can't be shared, as it holds runtime data
>> > > > > > related to Sling. For instance, a bundle that is started in
>> > > > > > instance A could be starting on instance B.
>> > > > > >
>> > > > > > There is at least one file ( sling.id ) that holds data that
>> > > > > > must not be the same between instances.
>> > > > > >
>> > > > > > So I would advise marking the directory as container-private
>> > > > > > as a first step.
>> > > > > >
>> > > > > > Robert
>> > > > > >
>> > > > > > > Regards,
>> > > > > > >
>> > > > > > > Carlos
>> > > > > > >
>> > > > > > >
>> > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz
>> > > > > > > <camu...@redhat.com> wrote:
>> > > > > > >
>> > > > > > > > Thanks Robert (and once again I can't stress enough how
>> > > > > > > > grateful I am for all your help).
>> > > > > > > >
>> > > > > > > > Right now we deploy our container with the expectation
>> > > > > > > > that the mongo db is the only necessary state we need to
>> > > > > > > > keep; everything else is throwaway. This means that a
>> > > > > > > > totally new container connected to the mongodb should pick
>> > > > > > > > up the state and run the same as the first time it was
>> > > > > > > > fired up. Do you think this is an incorrect assumption? If
>> > > > > > > > so, what are other pieces of state we should be keeping
>> > > > > > > > for subsequent restarts?
>> > > > > > > >
>> > > > > > > > This assumption has worked well for us with the current
>> > > > > > > > sling 11 release, but it seems to break with the more
>> > > > > > > > up-to-date bundles. Perhaps running Sling in a container
>> > > > > > > > is just not meant to be.
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > >
>> > > > > > > > Carlos
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu
>> > > > > > > > <romb...@apache.org> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Carlos,
>> > > > > > > > >
>> > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
>> > > > > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level
>> > > > > > > > > > logs for every bundle? I tried passing a few
>> > > > > > > > > > configuration arguments from the command line but
>> > > > > > > > > > nothing seemed to work.
>> > > > > > > > >
>> > > > > > > > > Try configuring the LogManager to debug at
>> > > > > > > > >
>> > > > > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Robert
>> > > > > > > > >
>> > > > > > > > > > Carlos
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz
>> > > > > > > > > > <bdelacre...@apache.org> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi,
>> > > > > > > > > > >
>> > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz
>> > > > > > > > > > > <camu...@redhat.com> wrote:
>> > > > > > > > > > > > ...Is there a reason why the Jcr repository could
>> > > > > > > > > > > > be restarting? And what class could we start
>> > > > > > > > > > > > looking into to debug if this is the case?...
>> > > > > > > > > > >
>> > > > > > > > > > > It's not uncommon to see extra restarts of OSGi
>> > > > > > > > > > > components at startup, for various reasons.
>> > > > > > > > > > >
>> > > > > > > > > > > The simplest way to detect and log multiple
>> > > > > > > > > > > repository startups might be to implement a
>> > > > > > > > > > > SlingRepositoryInitializer service [1] that's called
>> > > > > > > > > > > at every startup, or use the logs of an existing one
>> > > > > > > > > > > like the JCR RepositoryInitializer [2] if that has
>> > > > > > > > > > > anything to process in your system.
>> > > > > > > > > > >
>> > > > > > > > > > > -Bertrand
>> > > > > > > > > > >
>> > > > > > > > > > > [1]
>> > > > > > > > > > > https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
>> > > > > > > > > > > [2]
>> > > > > > > > > > > https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
>> > >
>> > >
>>
>>
