I am following along conceptually - I want to make sure I understand what's being described.
Let's say Sling Instance A starts successfully the first time. If we restart Sling Instance A, we expect subsequent restarts to also succeed, without removing the sling directory.

Now let's say Sling Instance B does NOT start successfully the first time. Despite that, we expect subsequent restarts to succeed without removing the sling directory.

Correct so far?

Assuming yes... what if this is running in k8s, and k8s sees that Sling Instance B did not start successfully, and kills the pod (removing all pod resources, including that pod's sling directory) in response? Presumably, k8s would then start Sling Instance C, which is a fresh instance with no sling directory. Are we saying we expect C to have a 50/50 chance of starting successfully? Or have we observed different behavior?

Thanks,
Ben

On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <camu...@redhat.com> wrote:

> Thanks for the information Robert.
>
> To replicate the issue all I needed was a mongodb (I used a full replica
> set, see my instructions in a previous email about how to get one going
> using podman) and a single process running sling.
>
> The problem does happen when I do the following:
>
> 2. Start Sling instance A, wait for it to start
> 3. Stop Sling instance A, wait for it to stop
> 4. Start Sling instance B - Error
>
> but let me add more
>
> 5. Start Sling Instance A again - Success (note I didn't remove the sling dir)
> 6. Start Sling instance B again - Success (note I didn't remove the sling dir)
>
> this means that even if Sling recreates the sling directory and fails the
> startup, next time it will succeed. Unfortunately we don't have that luxury
> in containers because the sling directory is not persisted.
>
> I think this is a bug, but I'll keep playing with it a bit to see if I can
> find out more.
>
> Carlos
>
>
> On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <romb...@apache.org>
> wrote:
>
> > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > Robert I managed to replicate the issue in a local, non-containerized
> > > environment (!!!).
> > >
> > > The problem seems to be when the database is kept but the 'sling'
> > > directory is cleared out across restarts (as it is for us when the
> > > container goes away). As I said before this doesn't seem to be a
> > > problem with the Sling 11 bundles.
> > >
> > > The first basic solution will be to persist the 'sling' directory
> > > across restarts, and I was wondering if this is a bug, or as designed.
> >
> > I think this should work.
> >
> > >
> > > I also wonder if once persisted, multiple containers could share this
> > > directory.
> >
> > This directory can't be shared, as it holds runtime data related to
> > Sling. For instance, a bundle that is started in instance A could be
> > starting on instance B.
> >
> > There is at least one file (sling.id) that holds data that must not
> > be the same between instances.
> >
> > So I would advise marking the directory as container-private as a
> > first step.
> >
> > Robert
> >
> > >
> > > Regards,
> > >
> > > Carlos
> > >
> > >
> > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <camu...@redhat.com>
> > > wrote:
> > >
> > > > Thanks Robert (and once again I can't stress enough how grateful I
> > > > am for all your help).
> > > >
> > > > Right now we deploy our container with the expectation that the
> > > > mongo db is the only necessary state we need to keep; everything
> > > > else is throwaway. This means that a totally new container connected
> > > > to the mongodb should pick up the state and run the same as the
> > > > first time it was fired up. Do you think this is an incorrect
> > > > assumption?
> > > > If so, what are other pieces of state we should be keeping for
> > > > subsequent restarts?
> > > >
> > > > This assumption has worked well for us with the current sling 11
> > > > release, but it seems to break with the more up-to-date bundles.
> > > > Perhaps running Sling in a container is just not meant to be.
> > > >
> > > > Regards,
> > > >
> > > > Carlos
> > > >
> > > >
> > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <romb...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Carlos,
> > > > >
> > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
> > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level logs for
> > > > > > every bundle? I tried passing a few configuration arguments from
> > > > > > the command line but nothing seemed to work.
> > > > >
> > > > > Try configuring the LogManager to debug at
> > > > >
> > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > >
> > > > > Thanks,
> > > > > Robert
> > > > >
> > > > > > Carlos
> > > > > >
> > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz
> > > > > > <bdelacre...@apache.org> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz
> > > > > > > <camu...@redhat.com> wrote:
> > > > > > > > ...Is there a reason why the Jcr repository could be
> > > > > > > > restarting? And what class could we start looking into to
> > > > > > > > debug if this is the case?...
> > > > > > >
> > > > > > > It's not uncommon to see extra restarts of OSGi components at
> > > > > > > startup, for various reasons.
> > > > > > >
> > > > > > > The simplest way to detect and log multiple repository startups
> > > > > > > might be to implement a SlingRepositoryInitializer service [1]
> > > > > > > that's called at every startup, or use the logs of an existing
> > > > > > > one like the JCR RepositoryInitializer [2] if that has anything
> > > > > > > to process in your system.
> > > > > > >
> > > > > > > -Bertrand
> > > > > > >
> > > > > > > [1] https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > [2] https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
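
For reference, a minimal sketch of the startup-counting idea Bertrand describes at the bottom of the thread. This is only an illustration under stated assumptions: `RepositoryInitializerStandIn` and `StartupCounter` are hypothetical names, and the stand-in interface exists only so the sketch compiles without Sling on the classpath. In a real bundle the class would presumably implement `org.apache.sling.jcr.api.SlingRepositoryInitializer` (whose `processRepository` takes a `SlingRepository`) and be registered as an OSGi service.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

// Stand-in (assumption) for org.apache.sling.jcr.api.SlingRepositoryInitializer.
// The real interface receives a SlingRepository; Object is a placeholder here.
interface RepositoryInitializerStandIn {
    void processRepository(Object repo) throws Exception;
}

// Hypothetical class: counts and logs every time the repository is handed to it.
class StartupCounter implements RepositoryInitializerStandIn {
    private static final Logger LOG = Logger.getLogger(StartupCounter.class.getName());

    // Number of times processRepository has been called on this instance.
    private final AtomicInteger startups = new AtomicInteger();

    @Override
    public void processRepository(Object repo) {
        int n = startups.incrementAndGet();
        if (n > 1) {
            // More than one call suggests the repository was started again.
            LOG.warning("Repository initialized again, call number " + n);
        } else {
            LOG.info("Repository initialized for the first time");
        }
    }

    int getStartupCount() {
        return startups.get();
    }
}
```

The counter lives in the instance, so it resets if the component itself is restarted; the log lines are what would show repeated startups in that case.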