Hi Robert,

Just a friendly ping about this issue :)
We could try to submit a fix with some potential guidance from you. For
example, which of the many Sling bundles should we start looking at?

Regards,

Carlos

On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <camu...@redhat.com> wrote:

> Thanks Robert. As always your help is appreciated.
>
> On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <romb...@apache.org>
> wrote:
>
>> Thanks, Ben,
>>
>> I added a bit more detail, based on our mailing list conversations.
>> I'll have limited access in the next two weeks, but if no one picks it
>> up I'll look into it when I get back.
>>
>> Thanks,
>> Robert
>>
>> On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
>> > I went ahead and created
>> > https://issues.apache.org/jira/browse/SLING-9118
>> > for this. Although the ultimate goal here is containerization, I
>> > neglected to include any details to that effect in the ticket, since
>> > the behavior is reproducible without that being a complicating
>> > factor.
>> >
>> > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <romb...@apache.org>
>> > wrote:
>> >
>> > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
>> > > > I am following along conceptually - I want to make sure I
>> > > > understand what's being described.
>> > > >
>> > > > Let's say Sling Instance A starts successfully the first time. If
>> > > > we restart Sling Instance A, we expect subsequent restarts to
>> > > > also succeed, without removing the sling directory.
>> > > > Now let's say Sling Instance B does NOT start successfully the
>> > > > first time. Despite that, we expect subsequent restarts to
>> > > > succeed without removing the sling directory.
>> > > >
>> > > > Correct so far?
>> > >
>> > > Yes, correct.
>> > >
>> > > > Assuming yes...
>> > > > what if this is running in k8s, and k8s sees that Sling
>> > > > Instance B did not start successfully, and kills the pod
>> > > > (removing all pod resources, including that pod's sling
>> > > > directory) in response? Presumably, k8s would then start Sling
>> > > > Instance C, which is a fresh instance with no sling directory.
>> > > > Are we saying we expect C to have a 50/50 chance of starting
>> > > > successfully? Or have we observed different behavior?
>> > >
>> > > I think that only the first instance starts successfully.
>> > > Additional instances will not start unless they have a Sling
>> > > directory set up.
>> > >
>> > > I've tested with a third instance, once two instances are up, and
>> > > it has the exact same behaviour.
>> > >
>> > > One workaround that I can suggest for a containerized environment
>> > > is to use a supervisor script that detects the abnormal startup
>> > > problem and restarts Sling, so that it starts up successfully.
>> > >
>> > > Another would be to persist the 'sling' directory as a
>> > > per-container volume. Not sure how easy that is with k8s, but maybe
>> > > you can use a single ReadWriteMany volume at /sling, and each pod
>> > > gets their own ${sling.home} at /sling/${containerId} (assuming
>> > > that is exposed through the downward API).
>> > >
>> > > As these are workarounds, I would still very much like to see this
>> > > fixed properly, so please file a bug to track this.
>> > >
>> > > Thanks,
>> > > Robert
>> > >
>> > > > Thanks,
>> > > > Ben
>> > > >
>> > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz
>> > > > <camu...@redhat.com> wrote:
>> > > >
>> > > > > Thanks for the information Robert.
>> > > > >
>> > > > > To replicate the issue all I needed was a mongodb (I used a
>> > > > > full replica set, see my instructions in a previous email about
>> > > > > how to get one going using podman) and a single process running
>> > > > > sling.
>> > > > >
>> > > > > The problem does happen when I do the following:
>> > > > >
>> > > > > 2. Start Sling instance A, wait for it to start
>> > > > > 3. Stop Sling instance A, wait for it to stop
>> > > > > 4. Start Sling instance B - Error
>> > > > >
>> > > > > but let me add more
>> > > > >
>> > > > > 5. Start Sling Instance A again - Success (note I didn't remove
>> > > > >    the sling dir)
>> > > > > 6. Start Sling instance B again - Success (note I didn't remove
>> > > > >    the sling dir)
>> > > > >
>> > > > > this means that even if Sling recreates the sling directory and
>> > > > > fails the startup, next time it will succeed. Unfortunately we
>> > > > > don't have that luxury in containers because the sling directory
>> > > > > is not persisted.
>> > > > >
>> > > > > I think this is a bug, but I'll keep playing with it a bit to
>> > > > > see if I can find out more.
>> > > > >
>> > > > > Carlos
>> > > > >
>> > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu
>> > > > > <romb...@apache.org> wrote:
>> > > > >
>> > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
>> > > > > > > Robert I managed to replicate the issue in a local,
>> > > > > > > non-containerized environment (!!!).
>> > > > > > >
>> > > > > > > The problem seems to be when the database is kept but the
>> > > > > > > 'sling' directory is cleared out across restarts (as it is
>> > > > > > > for us when the container goes away).
>> > > > > > > As I said before this doesn't seem to be a problem with
>> > > > > > > the Sling 11 bundles.
>> > > > > > >
>> > > > > > > The first basic solution will be to persist the 'sling'
>> > > > > > > directory across restarts, and I was wondering if this is a
>> > > > > > > bug, or as designed.
>> > > > > >
>> > > > > > I think this should work.
>> > > > > >
>> > > > > > > I also wonder if once persisted, multiple containers could
>> > > > > > > share this directory.
>> > > > > >
>> > > > > > This directory can't be shared, as it holds runtime data
>> > > > > > related to Sling. For instance, a bundle that is started in
>> > > > > > instance A could be starting on instance B.
>> > > > > >
>> > > > > > There is at least one file (sling.id) that holds data that
>> > > > > > must not be the same between instances.
>> > > > > >
>> > > > > > So I would advise marking the directory as container-private
>> > > > > > as a first step.
>> > > > > >
>> > > > > > Robert
>> > > > > >
>> > > > > > > Regards,
>> > > > > > >
>> > > > > > > Carlos
>> > > > > > >
>> > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz
>> > > > > > > <camu...@redhat.com> wrote:
>> > > > > > >
>> > > > > > > > Thanks Robert (and once again I can't stress enough how
>> > > > > > > > grateful I am for all your help).
>> > > > > > > >
>> > > > > > > > Right now we deploy our container with the expectation
>> > > > > > > > that the mongo db is the only necessary state we need to
>> > > > > > > > keep; everything else is throwaway.
>> > > > > > > > This means that a totally new container connected to the
>> > > > > > > > mongodb should pick up the state and run the same as the
>> > > > > > > > first time it was fired up. Do you think this is an
>> > > > > > > > incorrect assumption? If so, what are other pieces of
>> > > > > > > > state we should be keeping for subsequent restarts?
>> > > > > > > >
>> > > > > > > > This assumption has worked well for us with the current
>> > > > > > > > sling 11 release, but it seems to break with the more
>> > > > > > > > up-to-date bundles. Perhaps running Sling in a container
>> > > > > > > > is just not meant to be.
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > >
>> > > > > > > > Carlos
>> > > > > > > >
>> > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu
>> > > > > > > > <romb...@apache.org> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Carlos,
>> > > > > > > > >
>> > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
>> > > > > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level
>> > > > > > > > > > logs for every bundle? I tried passing a few
>> > > > > > > > > > configuration arguments from the command line but
>> > > > > > > > > > nothing seemed to work.
>> > > > > > > > >
>> > > > > > > > > Try configuring the LogManager to debug at
>> > > > > > > > >
>> > > > > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Robert
>> > > > > > > > >
>> > > > > > > > > > Carlos
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz
>> > > > > > > > > > <bdelacre...@apache.org> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi,
>> > > > > > > > > > >
>> > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz
>> > > > > > > > > > > <camu...@redhat.com> wrote:
>> > > > > > > > > > > > ...Is there a reason why the Jcr repository could
>> > > > > > > > > > > > be restarting? And what class could we start
>> > > > > > > > > > > > looking into to debug if this is the case?...
>> > > > > > > > > > >
>> > > > > > > > > > > It's not uncommon to see extra restarts of OSGi
>> > > > > > > > > > > components at startup, for various reasons.
>> > > > > > > > > > >
>> > > > > > > > > > > The simplest way to detect and log multiple
>> > > > > > > > > > > repository startups might be to implement a
>> > > > > > > > > > > SlingRepositoryInitializer service [1] that's
>> > > > > > > > > > > called at every startup, or use the logs of an
>> > > > > > > > > > > existing one like the JCR RepositoryInitializer [2]
>> > > > > > > > > > > if that has anything to process in your system.
>> > > > > > > > > > >
>> > > > > > > > > > > -Bertrand
>> > > > > > > > > > >
>> > > > > > > > > > > [1] https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
>> > > > > > > > > > > [2] https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
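
For reference, here is a rough sketch of the supervisor-script workaround mentioned in the quoted thread: start Sling, wait for it to become healthy, and restart it if it does not, while leaving the 'sling' directory in place between attempts (the thread reports that a second start with the directory kept succeeds). This is only an illustration under assumptions, not a tested fix. The thread does not say how the abnormal startup can be detected, so the HTTP check, the health URL, the launch command, the timeouts, and all function names below are hypothetical.

```python
import subprocess
import time
import urllib.error
import urllib.request


def wait_until_healthy(is_healthy, timeout_s, poll_s):
    """Poll is_healthy() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_healthy():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def supervise(start, stop, is_healthy, max_attempts=3, timeout_s=300.0, poll_s=5.0):
    """Start the process, restarting it if it never becomes healthy.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. The 'sling' directory is deliberately NOT removed
    between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        handle = start()
        if wait_until_healthy(is_healthy, timeout_s, poll_s):
            return attempt
        stop(handle)
    return None


def http_ok(url):
    """Assumed health check: True if a GET on url returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def run_sling_supervised(cmd, health_url):
    """Hypothetical wiring: cmd and health_url depend on your deployment."""

    def stop(proc):
        proc.terminate()
        proc.wait()

    return supervise(lambda: subprocess.Popen(cmd), stop, lambda: http_ok(health_url))
```

It could be called as run_sling_supervised(["java", "-jar", "sling-starter.jar"], "http://localhost:8080/"), where both the jar name and the URL are placeholders. In k8s this only helps if the restart happens inside the same container, since a new pod would start again with an empty 'sling' directory.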