Hi Michael,
On Tue, Jan 22, 2019, at 12:32 AM, Michael Menge wrote:
> Hi Ellie
>
> Quoting ellie timoney :
>
> > Hi Michael,
> >
> > On Wed, Jan 16, 2019, at 10:12 PM, Michael Menge wrote:
> >> Hi,
> >>
> >> because conversations db seems to be required for search indexes, I
> >> enabled this option
> >> on our production servers today and tried to rebuild the conversations
> >> db for all users with
> >>
> >> ctl_conversationsdb -v -b UID
> >>
> >> For most users this did take less than a second. But for some users
> >> this process would
> >> not finish. I did kill one process after about 30 Minutes (most others
> >> after 3 Minutes).
> >> The UserID.conversations has grown over 2 GB (the mailbox itself has
> >> only ~700 MB of mails,
> >> and the conversations files from finnished rebuilds are less then 20 MB)
> >
> > So, a 2GB-and-growing conversations db for a 700MB mailbox is a
> > ratio of ~3:1, which seems obscene...
> >
>
> thanks for confirming that this is not expected.
>
> > For comparison, the Conversations.append_reply_200 cassandane test
> > (two messages with 100 replies each) produces an 87Kb conversations
> > db for a 1016Kb mailbox (ratio ~1:12, n.b inverted!), which makes
> > your ~3:1 and growing ratio seem even more obscene...
> >
> > Obviously you can't share the bad conversations db because
> > conversations db's are full of personally-identifying information.
> > But I wonder if looking through it with `cyr_dbtool ... show` turns
> > up any unusual patterns, especially in comparison to one of the good
> > users? I wonder if it's somehow repeating itself?
>
> I was unable to run "cyr_dbtool ... show" or "ctl_conversationsdb -d"
> while ctl_conversationsdb -v -b was still running
> (waiting for a lock on conversations db file)
>
> After aborting "ctl_conversationsdb -b" "cyr_dbtool show" didn't show any
> duplicates. But the string command showed more occurrences than cyr_dbtool
> did (i used an hash value (d8087b76083ec5e7) I found in the output of
> cyr_dbtool
>
>
> # cyr_dbtool /path-to/userid.conversations skiplist show | grep
> d8087b76083ec5e7 | wc -l
> 3
> # cyr_dbtool /path-to/userid.conversations skiplist show | grep
> Bd8087b76083ec5e7 | wc -l
> 1
> # strings /path-to/userid.conversations | grep d8087b76083ec5e7 | wc -l
> 183
> # strings /path-to/userid.conversations | grep Bd8087b76083ec5e7 | wc -l
> 91
>
> ctl_conversationsdb -d did a cleanup because I did abort the
> "ctl_conversationsdb -b" process
>
> skiplist recovery /path-to/userid.conversations: found partial txn,
> not replaying
> skiplist: recovered /path-to/userid.conversations (0 records, 144
> bytes) in 0 seconds
> skiplist: checkpointed /path-to/userid.conversations (0 records, 144
> bytes) in 0.010 sec
>
>
> I did run ctl_conversationsd -b with strace and discovered that the inbox
> (cyrus.index O_RDWR, cyrus.header O_RDONLY, cache cyrus.O_RDWR ) of that user
> was opened over and over again, no other folder for that user was accessed.
> For other users the inbox was only opened once and then the other
> folders followed.
> So not stopping "ctl_conversationsd -b", it would have run till my
> filesystem was
> full.
>
I guess something about that inbox is confusing it, and making it redo it over
and over? Very curious
> > Bron, does this sound like anything you've seen before, that maybe
> > got fixed on master but not backported to 3.0?
> >
> >> Cyrus Logs show many "ctl_conversationsdb[933]: mailbox: longlock
> >> user.UserID for 1.5 seconds"
> >> every few seconds.
> >
> > This message is logged when the lock is released, if that lock had
> > been held for >1s. It's reporting that it held the lock for longer
> > than it would like to have, but at least you know the lock has been
> > released. I think this is telling you that user-visible performance
> > on the mailbox would have been impacted while this was occurring
> > (because imapd/lmtpd wouldn't be able to access their mailbox while
> > ctl_conversationsdb was holding those locks), but also tells you
> > that ctl_conversationsdb was doing the right thing, and not just
> > holding a single lock for the entire job.
> >
>
> I am not sure if any other processes had a chance to acquire a lock.
If they'd tried, they would have succeeded (by blocking until the lock was
theirs, during the gaps where ctl_conversationsdb dropped and reacquired the
lock), but maybe nothing else wanted it.
> > Though, that raises the question: was the user accessing the mailbox
> > while the rebuild was in progress? I wonder if their client is
> > doing something funny and tripping things up?
>
> We did this while the server was accessible (didn't want to have
> several hour downtime for ~44000 users)
Fair enough!
> But as I did my recent testing on a test-server without user access,
> only ctl_conversationsdb and cyr_dbtool did accessed the mailbox during
> the conversation rebuild mentioned