chibenwa commented on pull request #886:
URL: https://github.com/apache/james-project/pull/886#issuecomment-1058845783


   Hello,
   
   Here is a status update on our ongoing testing work. We encountered several
   issues while testing the Netty 4 migration on our pre-production environment
   (@Arsnael is involved in it too).
   
   We have not yet reached the point where we can actually run our performance
   tests.
   
   ## Issue 1: Partially written SELECT QRESYNC responses
   
   When using the Evolution MUA (in debug mode it logs the full IMAP exchange,
   which is convenient) with QRESYNC enabled, an error is encountered for one
   mailbox of the account. Evolution complains that the response was truncated...
   
   Here is the exchange:
   
   ```
   [imapx:I] I/O: 'I00104 SELECT Spam (QRESYNC (203407991 88 2:37 (1,10,28 2,11,29)))'
   [imapx:I] I/O: '* FLAGS (\Answered \Deleted \Draft \Flagged \Seen Junk NonJunk $Forwarded)'
   [imapx:I] I/O: '* 36 EXISTS'
   [imapx:I] I/O: '* 0 RECENT'
   [imapx:I] I/O: '* OK [UIDVALIDITY 203407991] UIDs valid'
   [imapx:I] I/O: '* OK [UNSEEN 1] MailboxMessage 1 is first unseen'
   [imapx:I] I/O: '* OK [PERMANENTFLAGS (\Answered \Deleted \Draft \Flagged \Seen Junk NonJunk $Forwarded \*)] Limited'
   [imapx:I] I/O: '* OK [HIGHESTMODSEQ 124] Highest'
   [imapx:I] I/O: '* OK [UIDNEXT 38] Predicted next UID'
   [imapx:I] I/O: '* VANISHED (EARLIER) '
   [imapx:I] I/O: ''
   ```
   
   You can see that the `VANISHED (EARLIER)` response is cut short and that the
   tagged `OK` response never arrives.
   
   I tried to reproduce this locally, but working with QRESYNC is painful. We
   still need to run regression tests to find out whether this also happens on
   Netty 3...
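
A unit test for this could start from a small transcript check. The sketch below is a hypothetical helper (`ImapTranscript` is not a James class) that returns true only when the responses to a command consist of untagged lines followed by a tagged completion, so a truncated exchange like the one above gets flagged:

```java
import java.util.List;

/** Hypothetical test helper, not part of James: checks the shape of one command's responses. */
final class ImapTranscript {
    private ImapTranscript() {
    }

    /**
     * Returns true when the last line is the tagged completion ("<tag> OK|NO|BAD ...")
     * and every earlier line is an untagged ("* ...") or continuation ("+...") response.
     */
    static boolean endsWithTaggedCompletion(String tag, List<String> responseLines) {
        if (responseLines.isEmpty()) {
            return false;
        }
        String last = responseLines.get(responseLines.size() - 1);
        boolean completed = last.startsWith(tag + " OK")
            || last.startsWith(tag + " NO")
            || last.startsWith(tag + " BAD");
        boolean onlyUntaggedBefore = responseLines
            .subList(0, responseLines.size() - 1)
            .stream()
            .allMatch(line -> line.startsWith("* ") || line.startsWith("+"));
        return completed && onlyUntaggedBefore;
    }
}
```

The same check also rejects a transcript in which the tagged completion arrives before the untagged lines.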
   
   ## Issue 2: Thunderbird & IDLE not valid in this state
   
   To reproduce, we open Thunderbird and switch mailboxes quickly. After a few
   quick switches, Thunderbird complains that the mailbox can't be opened and
   says `server answered: IDLE command invalid in this state`. All subsequent IMAP
   requests fail, and Thunderbird needs to be restarted to get back to a clean
   state. It seems as if James logs the session out without closing the channel.
   It is unclear why this happens...
   
   ``The current operation in `inbox` did not succeed. The mail server for
   account name [user_mail] responded: IDLE failed. Command not valid in this
   state.``
   
   Environment: distributed James in a cloud setup (k8s hosted at OVH). We
   cannot reproduce it locally.
   
   We have not yet managed to get a traffic capture.
   
   I suspect concurrency issues: could one IMAP request be processed before the
   previous one completes, and thus not benefit from the state changes made by
   previous commands? To be honest this is still unclear to me... I spent a few
   hours trying to reproduce it by writing unit tests that send multiple commands
   at once, but failed to reproduce it...
   
   I have a blind bet about this one: this changeset added the `@Sharable`
   annotation to a couple of handlers, including the IMAP handler. Allowing
   concurrency on this handler might lead to incorrect handling in the context of
   a connected, stateful protocol. We still need to try such a change out...
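
To illustrate the concern, here is a minimal sketch in plain Java, with no Netty involved. `SharedImapHandler`, its methods, and the response strings are hypothetical and only model a single handler instance serving every connection: any state it needs has to be keyed by connection (in Netty 4 that would typically be a channel attribute obtained via `Channel.attr(AttributeKey)`), otherwise one session's logout could be observed by another session:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of one handler instance shared by all connections (not James code). */
final class SharedImapHandler {
    enum State { AUTHENTICATED, LOGOUT }

    // Session state is kept per connection id rather than in a plain instance field,
    // because the very same handler object is invoked for every connection.
    private final Map<String, State> stateByConnection = new ConcurrentHashMap<>();

    void onConnect(String connectionId) {
        stateByConnection.put(connectionId, State.AUTHENTICATED);
    }

    void onLogout(String connectionId) {
        stateByConnection.put(connectionId, State.LOGOUT);
    }

    /** Returns an illustrative response line for an IDLE command carrying the given tag. */
    String onIdle(String connectionId, String tag) {
        State state = stateByConnection.get(connectionId);
        if (state != State.AUTHENTICATED) {
            return tag + " NO IDLE failed. Command not valid in this state.";
        }
        return "+ Idling";
    }
}
```

With this shape, logging out connection A leaves connection B able to IDLE. If the state were a single field on the shared handler, B would start failing as soon as A logged out.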
   
   ## Issue 3: Gatling list
   
   Our performance tests list mailboxes, append a few messages, select a
   mailbox, then fetch a few messages. However, the listing always fails:
   Gatling cannot find an INBOX. We reliably reproduce this running Gatling and
   James locally.
   
   ```
   ================================================================================
   2022-03-01 04:53:35                                         505s elapsed
   ---- Requests ------------------------------------------------------------------
   > Global                                                   (OK=26316  KO=8768  )
   > Connect                                                  (OK=2516   KO=0     )
   > login                                                    (OK=2516   KO=0     )
   > heavyScenario / append                                   (OK=3761   KO=0     )
   > lightScenario / list                                     (OK=0      KO=1252  )
   > heavyScenario / list                                     (OK=0      KO=7516  )
   > lightScenario / select                                   (OK=1251   KO=0     )
   > heavyScenario / select                                   (OK=7514   KO=0     )
   > heavyScenario / fetch                                    (OK=7508   KO=0     )
   > lightScenario / fetch                                    (OK=1250   KO=0     )
   ---- Errors --------------------------------------------------------------------
   > Unable to find folder 'INBOX' in                                 8768 (100.0%)

   ---- ImapPlatformValidation ----------------------------------------------------
   [---------------------------------------------------------------           ]  0%
             waiting: 475    / active: 2525   / done: 0
   ================================================================================
   ```
   
   Cf. https://github.com/linagora/james-gatling/blob/master/src/it/scala-2.12/org/apache/james/gatling/imap/PlatformValidationIT.scala
   
   Using telnet, it seems we are getting "out of order" responses on the wire:
   
   ```
   A1 list "" "*"
   A1 OK LIST completed.
   * LIST (\HasNoChildren) "." "INBOX"
   * LIST (\HasNoChildren) "." "Sent"
   * LIST (\HasNoChildren) "." "Trash"
   ```
   
   Obviously the OK should come after the untagged responses...
   
   Would that mean we need to "await" or "chain" channel futures when we write 
responses to ensure a correct order?
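
As a sketch of what "chaining" could look like, the snippet below uses plain `CompletableFuture` as a stand-in for Netty's `ChannelFuture`. `OrderedWriter` and its `sink` function are hypothetical names, not James or Netty APIs; the point is only that each write starts after the previous one has completed:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical helper: serializes asynchronous writes so they complete in submission order. */
final class OrderedWriter {
    // Stand-in for an asynchronous channel write that reports completion through a future.
    private final Function<String, CompletableFuture<Void>> sink;
    // Completion of the most recently submitted write; the next write is chained onto it.
    private CompletableFuture<Void> tail = CompletableFuture.completedFuture(null);

    OrderedWriter(Function<String, CompletableFuture<Void>> sink) {
        this.sink = sink;
    }

    /** Starts writing {@code line} only once every previously submitted write has completed. */
    synchronized CompletableFuture<Void> write(String line) {
        tail = tail.thenCompose(ignored -> sink.apply(line));
        return tail;
    }
}
```

Whether this is needed at all is still an open question: it assumes the reordering comes from writes that complete independently of each other, which has not been confirmed. Note also that in this sketch a failed write fails every write chained after it.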
   
   ## Issue 4: Blocking on the Netty I/O event loop (suspicion)
   
   I suspect we currently run everything straight on the I/O event loop (woken
   up when channels receive reads/writes). I suspect we might have no choice but
   to run handlers performing DB queries (and likely also synchronous channel
   reads/writes) on a separate thread pool, as was done before. The local
   benchmarks conducted so far would fail to detect the impact of "blocking on
   the event loop" because the concurrency level is low (fewer than 8 concurrent
   connections). I wonder how the current setup will behave at high concurrency
   levels (hundreds of concurrent connections).
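
The offloading idea can be sketched with plain `java.util.concurrent` executors standing in for the Netty event loop and the worker pool. `OffloadingHandler` is a hypothetical name, and the sketch assumes the blocking step (for example a DB query) can be expressed as a `Supplier`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: keeps blocking work off a single-threaded "event loop". */
final class OffloadingHandler {
    private final Executor eventLoop;    // stand-in for the I/O thread of a connection
    private final Executor blockingPool; // stand-in for a dedicated worker pool

    OffloadingHandler(Executor eventLoop, Executor blockingPool) {
        this.eventLoop = eventLoop;
        this.blockingPool = blockingPool;
    }

    /** Runs the blocking lookup on the worker pool, then builds the response on the event loop. */
    <T> CompletableFuture<String> handle(Supplier<T> blockingLookup, Function<T, String> toResponse) {
        return CompletableFuture
            .supplyAsync(blockingLookup, blockingPool)
            .thenApplyAsync(toResponse, eventLoop);
    }
}
```

The event loop thread is only used for the short response-building step, so a slow lookup on one connection does not stall the others that share the same loop.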

   ## Actions
   
    - [ ] Evaluate whether the QRESYNC issue is a regression compared to Netty 3
    - [ ] Write unit tests reproducing the QRESYNC issue (hard)
    - [ ] Capture what James thinks it sends for the QRESYNC issue?
    - [ ] Write a unit test reproducing the LIST ordering issue
    - [ ] Check whether awaiting after writes solves the LIST response ordering
      issue
    - [ ] Try the `@Sharable` change for the IMAP handler and see whether it
      impacts the Thunderbird IDLE issue
    
   Apparently, this changeset might keep us busy for quite some time yet...
   
   Regards,
   
   Benoit
   


