Re: [PR] [AMQ-8398] Fix conversion of 4-byte unicode characters between JMS and STOMP [activemq]

via GitHub Sun, 25 Aug 2024 22:34:01 -0700


azotcsit commented on PR #1284:
URL: https://github.com/apache/activemq/pull/1284#issuecomment-2309353953


   Thanks to everyone for the comments! They were very helpful!
   
   @cshannon 
   > So something needs fixing but I was just curious if using the built in JDK 
support might be a better option for maintainability going forward.
   
   Thanks for pointing this out! I checked 
https://issues.apache.org/jira/browse/AMQ-8122 ticket; I totally agree with the 
points mentioned in this ticket. So I re-wrote the existing test and split it 
to separate scenarios for 1,2,3,4 bytes characters.
   
   > I'm open to either using the changes here or the JDK implementation if we 
can get that to work (like I quickly added a test in my branch 
https://github.com/cshannon/activemq/commit/e68021d33bbc1396d7e72083ee17fd00b620e374
 , although it's not quite workign all the way as i noted). The main thing is 
making sure it's correct and we don't introduce a bug
   
   I totally support moving to standard Java utilities/implementation. The only 
reason I had not done that initially - I thought that we wanted to minimize 
amount of changes. But since community seems to be open to getting rid of the 
custom implementation, I'd be glad to migrate to the standard implementation.
   
   @tabish121 
   > We do something similar in the AMQP codec in proton-j for this same reason.
   
   Could you please share a link to the corresponding codebase, so I can 
explore it. I'm not sure it is necessary, but I might find something that could 
be missed in this PR.
   
   > which requires a buffer sized to the expected output encoding size so to 
allow the openwire encoder to not have to guess (possibly more than once) the 
code was created.
   
   I had not found a standard method for calculating the encoded size, so I 
kept the custom implementation for that. The rest, I moved to standard methods. 
It should be still compatible with all existing protocols.
   
   @cshannon
   
   > I tried testing writing a string with 4 byte unicode (the same one in the 
unit test in this PR) with the old/existing encoder and then decoding it using 
the JDK UTF-8 decoder in my branch and it breaks. What's interesting is things 
decode fine with the modified method in this PR to decode. 
   
   It is expected for the JDK UTF-8 decoder to fail as the existing encoder 
does not support 4-bytes characters. But the modified logic in the first commit 
should also fail to read data written by the existing encoder. Could you please 
double check this? I believe it did not work for me.
   
   > What's even more interesting to me is that even though the existnig code 
is only supposed to handle 3 byte unicode, i tried testing writing/reading the 
4 byte unicode string and it worked (which I guess there's a bug or some other 
special handling in the custom code)
   
   Yes, this is tricky! If the old/existing logic is used for both writing and 
reading, then it works even for 4-byte characters. In fact, 3-bytes logic 
serializes 4-bytes data incorrectly, but since it reads data back (and expects 
incorrect bytes), it works.
   
   > So I'm curious what the big differences are between the JDK version and 
the modified version in this PR. 
   
   Ideally, there should be no difference. In fact, I changed custom logic to 
the standard JDK methods and everything seems to be working the same way.
   
   > It's critical to make sure things will be able communicate across versions 
if sending 4 byte unicode. For example, any old messages in a data store would 
need to be able to be decoded still after upgrade and another example could be 
old clients with the old encoding logic sending messages with 4 byte unicode to 
a broker that has been updated.
   
   This is a very great point! I think the current changes are not 
backward-compatible. I'll give a few rounds of testing to 
backward-compatibility to make sure there is a problem. Most probably, it will 
require to develop some additional logic for that. Could you please point me to 
some code where there is some compatibility-related logic. I'd like to 
understand how this type of problems are handled currently.
   
   ---
   
   Summary for the second commit:
   * replaced custom serialization/deserialization logic with standard JDK 
methods
   * improved `DataByteArrayInputStreamTest.testNonAscii()` as per 
https://issues.apache.org/jira/browse/AMQ-8122
   
   Next steps:
   * test current logic for compatibility between JMS and STOMP protocols
   * test compatibility for old client connected to new server
   * test compatibility for new client connected to old server
   * test compatibility for persisted messages after upgrade
   * further development based on testing results and PR comments
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@activemq.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@activemq.apache.org
For additional commands, e-mail: gitbox-h...@activemq.apache.org
For further information, visit: https://activemq.apache.org/contact

Re: [PR] [AMQ-8398] Fix conversion of 4-byte unicode characters between JMS and STOMP [activemq]

Reply via email to