Hello all,

I've been doing some boundary testing for my implementation of DERBY-728, and I've found something that I'd like to discuss here on the list.
Right now in embedded mode we have support for all kinds of characters. In this mode, the database name length limit is 255 under Windows - as this is an OS limitation. I'm not sure about the behavior on other OSes, but what I've noticed is that this limit is applied at the character level. I'm not sure if Derby even applies a limit of its own in embedded mode, since we're capped at 255 by Windows. This means that in embedded mode, I can have a database name composed of 255 'ç' characters.

Still, the 'ç' character takes up 2 bytes in UTF-8, and when we move to client/server mode, the 255 length limit is applied to bytes and not characters (as specified by the DRDA specs and the ACR 7007). In practice, we will now have a discrepancy in name length limits.

Until now, we had a 255 character limit in both modes. In embedded mode we only care about characters, and in client/server mode, since everything was ASCII (or rather, EBCDIC), 1 character equalled 1 byte, which meant that the limit was the same in both cases. However, with the new CCSID manager, which allows UTF-8 characters in client/server mode, things will change slightly. The 255 byte limit still applies, as it is defined by the DRDA protocol, but characters may now take more than 1 byte.

I said "may" because it really is "may" - in UTF-8, the length in bytes of each character is variable. The normal ASCII characters still take just 1 byte to encode, accented Latin characters take 2 bytes, most Chinese characters take 3 bytes, and characters outside the Basic Multilingual Plane take 4 bytes.

What this all means is that there is no single limit in characters that we can "advertise" as a cap for the dbname. Until now we could say that Derby imposes a 255 character limit on database names under client/server, but from now on the limit in characters will vary with the characters used.
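To make the character/byte distinction concrete, here is a minimal sketch of how the two lengths can be compared. This is not code from my patch: the class name DbNameLimit and the helpers utf8Length, charLength, fitsInLimit and repeat are made up for illustration, and the only value taken from the discussion above is the 255 byte cap. I count Unicode code points rather than Java chars, since a Java char is a UTF-16 code unit and a 4-byte UTF-8 character needs two of them.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names here are hypothetical, not Derby APIs.
final class DbNameLimit {
    // The 255 byte cap on the database name in client/server mode.
    static final int MAX_DBNAME_BYTES = 255;

    // Number of bytes the name occupies once encoded as UTF-8.
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of characters, counted as Unicode code points.
    static int charLength(String name) {
        return name.codePointCount(0, name.length());
    }

    // True if the UTF-8 encoding of the name fits in the byte cap.
    static boolean fitsInLimit(String name) {
        return utf8Length(name) <= MAX_DBNAME_BYTES;
    }

    // Small helper so the sketch does not depend on String.repeat (Java 11+).
    static String repeat(String s, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",            // ASCII letter
            "\u00e7",       // c with cedilla
            "\u4e2d",       // a common Chinese character
            "\uD83D\uDE00"  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.println(charLength(s) + " char -> " + utf8Length(s) + " byte(s)");
        }

        // The embedded-mode example: 255 cedillas.
        String name = repeat("\u00e7", 255);
        System.out.println(charLength(name) + " chars, " + utf8Length(name)
                + " bytes, fits: " + fitsInLimit(name));
    }
}
```

With this, the name made of 255 'ç' characters comes out as 255 characters but 510 bytes, so it is fine in embedded mode and over the limit in client/server mode.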
If we use ONLY characters like these 'áèç', then the limit will actually be 127 characters (2 * 127 = 254 bytes, and we can't fit another 2-byte character without overrunning the limit). But we can also use, for example, 249 ASCII characters and 2 Chinese characters, which is a total of 251 characters (but 249 + 2 * 3 = 255 bytes, thus reaching the limit).

Is this acceptable behavior? Or would it be preferable to impose a stricter limit where we assume that all characters take 4 bytes (worst case scenario in UTF-8) and **always** cap the dbname length at 63 characters (255 bytes / 4 bytes, rounded down)? This would mean more work for my implementation and possibly an exclusion from 10.7.

On the other hand, if we have this variable-length limit depending on the type of characters used, we should probably have some sort of release note alerting people to this fact.

Just wanted to get some thoughts and opinions on this...

Thanks,
Tiago
