[ https://issues.apache.org/jira/browse/CASSANDRA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-7209:
----------------------------------------
    Attachment: 0002-Rename-column_names-types-to-field_names-types.txt
                0001-7209.txt

Patch attached for this. The "new" format simply uses 4 bytes for value sizes instead of 2 and drops the EOC byte. It's basically what makes sense for the native protocol given the other encodings.

The patch does a tiny bit of renaming too (columnNames->fieldNames and types->fieldTypes) because it's cleaner that way I think. I include a 2nd patch that also renames column_names/types to field_names/types in the schema table while at it, for coherence with the code.

> Consider changing UDT serialization format before 2.1 release.
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-7209
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7209
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 2.1 rc1
>
>         Attachments: 0001-7209.txt,
>                      0002-Rename-column_names-types-to-field_names-types.txt
>
>
> The current serialization format of UDT is the one of CompositeType. This was
> initially done on purpose, so that users that were using CompositeType for
> values in their thrift schema could migrate smoothly to UDT (it was also
> convenient code wise but that's a weak point).
> I'm having serious doubts about this being wise however, for 2 reasons:
> * for each component, CompositeType stores an additional byte (the
> end-of-component) for reasons that only pertain to querying. This byte is
> basically wasted for UDT and makes no sense. I'll note that outside the
> inefficiency, there is also the fact that it will likely be pretty
> surprising/error-prone for driver authors.
> * it uses an unsigned short for the length of each component.
> While it's certainly not advisable in the current implementation to use
> values too big inside a UDT, having this limitation hard-coded in the
> serialization format is wrong, and we've been bitten by this with
> collections already, which we had to fix in protocol v3. It's probably
> worth not making that mistake again. Furthermore, if we use an int for the
> size, we can use a negative size to represent a null value (the main point
> being that it's consistent with how we serialize values in the native
> protocol), which can be useful (CASSANDRA-7206).
> Of course, if we change that serialization format, we'd better do it before
> the 2.1 release. But I think the advantages outweigh the cons, especially in
> the long run, so I think we should do it. I'll try to work out a patch
> quickly, so if you have a problem with the principle of this issue, it would
> be nice to voice it quickly.
> I'll note that doing that change will mean existing CompositeType values
> won't be able to be migrated transparently to UDT. I think this was anecdotal
> in the first place at best; I don't think using CompositeType for values is
> that popular in thrift tbh. Besides, if we really really want to, it might
> not be too hard to re-introduce that compatibility later by having some
> protocol level trick. We can't change the serialization format without
> breaking people, however.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
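
For readers who want to see the difference at the byte level, here is a rough sketch of the two layouts discussed above. It is not taken from the attached patches. The function names are hypothetical, and the sketch assumes big-endian sizes, an EOC byte of 0 in the CompositeType-style layout, and -1 as the negative size standing for null.

```python
import struct
from typing import List, Optional


def serialize_udt_old(fields: List[bytes]) -> bytes:
    """CompositeType-style layout (illustrative only):
    [2-byte unsigned length][value][1-byte EOC] for each field."""
    out = bytearray()
    for value in fields:
        if len(value) > 0xFFFF:
            # An unsigned short cannot describe anything longer than this.
            raise ValueError("field exceeds the 65535-byte limit")
        # Assumption: EOC is written as 0; it carries no meaning for a UDT.
        out += struct.pack(">H", len(value)) + value + b"\x00"
    return bytes(out)


def serialize_udt_new(fields: List[Optional[bytes]]) -> bytes:
    """Proposed layout (illustrative only):
    [4-byte signed length][value] for each field, no EOC byte.
    Assumption: a null field is written as size -1 with no value bytes."""
    out = bytearray()
    for value in fields:
        if value is None:
            out += struct.pack(">i", -1)
        else:
            out += struct.pack(">i", len(value)) + value
    return bytes(out)


def deserialize_udt_new(data: bytes) -> List[Optional[bytes]]:
    """Read back the fields written by serialize_udt_new."""
    fields: List[Optional[bytes]] = []
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from(">i", data, offset)
        offset += 4
        if size < 0:
            # Any negative size is treated as null.
            fields.append(None)
            continue
        if offset + size > len(data):
            raise ValueError("truncated field value")
        fields.append(data[offset:offset + size])
        offset += size
    return fields
```

With these helpers, serialize_udt_new([b"ab", None]) writes a 4-byte size, the two value bytes, and then a bare -1 size for the null field, whereas the old layout has no way to tell a null field from an empty one.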