[
https://issues.apache.org/jira/browse/CASSANDRA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sylvain Lebresne updated CASSANDRA-7209:
----------------------------------------
Attachment: 0002-Rename-column_names-types-to-field_names-types.txt
0001-7209.txt
Patch attached for this. The "new" format simply uses 4 bytes for value sizes
instead of 2 and drops the EOC byte. It's basically what makes sense for the
native protocol given the other encodings. The patch also does a tiny bit of
renaming (columnNames->fieldNames and types->fieldTypes) because I think it's
cleaner that way. I've included a 2nd patch that also renames
column_names/types to field_names/types in the schema table while at it, for
coherence with the code.
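
To make the layout concrete, here is a minimal Python sketch of the format described above: each field value is written as a 4-byte signed big-endian length followed by that many bytes, with no trailing EOC byte. Treating a negative length as null follows the native protocol's encoding of values. The function names are hypothetical and this is not the code from the attached patch.

```python
import struct

def serialize_udt(fields):
    """Encode field values (bytes or None) as [int32 length][bytes].

    A negative length (-1 here) marks a null field, as the native
    protocol does for values. Hypothetical helper, not the patch's code.
    """
    out = bytearray()
    for value in fields:
        if value is None:
            out += struct.pack(">i", -1)
        else:
            out += struct.pack(">i", len(value))
            out += value
    return bytes(out)

def deserialize_udt(data):
    """Decode a buffer produced by serialize_udt back into a list."""
    fields = []
    pos = 0
    while pos < len(data):
        (size,) = struct.unpack_from(">i", data, pos)
        pos += 4
        if size < 0:
            fields.append(None)  # negative size means null
        else:
            if pos + size > len(data):
                raise ValueError("truncated field value")
            fields.append(data[pos:pos + size])
            pos += size
    return fields
```

Note that a null field and an empty value stay distinguishable: the former has length -1, the latter length 0.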
> Consider changing UDT serialization format before 2.1 release.
> --------------------------------------------------------------
>
> Key: CASSANDRA-7209
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7209
> Project: Cassandra
> Issue Type: Bug
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Fix For: 2.1 rc1
>
> Attachments: 0001-7209.txt,
> 0002-Rename-column_names-types-to-field_names-types.txt
>
>
> The current serialization format of UDT is the one of CompositeType. This was
> initially done on purpose, so that users who were using CompositeType for
> values in their thrift schema could migrate smoothly to UDT (it was also
> convenient code-wise, but that's a weak point).
> I have serious doubts about this being wise, however, for 2 reasons:
> * for each component, CompositeType stores an additional byte (the
> end-of-component) for reasons that only pertain to querying. This byte is
> basically wasted for UDT and makes no sense. I'll note that beyond the
> inefficiency, there is also the fact that it will likely be pretty
> surprising/error-prone for driver authors.
> * it uses an unsigned short for the length of each component. While it's
> certainly not advisable in the current implementation to use values that are
> too big inside a UDT, having this limitation hard-coded in the serialization
> format is wrong, and we've been bitten by this with collections already,
> which we had to fix in protocol v3. It's probably worth not making that
> mistake again. Furthermore, if we use an int for the size, we can use a
> negative size to represent a null value (the main point being that it's
> consistent with how we serialize values in the native protocol), which can
> be useful (CASSANDRA-7206).
> Of course, if we change that serialization format, we'd better do it before
> the 2.1 release. But I think the advantages outweigh the cons especially in
> the long run so I think we should do it. I'll try to work out a patch quickly
> so if you have a problem with the principle of this issue, it would be nice
> to voice it quickly.
> I'll note that making that change will mean existing CompositeType values
> can no longer be migrated transparently to UDT. I think this was anecdotal
> at best in the first place; I don't think using CompositeType for values is
> that popular in thrift, tbh. Besides, if we really, really want to, it might
> not be too hard to re-introduce that compatibility later with some
> protocol-level trick. Once 2.1 is released, however, we can't change the
> serialization format without breaking people.
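
For comparison, the CompositeType-style layout the description argues against can be sketched as follows: each component is a 2-byte unsigned length, the value bytes, then one end-of-component byte. This is an illustrative Python sketch only; the function name and the `END_OF_COMPONENT` constant are hypothetical, and the EOC value of 0 is assumed for plain (non-query) values.

```python
import struct

END_OF_COMPONENT = 0  # assumed EOC value for plain values; hypothetical name

def serialize_composite_style(fields):
    """Encode values as [uint16 length][bytes][EOC byte] per component.

    Illustrates the two drawbacks raised in the description: the extra
    trailing byte per component and the hard 65535-byte limit on values.
    """
    out = bytearray()
    for value in fields:
        # ">H" is an unsigned 16-bit big-endian integer, so struct.pack
        # raises struct.error for any value longer than 65535 bytes.
        out += struct.pack(">H", len(value))
        out += value
        out.append(END_OF_COMPONENT)
    return bytes(out)
```

With this layout there is also no length left over to signal a null field, which is what a signed 4-byte size makes possible.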