[
https://issues.apache.org/jira/browse/CASSANDRA-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949150#comment-13949150
]
Benedict commented on CASSANDRA-6936:
-------------------------------------
bq. Or I guess we could have some conversion of representation when
receiving/sending values
I would settle for conversion when reading/writing from disk for these, but at
send/receive would be best, so that we can benefit from the changes in memory
as well. But our on-disk indexing is currently quite lacking, and improving
that would be a tremendous help by itself.
bq. I don't see an easy way to have a bytes comparable representation of say
IntegerType (since it's variable length)
[http://www.dlugosz.com/ZIP2/VLI.html] looks to be one pretty simple such
encoding, but there are others.
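To make the idea concrete, here is a minimal sketch of one byte-comparable encoding for arbitrary-precision integers. It is not the VLI scheme from the link above, and it is not anything that exists in the codebase; the class and method names are hypothetical. It assumes the value fits in at most 127 bytes of two's complement, and prefixes the minimal big-endian two's complement bytes with a single header byte carrying sign and length, so that plain unsigned lexicographic comparison agrees with numeric order.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical sketch: header byte (sign + length) followed by the minimal
// big-endian two's complement bytes of the value.
final class ByteComparableVarInt {

    /** Encodes v so that unsigned lexicographic byte order matches numeric order. */
    static byte[] encode(BigInteger v) {
        byte[] body = v.toByteArray();   // minimal two's complement, at least 1 byte
        int n = body.length;
        if (n > 127) {
            throw new IllegalArgumentException("value too large for this sketch");
        }
        byte[] out = new byte[n + 1];
        // Non-negative values get headers 0x81..0xFF (longer means larger);
        // negative values get headers 0x01..0x7F (longer means smaller).
        out[0] = (byte) (v.signum() < 0 ? 0x80 - n : 0x80 + n);
        System.arraycopy(body, 0, out, 1, n);
        return out;
    }

    /** Lossless inverse of encode: drop the header and reread the body. */
    static BigInteger decode(byte[] encoded) {
        return new BigInteger(Arrays.copyOfRange(encoded, 1, encoded.length));
    }

    /** Unsigned lexicographic comparison, as a trie or memcmp-style index would do. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Within one sign and one length, two's complement bytes already sort correctly as unsigned bytes, so the header only has to order the sign and length classes. The header also determines the total length, which means no encoded value is a prefix of another.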
bq. there is the custom types
This is more of an issue. DecimalType is also tricky (though still achievable
I'm sure). It _may_ be that we have a slow fallback for those types we decide
are too problematic to convert, but it would be good to aim for a situation
where we can have a fast route, and where we can make on-disk optimisations. In
an ideal world, though, we would simply not support indexing
(clustering/naming) on fields that can't be given this property (which is
probably very few, and probably not a major limitation).
bq. I'm rather uncomfortable with doing complex bit manipulations of the user
data... And since we do return that representation to the user, it's not like
we can change it to whatever suits us
I'm not sure of your rationale for this. It seems an arbitrary distinction from
all of the other complex things we do to user data. All we do is shuffle
around/encode/wrap user data. This is exactly the kind of thing a database is
supposed to do to make the user's life easier, and in this event _we chose_ the
encoding, so the user has no specific attachment to it. We could easily create
new types that require no conversion, and encourage users to switch for
safety/efficiency, but so long as any conversion is lossless, it shouldn't be a
problem.
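As an illustration of what a lossless conversion at the send/receive boundary could look like, here is a rough sketch for a 32-bit integer. It assumes the user-visible form is the usual 4-byte big-endian two's complement value; the class and method names are hypothetical. Flipping the sign bit makes the bytes sort correctly as unsigned bytes, and flipping it again restores the original exactly.

```java
// Hypothetical sketch: converts between a signed int and a 4-byte form whose
// unsigned byte order matches the numeric order of the int.
final class Int32ByteComparable {

    /** Flips the sign bit and writes the result big-endian. */
    static byte[] toComparable(int v) {
        int flipped = v ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Exact inverse of toComparable, so the conversion loses nothing. */
    static int fromComparable(byte[] b) {
        int flipped = ((b[0] & 0xFF) << 24)
                    | ((b[1] & 0xFF) << 16)
                    | ((b[2] & 0xFF) << 8)
                    |  (b[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    /** Unsigned comparison of two 4-byte encodings. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }
}
```

The user would only ever see the result of fromComparable, so the internal form could change without affecting what is returned.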
Investigating this has raised another related issue, which is that I only now
realised we store a 4-byte length for every single value. This seems immensely
wasteful, and at the same time as any of these changes we should push this
logic into AbstractType, so that those that are fixed length, or only need a
short length, or can otherwise encode their length, can decide for themselves
what size length to write.
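Roughly what I have in mind, as a sketch only: the type decides how (or whether) to write a length prefix. The names below are hypothetical and this is not the existing AbstractType API, just an illustration of the shape.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: each type chooses its own length prefix instead of the
// serializer always writing a 4-byte length.
abstract class LengthAwareType {

    /** Writes whatever length prefix this type needs, possibly nothing. */
    abstract void writeLength(int length, ByteArrayOutputStream out);

    /** Serializes the value as the type's length prefix followed by the raw bytes. */
    final byte[] serialize(byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLength(value.length, out);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    /** Fixed 8-byte values (a long or timestamp, say): no prefix at all. */
    static final LengthAwareType FIXED_8 = new LengthAwareType() {
        @Override
        void writeLength(int length, ByteArrayOutputStream out) {
            if (length != 8) {
                throw new IllegalArgumentException("expected exactly 8 bytes");
            }
        }
    };

    /** Values known to be under 64KiB: a 2-byte big-endian prefix. */
    static final LengthAwareType SHORT_LENGTH = new LengthAwareType() {
        @Override
        void writeLength(int length, ByteArrayOutputStream out) {
            if (length > 0xFFFF) {
                throw new IllegalArgumentException("value too long for a 2-byte length");
            }
            out.write(length >>> 8);
            out.write(length);
        }
    };

    /** Fallback matching today's behaviour: a 4-byte big-endian prefix. */
    static final LengthAwareType INT_LENGTH = new LengthAwareType() {
        @Override
        void writeLength(int length, ByteArrayOutputStream out) {
            out.write(length >>> 24);
            out.write(length >>> 16);
            out.write(length >>> 8);
            out.write(length);
        }
    };
}
```

For an 8-byte value, the fixed-length type writes 8 bytes, the short-length type 10, and the fallback 12, so a fixed-width column would save a third of its serialized size in this sketch.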
> Make all byte representations of types comparable by their unsigned byte
> representation only
> --------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-6936
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6936
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Labels: performance
> Fix For: 3.0
>
>
> This could be a painful change, but is necessary for implementing a
> trie-based index, and settling for less would be suboptimal; it also should
> make comparisons cheaper all-round, and since comparison operations are
> pretty much the majority of C*'s business, this should be easily felt (see
> CASSANDRA-6553 and CASSANDRA-6934 for an example of some minor changes with
> major performance impacts). No copying/special casing/slicing should mean
> fewer opportunities to introduce performance regressions as well.
> Since I have slated for 3.0 a lot of non-backwards-compatible sstable
> changes, hopefully this shouldn't be too much more of a burden.
--
This message was sent by Atlassian JIRA
(v6.2#6252)