[ 
https://issues.apache.org/jira/browse/CASSANDRA-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949150#comment-13949150
 ] 

Benedict commented on CASSANDRA-6936:
-------------------------------------

bq. Or I guess we could have some conversion of representation when 
receiving/sending values

I would settle for conversion when reading/writing from disk for these, but 
converting at send/receive would be best, so that we can benefit from the 
changes in memory as well. That said, our on-disk indexing is currently quite 
lacking, and improving that alone would be a tremendous help.

bq. I don't see an easy way to have a bytes comparable representation of say 
IntegerType (since it's variable length)

[http://www.dlugosz.com/ZIP2/VLI.html] looks to be one pretty simple such 
encoding, but there are others.
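
To make the idea concrete, here is a minimal sketch of one way a variable-length integer could be made comparable by unsigned bytes alone. It is not the encoding from the link above and not anything that exists in Cassandra; the class name, the method names and the 127-byte magnitude limit are all assumptions made for illustration. The first byte carries sign and length, so longer positive values sort higher and longer negative values sort lower, and the magnitude bytes of negative values are inverted.

```java
import java.math.BigInteger;

// Hypothetical sketch: not Cassandra's IntegerType, and not the linked VLI format.
final class VarIntComparableSketch {

    // Encode so that unsigned lexicographic byte order equals numeric order.
    static byte[] encode(BigInteger v) {
        byte[] mag = v.abs().toByteArray();
        int start = 0;
        while (start < mag.length && mag[start] == 0) start++; // strip leading zero bytes
        int n = mag.length - start;                            // minimal magnitude length
        if (n > 127) throw new IllegalArgumentException("magnitude too large for this sketch");
        boolean negative = v.signum() < 0;
        byte[] out = new byte[1 + n];
        // Header: 0x80 + n for zero/positive, 0x7F - n for negative.
        out[0] = (byte) (negative ? 0x7F - n : 0x80 + n);
        for (int i = 0; i < n; i++) {
            byte b = mag[start + i];
            out[1 + i] = negative ? (byte) ~b : b;             // invert bytes of negatives
        }
        return out;
    }

    // Inverse of encode, showing the conversion is lossless.
    static BigInteger decode(byte[] enc) {
        int header = enc[0] & 0xFF;
        boolean negative = header < 0x80;
        int n = negative ? 0x7F - header : header - 0x80;
        byte[] mag = new byte[n];
        for (int i = 0; i < n; i++)
            mag[i] = negative ? (byte) ~enc[1 + i] : enc[1 + i];
        BigInteger m = new BigInteger(1, mag);
        return negative ? m.negate() : m;
    }

    // Plain unsigned byte-wise comparison, the only comparator the index would need.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }
}
```

With this layout two values with different headers are ordered by the first byte, and two values with the same header have the same length, so the remaining bytes compare directly.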

bq.  there is the custom types

This is more of an issue. DecimalType is also tricky (though still achievable 
I'm sure). It _may_ be that we keep a slow fallback for those types we decide 
are too problematic to convert, but it would be good to aim for a situation 
where we can have a fast route, and where we can make on-disk optimisations. In 
an ideal world, though, we would simply not support indexing 
(clustering/naming) on fields that can't be given this property (there are 
probably very few of these, so it is probably not a major limitation).

bq.  I'm rather uncomfortable with doing complex bit manipulations of the user 
data... And since we do return that representation to the user, it's not like 
we can change it to whatever suits us

I'm not sure of your rationale for this. It seems an arbitrary distinction 
from all of the other complex things we do to user data. All we do is shuffle 
around/encode/wrap user data. This is exactly the kind of thing a database is 
supposed to do to make the user's life easier, and in this case _we chose_ the 
encoding, so the user has no specific attachment to it. We could easily create 
new types that require no conversion, and encourage users to switch for 
safety/efficiency, but so long as any conversion is lossless, it shouldn't be a 
problem. 
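
As an illustration of how small and lossless such a conversion can be, here is a sketch for a fixed-width signed 32-bit integer: flipping the sign bit of the big-endian two's-complement form makes unsigned byte order match numeric order, and flipping it again restores the original value. The class and method names are hypothetical, not existing Cassandra code.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a lossless, byte-comparable form for a signed int.
final class SignFlipSketch {

    // Big-endian bytes with the sign bit flipped: negatives now sort below positives.
    static byte[] toComparable(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    // Flipping the sign bit back recovers the exact original value.
    static int fromComparable(byte[] b) {
        return ByteBuffer.wrap(b).getInt() ^ 0x80000000;
    }

    // Plain unsigned byte-wise comparison.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }
}
```

The value returned to the user would be the result of `fromComparable`, so the stored form never needs to be visible outside the database.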

Investigating this has raised a related issue: I only now realised we store a 
4-byte length for every single value. This seems immensely wasteful. Alongside 
any of these changes we should push the length logic into AbstractType, so that 
types that are fixed length, only need a short length, or can otherwise encode 
their own length can decide for themselves what size length to write.
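
A rough sketch of what letting the type choose its length prefix might look like follows. Everything here is an assumption for illustration: the `LengthKind` enum and the `write`/`read` helpers are hypothetical and do not correspond to any existing AbstractType method.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the type, not the serializer, decides how length is written.
final class LengthPrefixSketch {

    // FIXED: no prefix at all; SHORT: 2-byte prefix; INT: the current 4-byte prefix.
    enum LengthKind { FIXED, SHORT, INT }

    static ByteBuffer write(LengthKind kind, byte[] value) {
        int header = kind == LengthKind.FIXED ? 0 : kind == LengthKind.SHORT ? 2 : 4;
        ByteBuffer out = ByteBuffer.allocate(header + value.length);
        switch (kind) {
            case SHORT:
                if (value.length > 0xFFFF)
                    throw new IllegalArgumentException("value too long for a 2-byte length");
                out.putShort((short) value.length);
                break;
            case INT:
                out.putInt(value.length);
                break;
            default:
                break; // FIXED: the reader already knows the length
        }
        out.put(value);
        out.flip();
        return out;
    }

    // fixedLength is only consulted for FIXED types.
    static byte[] read(LengthKind kind, int fixedLength, ByteBuffer in) {
        int len;
        switch (kind) {
            case FIXED: len = fixedLength; break;
            case SHORT: len = in.getShort() & 0xFFFF; break;
            default:    len = in.getInt(); break;
        }
        byte[] value = new byte[len];
        in.get(value);
        return value;
    }
}
```

Under these assumptions an 8-byte fixed-length value takes 8 bytes instead of 12, and a short text value saves 2 bytes per cell.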

> Make all byte representations of types comparable by their unsigned byte 
> representation only
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6936
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6936
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance
>             Fix For: 3.0
>
>
> This could be a painful change, but is necessary for implementing a 
> trie-based index, and settling for less would be suboptimal; it also should 
> make comparisons cheaper all-round, and since comparison operations are 
> pretty much the majority of C*'s business, this should be easily felt (see 
> CASSANDRA-6553 and CASSANDRA-6934 for an example of some minor changes with 
> major performance impacts). No copying/special casing/slicing should mean 
> fewer opportunities to introduce performance regressions as well.
> Since I have slated for 3.0 a lot of non-backwards-compatible sstable 
> changes, hopefully this shouldn't be too much more of a burden.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
