[
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679079#comment-13679079
]
Edward Capriolo edited comment on CASSANDRA-4175 at 6/9/13 2:59 PM:
--------------------------------------------------------------------
CASSANDRA-2995 says
{quote}
It could be advantageous for Cassandra to make the storage engine pluggable.
This could allow Cassandra to
* deal with potential use cases where maybe the current sstables are not the
best fit
* allow several types of internal storage formats (at the same time)
optimized for different data types
{quote}
Since this issue talks about reducing disk space, it will change how data is
written, which seems to benefit people with mostly static columns. That sounds
right on the money with 2995. However, it goes beyond storage layer changes.
The feature makes a ton of sense and does not only benefit the cql3 case. Many
people have static columns, and since 0.7 standard column families have had
schema as well.
If cassandra had a 'pluggable storage format', one of the things the
'ColumnMapIdStorageFormat' could do is write the known schema to a small file,
loaded in memory with each sstable (like the bloom filter), that would contain
the mappings. In the end I think you would have to store this anyway, because
the mappings would change over time and what is in the schema now may not be
fully accurate for old flushed tables. This would only save storage, as
mentioned; the internode traffic could not be optimized with pluggable
storage alone.
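A rough sketch of what such a per-sstable mapping file could look like, purely
for illustration. The class name ColumnIdMapping, its methods, and the on-disk
layout (entry count followed by id/name pairs) are all assumptions made up for
this example, not anything that exists in Cassandra or in a patch on this
ticket. The point it shows is that each sstable carries the id-to-name snapshot
it was written with, so later schema changes do not affect how old files are
read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-sstable component: a snapshot of the id -> column name
// mapping that was in effect when the sstable was flushed.
final class ColumnIdMapping {
    private final Map<Integer, String> idToName;

    ColumnIdMapping(Map<Integer, String> idToName) {
        // Defensive copy, so later schema changes cannot alter this snapshot.
        this.idToName = new LinkedHashMap<>(idToName);
    }

    // Resolve an id using this sstable's snapshot, not the live schema.
    String nameOf(int id) {
        String name = idToName.get(id);
        if (name == null) {
            throw new IllegalArgumentException("unknown column id " + id);
        }
        return name;
    }

    // Assumed layout: entry count, then (int id, UTF-encoded name) pairs.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(idToName.size());
            for (Map.Entry<Integer, String> e : idToName.entrySet()) {
                out.writeInt(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Load the snapshot back, as would happen when the sstable is opened.
    static ColumnIdMapping deserialize(byte[] data) {
        Map<Integer, String> map = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int id = in.readInt();
                map.put(id, in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ColumnIdMapping(map);
    }
}
```

Like the bloom filter, something of this shape would be small enough to keep
in memory per sstable. It only addresses the storage side; it does nothing for
internode traffic.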
For compare and swap, well, whatever: it's just one feature and no one has to
use it if they do not want to. However, requiring all schema changes to need zk
is crazy scary to me. It is true that schema has always needed to propagate
before it can be used. I personally do not want to have to install zk side by
side with all my cassandra installs, and I do not want to rely on it for schema
changes.
Architecturally, building on zk is a house of cards. This was originally why I
chose cassandra over hbase (hbase had meta data on hdfs, and state information
with zk). The WORST thing that ever happens to cassandra is that a node has a
corrupt schema or a schema disagreement. I restart, or decommission and rejoin,
the node and it is fixed.
If we start storing bits of information (column ids, schema) in zookeeper, we
become totally reliant on it: nodes may or may not be able to start up without
it, we may or may not be able to make schema changes without it, and MOST
IMPORTANTLY, IT'S AN SPOF THAT, WHEN IT GOES CORRUPT, will likely cause the
entire cluster to die, or function in a way worse than death, something like
writing corrupt column ids to files and hopelessly corrupting everything.
No thanks to any ZK integration. ZK and centrally managed meta data = hbase.
> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>
> Key: CASSANDRA-4175
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Fix For: 2.1
>
>
> We spend a lot of memory on column names, both transiently (during reads) and
> more permanently (in the row cache). Compression mitigates this on disk but
> not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too
> via very high allocation rates in the young generation, hence more GC
> activity.
> Now that CQL3 provides us some guarantees that column names must be defined
> before they are inserted, we could create a map of (say) 32-bit int column
> ids to names, and use that internally right up until we return a resultset to
> the client.
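To make the name/id map in the description concrete, here is a minimal
node-local sketch. ColumnIdRegistry and its methods are hypothetical names
invented for this example; nothing here is taken from Cassandra's code. It
assumes column names are defined before use (as the description says CQL3
guarantees) and deliberately says nothing about how ids would be agreed on
across nodes, which is the contested part of this ticket.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory registry mapping column names to 32-bit ids and back.
final class ColumnIdRegistry {
    private final Map<String, Integer> nameToId = new ConcurrentHashMap<>();
    private final Map<Integer, String> idToName = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    // Called when a column is defined; returns the existing id if already known.
    int define(String name) {
        return nameToId.computeIfAbsent(name, n -> {
            int id = nextId.getAndIncrement();
            idToName.put(id, n);
            return id;
        });
    }

    // Internal code paths would carry the int id instead of the full name.
    int idOf(String name) {
        Integer id = nameToId.get(name);
        if (id == null) {
            throw new IllegalArgumentException("column not defined: " + name);
        }
        return id;
    }

    // Translate back to a name only when building the result for the client.
    String nameOf(int id) {
        String name = idToName.get(id);
        if (name == null) {
            throw new IllegalArgumentException("unknown column id " + id);
        }
        return name;
    }
}
```

With something of this shape, the read and write paths handle a 4-byte id per
column rather than a name string, and the name is looked up once at the edge.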