AFAIK there is no way to disable hoisting. Feel free to let your jira fingers do the talking.
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/12/2012, at 6:10 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. Also, it is confusing if these writes show up in JMX and look like user-generated write operations.
>
> On Mon, Dec 17, 2012 at 10:01 AM, Mike <mthero...@yahoo.com> wrote:
> Thank you Aaron, this was very helpful.
>
> Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated, and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get stuck rewriting to the same memtable, and the memtable is never flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly.
>
> I also noticed from the code paths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row caching and enabled row caching with a small cache. Our initial results are that it seems to have corrected the write counts and has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem. Any history on that? I couldn't find too much information on it.
>
> Thanks,
> -Mike
>
> On 12/16/2012 8:41 PM, aaron morton wrote:
>>
>>> 1) Am I reading things correctly?
>> Yes.
>> If you do a read/slice by name and more than min compaction threshold SSTables were read, the data is re-written so that the next read uses fewer SSTables.
>>
>>> 2) What is really happening here? Essentially, minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>> Yup, only for reading a row where the column names are specified. Remember that minor compaction when using SizeTiered compaction (the default) works on buckets of similarly sized SSTables.
>>
>> Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold SSTables. Say it is in 3 SSTables in the 2nd tier and 2 SSTables in the 1st, so it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up.
>>
>> But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted.
>>
>> There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket.
>>
>> It's not every select. Just those that touched more than the min compaction threshold of SSTables.
>>
>>> 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior?
>> Yes.
>> https://issues.apache.org/jira/browse/CASSANDRA-2503
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
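For anyone who wants to watch the hoisted writes Ed and Mike are discussing without patching the code, the same per-column-family counter that backs the WriteCount JMX attribute also shows up in nodetool. This is only a rough sketch, not something from the thread: it assumes the keyspace is named 'open' (taken from the traces further down), that nodetool is pointed at the local node, and that the 1.1-era cfstats output labels the lines "Column Family:" and "Write Count:".

    # Watch bob's write counter while repeating the select; if hoisting is
    # happening, the number climbs on reads until a minor compaction runs.
    watch -n 5 'nodetool -h 127.0.0.1 cfstats | grep -A 20 "Column Family: bob" | grep "Write Count"'
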
>> On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote:
>>
>>> Hello,
>>>
>>> We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code.
>>>
>>> We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as "Bob". During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load.
>>>
>>> The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3-10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables is what we expect.
>>>
>>> I brought up a test node to experiment, and I see a situation where, when a "select" statement is executed, a write will occur.
>>>
>>> In my test, I perform the following (switching between nodetool and cqlsh):
>>>
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>>
>>> Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause the column family's write count to be incremented:
>>>
>>> select about,changed,data from bob where key='<hex key>';
>>>
>>> This situation will continue until a minor compaction is completed.
>>>
>>> I went into the code and added some traces to CollationController.java:
>>>
>>> private ColumnFamily collectTimeOrderedData()
>>> {
>>>     logger.debug("collectTimeOrderedData");
>>>
>>>     ... <snip> ...
>>>
>>> ---> HERE logger.debug("tables iterated: " + sstablesIterated + " Min compact: " + cfs.getMinimumCompactionThreshold());
>>>     // "hoist up" the requested data into a more recent sstable
>>>     if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>>>         && !cfs.isCompactionDisabled()
>>>         && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>>>     {
>>>         RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>>>         try
>>>         {
>>> ---> HERE   logger.debug("Apply hoisted up row mutation");
>>>             // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>>>             Table.open(rm.getTable()).apply(rm, false, false);
>>>         }
>>>         catch (IOException e)
>>>         {
>>>             // log and allow the result to be returned
>>>             logger.error("Error re-writing read results", e);
>>>         }
>>>     }
>>>     ... <snip> ...
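Mike notes just below that he lowered the minimum compaction threshold on the test node to make this window easier to hit. A hedged sketch of one way to do that per column family, assuming the 1.1-era `nodetool setcompactionthreshold <keyspace> <cf> <min> <max>` form, the keyspace name 'open' from the traces, and the 4/32 defaults mentioned earlier in the thread:

    # Shrink the SizeTiered min threshold so fewer flushes trigger hoisting,
    # then put the defaults back once the test is done.
    nodetool -h 127.0.0.1 setcompactionthreshold open bob 2 32
    # ... re-run the update/flush/select sequence above ...
    nodetool -h 127.0.0.1 setcompactionthreshold open bob 4 32
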
>>>
>>> Performing the steps above, I see the following traces (in the test environment I decreased the minimum compaction threshold to make this easier to reproduce). After I do a couple of update/flush cycles, I see this in the log:
>>>
>>> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy
>>>
>>> Then, until compaction occurs, I see (when performing a select):
>>>
>>> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
>>> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
>>> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>>>
>>> The above traces occur every time I repeat the select statement.
>>>
>>> Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):
>>>
>>> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>>>
>>> Once minor compaction occurs, the write count stops being incremented on reads, until more than the minimum compaction threshold of memtables are flushed to disk.
>>>
>>> So, my questions are:
>>>
>>> 1) Am I reading things correctly?
>>>
>>> 2) What is really happening here? Essentially, minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>>>
>>> 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior?
>>>
>>> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
>>>
>>> Any help is appreciated,
>>>
>>> Thanks,
>>> -Mike
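To make the sequence above easier to re-run while watching the counters, here is a rough, hedged sketch that simply wraps the statements quoted in this thread in a loop. It assumes the keyspace is named 'open' (taken from the traces), that cqlsh accepts statements on stdin against the local node, and it keeps the row key as a placeholder rather than a real value.

    # Write the same row five times with a flush in between, as in Mike's test.
    KS=open; CF=bob; KEY='<hex key>'   # KEY is a placeholder, not a value from the thread
    for i in 1 2 3 4 5; do
        echo "use $KS; update $CF set 'about'='coworker' where key='$KEY';" | cqlsh
        nodetool flush $KS $CF
    done
    # Repeat this read; until a minor compaction runs, each one should bump
    # the column family's write count via the hoisting path described above.
    echo "use $KS; select about,changed,data from $CF where key='$KEY';" | cqlsh
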