AFAIK there is no way to disable hoisting. Feel free to let your jira fingers do the talking.
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/12/2012, at 6:10 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. Also, it is confusing if these writes show up in JMX and look like user-generated write operations.
>
> On Mon, Dec 17, 2012 at 10:01 AM, Mike <mthero...@yahoo.com> wrote:
> Thank you Aaron, this was very helpful.
>
> Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated, and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get stuck rewriting to the same memtable, and the memtable is never flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly.
>
> I also noticed from the code paths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row caching and enabled row caching with a small cache. Our initial results are that it seems to have corrected the write counts and has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem. Any history on that? I couldn't find too much information on it.
>
> Thanks,
> -Mike
>
> On 12/16/2012 8:41 PM, aaron morton wrote:
>>
>>> 1) Am I reading things correctly?
>> Yes.
>> If you do a read/slice by name and more than min compaction threshold SSTables were read, the data is re-written so that the next read uses fewer SSTables.
>>
>>> 2) What is really happening here? Essentially, minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>> Yup, only for reading a row where the column names are specified. Remember that minor compaction when using SizeTiered compaction (the default) works on buckets of similarly sized SSTables.
>>
>> Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold SSTables. Say it is in 3 SSTables in the 2nd tier and 2 SSTables in the 1st, so it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up.
>>
>> But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted.
>>
>> There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket.
>>
>> It's not every select. Just those that touched more than the min compaction threshold of SSTables.
>>
>>> 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior?
>> Yes.
>> https://issues.apache.org/jira/browse/CASSANDRA-2503
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
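For anyone who wants to watch the hoisted writes Ed and Mike are discussing without patching the code, the same per-column-family counter that backs the WriteCount JMX attribute also shows up in nodetool. This is only a rough sketch, not something from the thread: it assumes the keyspace is named 'open' (taken from the traces further down), that nodetool is pointed at the local node, and that the 1.1-era cfstats output labels the lines "Column Family:" and "Write Count:".

    # Watch bob's write counter while repeating the select; if hoisting is
    # happening, the number climbs on reads until a minor compaction runs.
    watch -n 5 'nodetool -h 127.0.0.1 cfstats | grep -A 20 "Column Family: bob" | grep "Write Count"'
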
>> On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote:
>>
>>> Hello,
>>>
>>> We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code.
>>>
>>> We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as "Bob". During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load.
>>>
>>> The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3-10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables is what we expect.
>>>
>>> I brought up a test node to experiment, and I see a situation where, when a "select" statement is executed, a write will occur.
>>>
>>> In my test, I perform the following (switching between nodetool and cqlsh):
>>>
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';
>>> nodetool flush
>>>
>>> Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause the column family's write count to be incremented:
>>>
>>> select about,changed,data from bob where key='<hex key>';
>>>
>>> This situation will continue until a minor compaction is completed.
>>>
>>> I went into the code and added some traces to CollationController.java:
>>>
>>> private ColumnFamily collectTimeOrderedData()
>>> {
>>>     logger.debug("collectTimeOrderedData");
>>>
>>>     ... <snip> ...
>>>
>>> ---> HERE logger.debug("tables iterated: " + sstablesIterated + " Min compact: " + cfs.getMinimumCompactionThreshold());
>>>     // "hoist up" the requested data into a more recent sstable
>>>     if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>>>         && !cfs.isCompactionDisabled()
>>>         && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>>>     {
>>>         RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>>>         try
>>>         {
>>> ---> HERE   logger.debug("Apply hoisted up row mutation");
>>>             // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>>>             Table.open(rm.getTable()).apply(rm, false, false);
>>>         }
>>>         catch (IOException e)
>>>         {
>>>             // log and allow the result to be returned
>>>             logger.error("Error re-writing read results", e);
>>>         }
>>>     }
>>>     ... <snip> ...
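Mike notes just below that he lowered the minimum compaction threshold on the test node to make this window easier to hit. A hedged sketch of one way to do that per column family, assuming the 1.1-era `nodetool setcompactionthreshold <keyspace> <cf> <min> <max>` form, the keyspace name 'open' from the traces, and the 4/32 defaults mentioned earlier in the thread:

    # Shrink the SizeTiered min threshold so fewer flushes trigger hoisting,
    # then put the defaults back once the test is done.
    nodetool -h 127.0.0.1 setcompactionthreshold open bob 2 32
    # ... re-run the update/flush/select sequence above ...
    nodetool -h 127.0.0.1 setcompactionthreshold open bob 4 32
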
>>>
>>> Performing the steps above, I see the following traces (in the test environment I decreased the minimum compaction threshold to make this easier to reproduce). After I do a couple of update/flush cycles, I see this in the log:
>>>
>>> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy
>>>
>>> Then, until compaction occurs, I see (when performing a select):
>>>
>>> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
>>> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
>>> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>>>
>>> The above traces occur every time I repeat the select statement.
>>>
>>> Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):
>>>
>>> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>>>
>>> Once minor compaction occurs, the write count stops being incremented on reads, until more than the minimum compaction threshold of memtables are flushed to disk.
>>>
>>> So, my questions are:
>>>
>>> 1) Am I reading things correctly?
>>>
>>> 2) What is really happening here? Essentially, minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>>>
>>> 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior?
>>>
>>> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
>>>
>>> Any help is appreciated,
>>>
>>> Thanks,
>>> -Mike
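To make the sequence above easier to re-run while watching the counters, here is a rough, hedged sketch that simply wraps the statements quoted in this thread in a loop. It assumes the keyspace is named 'open' (taken from the traces), that cqlsh accepts statements on stdin against the local node, and it keeps the row key as a placeholder rather than a real value.

    # Write the same row five times with a flush in between, as in Mike's test.
    KS=open; CF=bob; KEY='<hex key>'   # KEY is a placeholder, not a value from the thread
    for i in 1 2 3 4 5; do
        echo "use $KS; update $CF set 'about'='coworker' where key='$KEY';" | cqlsh
        nodetool flush $KS $CF
    done
    # Repeat this read; until a minor compaction runs, each one should bump
    # the column family's write count via the hoisting path described above.
    echo "use $KS; select about,changed,data from $CF where key='$KEY';" | cqlsh
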