[ 
https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725604#comment-13725604
 ] 

Benedict commented on CASSANDRA-2698:
-------------------------------------

Hi Yuki,

Had some fun rebasing, but think everything looks good now. A few things to 
note:

1) I'm not sure what you mean by not "serializing those" - for correctness I 
serialize all of the data in a node. Do you want me to change the serialization 
methods to not send these values? I don't log them the other end, but I would 
prefer they were sent to ensure no surprises for users of the data, and also 
because of some optimisations to difference() that rely on knowing the number 
of rows for each sub-tree. It's not a tremendous amount of data after all.

2) I've modified DifferencerTest, and created two versions of the 
testDifference() method - one that tests differences on an empty tree, and one 
which tests a tree that has been populated with rows. Previously only the 
former was tested. This is because the changes I made to difference() for my 
previous patch, which I have retained and which ensures contiguous ranges are 
emitted where possible, treats the entire empty tree as one contiguous 
difference range (since the only non-empty sub-range in the tree is different), 
which was breaking the previous test. This test now works with the fully 
populated tree, and the previous test now confirms that the whole tree is 
considered different when it is empty. It's possible you may want to not deploy 
these improvements in this patch, but it seems a good idea to me whilst it's 
being modified, and given that I'd made the change already. Since we're not 
logging the ranges themselves at this time it won't have any direct impact, but 
it will be useful if that ever changes, and might help with future debugging.

3) I've updated the MerkleTreeTest methods to test the serialization and 
difference changes, and introduced a new HistogramBuilderTest

4) The histogram is built differently from my first patch, and is described in 
HistogramBuilder. Basically rather than creating neat linear ranges, I 
calculate the mean and create ranges that are multiples of the standard 
deviation either side of the mean, up to min/max (or, in this case, 3 stdevs, 
plus one range to min/max)

5) One thing we might want to consider changing is the format of the 
EstimatedHistogram ranges in the log messages. I've reproduced faithfully the 
boundary conventions of the EstimatedHistogram, but this is not a user friendly 
convention - it has an exclusive lower bound and inclusive upper bound, as 
opposed to the typical opposite convention. As such we get ranges like (-1, 0] 
to represent the range containing only 0, as opposed to [0, 1)

Think that's everything. Should respond quickly to queries at the moment, so 
drop me a line if you have any questions.
                
> Instrument repair to be able to assess it's efficiency (precision)
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-2698
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2698
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Benedict
>            Priority: Minor
>              Labels: lhf
>         Attachments: nodetool_repair_and_cfhistogram.tar.gz, 
> patch_2698_v1.txt, patch.diff, patch-rebased.diff, patch.taketwo.alpha.diff
>
>
> Some reports indicate that repair sometime transfer huge amounts of data. One 
> hypothesis is that the merkle tree precision may deteriorate too much at some 
> data size. To check this hypothesis, it would be reasonably to gather 
> statistic during the merkle tree building of how many rows each merkle tree 
> range account for (and the size that this represent). It is probably an 
> interesting statistic to have anyway.   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to