[ 
https://issues.apache.org/jira/browse/OAK-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159124#comment-15159124
 ] 

Alex Parvulescu edited comment on OAK-3893 at 2/23/16 4:21 PM:
---------------------------------------------------------------

bq. The approach looks good to me. What concerns me a bit is the possibility 
(and the uncertainty that goes along with it) of introducing inconsistencies 
upon hitting a hash collision.
agree with the sentiment, up until very recently the template equals code was 
broken, and we didn't even know about it.

bq. That's why I would prefer to refactor the cache into its own class exposing 
an interface Key, which would be the type of its keys.
Not against the idea itself, I'm not sold on conflating 2 issues: refactoring 
the cache, and reevaluating the template as a key. one doesn't influence the 
other in any way. nor does having a {{Key}} interface make the hashing bits 
more/less secure from collision.

bq. A Key implementation could then fall back to containing the whole template 
or could use any other method for hashing (AbstractBlob e.g. uses SHA256 vs. 
md5 in this case).
right, good example! I have no preference between sha256 or md5 really, we can 
be consistent and use sha256 all over, if it's preferable. the decision of 
putting the template or the hash in the cache still needs to be made 
irrespective of the state of the cache's refactoring, so I would really 
separate the 2. 

Could we discuss the cache refactoring in a dedicated issue? I think it needs a 
lot more thought and it would go in a direction that has little overlap with 
the existing patch. We have quite a few caches already and even mixes of them 
(take a look at the {{StringCache}} next at the {{SegmentId#getSegment}}'s 
cache sampling method). making this cache more than a simple map needs careful 
evaluation of performance, this issue is only about reducing the memory 
footprint of the keys.



was (Author: alex.parvulescu):
bq. The approach looks good to me. What concerns me a bit is the possibility 
(and the uncertainty that goes along with it) of introducing inconsistencies 
upon hitting a hash collision.
agree with the sentiment, up until very recently the template equals code was 
broken, and we didn't even know about it.

bq. That's why I would prefer to refactor the cache into its own class exposing 
an interface Key, which would be the type of its keys.
Not against the idea itself, I'm not sold on conflating 2 issues: refactoring 
the cache, and reevaluating the template as a key. one influence the other in 
any way. nor does having a {{Key}} interface make the hashing bits more/less 
secure.

bq. A Key implementation could then fall back to containing the whole template 
or could use any other method for hashing (AbstractBlob e.g. uses SHA256 vs. 
md5 in this case).
right, good example! I have no preference between sha256 or md5 really, we can 
be consistent and use sha256 all over, if it's preferable. the decision of 
putting the template or the hash in the cache still needs to be made 
irrespective of the state of the cache's refactoring, so I would really 
separate the 2. 

Could we discuss the cache refactoring in a dedicated issue? I think it needs a 
lot more thought and it would go in a direction that has little overlap with 
the existing patch. We have quite a few caches already and even mixes of them 
(take a look at the {{StringCache}} next at the {{SegmentId#getSegment}}'s 
cache sampling method). making this cache more than a simple map needs careful 
evaluation of performance, this issue is only about reducing the memory 
footprint of the keys.


> SegmentWriter records cache could use thinner keys
> --------------------------------------------------
>
>                 Key: OAK-3893
>                 URL: https://issues.apache.org/jira/browse/OAK-3893
>             Project: Jackrabbit Oak
>          Issue Type: Technical task
>          Components: segmentmk
>            Reporter: Alex Parvulescu
>            Assignee: Alex Parvulescu
>            Priority: Minor
>         Attachments: OAK-3893.patch
>
>
> The SegmentWriter keeps a records deduplication cache ('records' map) that 
> maintains 2 types of mappings:
> * template -> recordid
> * strings -> recordid
> For the first one (template-> recordid) we can come up with a thinner 
> representation of a template (a hash function that is fast and not very 
> collision prone) so we don't have to keep a reference to each template object.
> Same applies for second one, similar to what is happening in the StringsCache 
> now, we could keep the string value up to a certain size and beyond that, 
> hash it and use that for the deduplication map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to