[
https://issues.apache.org/jira/browse/OAK-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159124#comment-15159124
]
Alex Parvulescu edited comment on OAK-3893 at 2/23/16 4:21 PM:
---------------------------------------------------------------
bq. The approach looks good to me. What concerns me a bit is the possibility
(and the uncertainty that goes along with it) of introducing inconsistencies
upon hitting a hash collision.
agree with the sentiment, up until very recently the template equals code was
broken, and we didn't even know about it.
bq. That's why I would prefer to refactor the cache into its own class exposing
an interface Key, which would be the type of its keys.
Not against the idea itself, I'm not sold on conflating 2 issues: refactoring
the cache, and reevaluating the template as a key. one doesn't influence the
other in any way. nor does having a {{Key}} interface make the hashing bits
more/less secure from collision.
bq. A Key implementation could then fall back to containing the whole template
or could use any other method for hashing (AbstractBlob e.g. uses SHA256 vs.
md5 in this case).
right, good example! I have no preference between sha256 or md5 really, we can
be consistent and use sha256 all over, if it's preferable. the decision of
putting the template or the hash in the cache still needs to be made
irrespective of the state of the cache's refactoring, so I would really
separate the 2.
Could we discuss the cache refactoring in a dedicated issue? I think it needs a
lot more thought and it would go in a direction that has little overlap with
the existing patch. We have quite a few caches already and even mixes of them
(take a look at the {{StringCache}} next at the {{SegmentId#getSegment}}'s
cache sampling method). making this cache more than a simple map needs careful
evaluation of performance, this issue is only about reducing the memory
footprint of the keys.
was (Author: alex.parvulescu):
bq. The approach looks good to me. What concerns me a bit is the possibility
(and the uncertainty that goes along with it) of introducing inconsistencies
upon hitting a hash collision.
agree with the sentiment, up until very recently the template equals code was
broken, and we didn't even know about it.
bq. That's why I would prefer to refactor the cache into its own class exposing
an interface Key, which would be the type of its keys.
Not against the idea itself, I'm not sold on conflating 2 issues: refactoring
the cache, and reevaluating the template as a key. one influence the other in
any way. nor does having a {{Key}} interface make the hashing bits more/less
secure.
bq. A Key implementation could then fall back to containing the whole template
or could use any other method for hashing (AbstractBlob e.g. uses SHA256 vs.
md5 in this case).
right, good example! I have no preference between sha256 or md5 really, we can
be consistent and use sha256 all over, if it's preferable. the decision of
putting the template or the hash in the cache still needs to be made
irrespective of the state of the cache's refactoring, so I would really
separate the 2.
Could we discuss the cache refactoring in a dedicated issue? I think it needs a
lot more thought and it would go in a direction that has little overlap with
the existing patch. We have quite a few caches already and even mixes of them
(take a look at the {{StringCache}} next at the {{SegmentId#getSegment}}'s
cache sampling method). making this cache more than a simple map needs careful
evaluation of performance, this issue is only about reducing the memory
footprint of the keys.
> SegmentWriter records cache could use thinner keys
> --------------------------------------------------
>
> Key: OAK-3893
> URL: https://issues.apache.org/jira/browse/OAK-3893
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: segmentmk
> Reporter: Alex Parvulescu
> Assignee: Alex Parvulescu
> Priority: Minor
> Attachments: OAK-3893.patch
>
>
> The SegmentWriter keeps a records deduplication cache ('records' map) that
> maintains 2 types of mappings:
> * template -> recordid
> * strings -> recordid
> For the first one (template-> recordid) we can come up with a thinner
> representation of a template (a hash function that is fast and not very
> collision prone) so we don't have to keep a reference to each template object.
> Same applies for second one, similar to what is happening in the StringsCache
> now, we could keep the string value up to a certain size and beyond that,
> hash it and use that for the deduplication map.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)