[
https://issues.apache.org/jira/browse/AVRO-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193699#comment-13193699
]
Raymie Stata commented on AVRO-1006:
------------------------------------
Sam writes:{quote}On the 0.001% collision rate, that seems high to me - would a
128-bit hash be a better choice?{quote}
Thanks for pointing this out. It turns out the 0.001% is a bug in the writeup;
the actual probabilities are quite a bit lower: 3E-8 (0.000003%) for a
million-item cache, 3E-10 for 100K items, and 3E-12 for 10K items (I'd love to
have someone check my math). Assuming one insertion per minute into a
fixed-size table (i.e., random eviction), you'd expect a collision every year
with the 1M-item cache, every century with 100K items, and every millennium with
10K items. This seems acceptable, especially since I expect these caches to be
closer to 10K items than a million (there's a bit of a discussion on this point
in the updated writeup). So are you happier now with 64 bits?
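For anyone who wants to check the table-wide probabilities quoted above, here is a minimal sketch. It assumes the fingerprints behave like uniformly random 64-bit values and uses the standard birthday approximation; the function name collision_probability is made up for illustration and is not part of the patch. It does not model the per-minute insertion scenario.

```python
import math

def collision_probability(n_items: int, bits: int = 64) -> float:
    """Birthday approximation: P ~= 1 - exp(-n(n-1)/2 / 2^bits).

    Assumes fingerprints are uniformly distributed over 2^bits values.
    """
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-pairs / 2.0 ** bits)

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9,} items: {collision_probability(n):.1e}")
```

Under these assumptions the results land within rounding of the 3E-12, 3E-10, and 3E-8 figures. Passing bits=128 shows how much headroom a wider hash would add.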
(The doc defines a canonical text for schemas, and fingerprints based on that
text. The patch will contain a function for returning the canonical text.
This approach implicitly standardizes how one would take an MD5 or SHA-xxx
fingerprint of a schema, but perhaps I can be explicit on this point.)
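As a rough illustration of that last point: once a canonical text is available, an MD5 or SHA fingerprint is just the digest of its bytes. The sketch below is an assumption-laden example, not the patch. The canonical_text value is a placeholder rather than output of the real canonicalization function, and UTF-8 encoding is assumed.

```python
import hashlib

def digest_fingerprint(canonical_text: str, algorithm: str = "md5") -> bytes:
    """Hash the canonical schema text with a standard digest.

    Assumes the text is encoded as UTF-8 before hashing.
    """
    return hashlib.new(algorithm, canonical_text.encode("utf-8")).digest()

if __name__ == "__main__":
    # Placeholder only; the real canonical text comes from the writeup's rules.
    canonical_text = '"string"'
    print(digest_fingerprint(canonical_text, "md5").hex())
    print(digest_fingerprint(canonical_text, "sha256").hex())
```

Two schemas with the same canonical text get the same digest, which is what makes the canonical form the thing worth standardizing.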
> Fingerprints for Avro Schemas
> -----------------------------
>
> Key: AVRO-1006
> URL: https://issues.apache.org/jira/browse/AVRO-1006
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Raymie Stata
> Assignee: Raymie Stata
> Labels: features
> Attachments: schema-fingerprinting.html, schema-fingerprinting.html
>
>
> Add function that returns a standardized, 64-bit fingerprint for schemas.
> Fingerprints are designed such that the chance of a collision is very, very
> low.