On Tue, Apr 14, 2015 at 3:28 AM, Branko Čibej <br...@wandisco.com> wrote:
> On 13.04.2015 18:04, Stefan Fuhrmann wrote: > > On Mon, Apr 13, 2015 at 11:20 PM, Daniel Shahaf > > <d...@daniel.shahaf.name <mailto:d...@daniel.shahaf.name>> wrote: > > > > rhuij...@apache.org <mailto:rhuij...@apache.org> wrote on Mon, Apr > > 13, 2015 at 14:28:39 -0000: > > > - *has_props = (noderev->prop_rep->expanded_size > 4); > > > + *has_props = (noderev->prop_rep->expanded_size > 4 > > > + || (noderev->prop_rep->expanded_size == 0 > > > + && noderev->prop_rep->size > 4)); > > > > Having to remember, on every use of EXPANDED_SIZE, to check if > > it's zero > > is madness. > > > > > > The pain has been increasing and is now at a level > > where we should do something about it. > > > > > > Is there a good reason not to make the deserializer > > function handle this? If needed, by inventing a getter or a new > > struct > > member that doesn't have this quirk? > > > > > > The main problem is that we don't know whether SIZE is > > correct, either, without looking at the actual representation: > > It could be a self-delta of an empty fulltext. So, most of > > these tests are contextual, we e.g. know that these are > > property reps. Or we can allow for false positives or false > > negatives etc. > > > > One way to get around this problem is to basically use > > the approach of svn_fs_fs__file_length(), where we compare > > the contents checksum with with that of an empty rep. > > We might also restrict that check to "meaningful" values > > of "SIZE", i.e. those txdelta streams that produce an > > empty output. > > > > Assuming we find a truly reliable test (e.g. say txdelta is > > always 4 bytes and we can enumerate all MD5s), we could > > add it to the parser function in low_level.c. > > > > Thoughts? > > I believe that currently a txdelta representation will always be at > least 4 bytes long; it's represented in svndiff format, which has a > 4-byte magic header. Unless something has changed since I wrote that > part of the code years ago, we'll discard delta representations that are > larger than the fulltext, which implies that empty reps will always be > stored as fulltext. > That should still be the case. So, repositories written exclusively by SVN will be easy to check. For others, there is a fallback (see below). > I'd hate to rely on such heuristics in general, but for this specific > case, if they turn out to be valid ... and as long as we really > carefully document and cross-reference the assumptions ... and so on. > I wrote a simple test to verify (in less than 12mins) that no 4 byte sequence produces the same MD5 as the empty content. So, we can resolve EXPANDED_SIZE immediately after parsing like this: * EXPANDED_SIZE > 0 -> Done. * MD5 != MD5(empty) -> EXPANDED_SIZE := SIZE. Done. * SIZE == 4 -> Done. At that point, the rep may be either a non-SVN-generated txdelta - in which case EXPANDED_SIZE must already be correct - or is a PLAIN rep with MD5 hash collision. Both are extremely unlikely but can be solved by reading the representation header. Then both cases can be told apart: * PLAIN rep -> EXPANDED_SIZE := SIZE. * Done. -- Stefan^2. [[[ static svn_error_t * all_32_bit_md5(apr_pool_t *pool) { apr_pool_t *iterpool = svn_pool_create(pool); apr_uint64_t i; svn_checksum_t *checksum; /* Be sure we get the the correct MD5 for an empty input. */ SVN_ERR(svn_checksum(&checksum, svn_checksum_md5, "", 0, iterpool)); SVN_TEST_ASSERT(svn_checksum_is_empty_checksum(checksum)); /* No 4 byte sequence shall ever produce the same MD5 as the empty input. */ for (i = 0; i <= APR_UINT32_MAX; ++i) { apr_uint32_t value = (apr_uint32_t)i; svn_pool_clear(iterpool); SVN_ERR(svn_checksum(&checksum, svn_checksum_md5, &value, sizeof(value), iterpool)); SVN_TEST_ASSERT(!svn_checksum_is_empty_checksum(checksum)); } return SVN_NO_ERROR; } ]]]