Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On Mon, May 22, 2017 at 10:56 PM, Johan Corveleynwrote: > On Tue, May 9, 2017 at 1:21 PM, Johan Corveleyn wrote: >> On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling wrote: >>> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote: This code is still in trunk without any of the discussed improvements, so this change is currently part of 1.10.0-alpha1. If we don't implement the improvements I think we should check if we want to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. See discussion below Bert >>> >>> I think the proposed approach as implemented on trunk can no longer be >>> considered viable, unfortunately, because of this step: >>> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file > >>>and compare it with pristine's checksum stored in wc.db. >>> >>> Given that the SHA1 collision problem is real, we are now trying to stop >>> relying on hashes to compare content. So it does not make sense to add >>> new code which relies on hashes in this way, in my opinion. >>> >>> It seems that using SHA1 to compare content is key to the proposed approach. >>> If that is correct, then I don't agree with releasing 1.10 with this feature >>> and I would be in favour of reverting this change. >>> >>> Ivan, do you have any further comments on this thread? You have remained >>> silent for quite some time now :( >> >> Where are we with this? Seems the consensus is to revert r1759233 to >> not further increase our reliance on sha1? Or is there still a way to >> keep r1759233 in some way, and improve it to make the sha1 test >> "sensitive but not specific", like danielsh proposed? >> >> Ivan? > > Hmmm, I'm wondering, now we decided to disallow sha1 collisions to > enter the repository, how does this reflect on this discussion? > > Okay, it's another "collision-sensitive-dependency" on sha1, but there > are others in the working copy (pristines) and ra_serf, and since > we're assuming a non-sha1-collision world (by blocking them from the > repository), it makes little sense to me to simply revert this, and > not fix other sha1-dependencies. OTOH, if we want to gradually reduce > our dependence on sha1, this is an easy one to remove, since it's not > in production yet ... Dunno. > > However, there were also other, performance-related, objections to the > new implementation. Ivan proposed improvements [1] but they were never > implemented. > > So, what to do? > > Unless someone objects in the coming couple of days, I'd say we revert > r1759233. > > [1] https://svn.haxx.se/dev/archive-2016-09/0016.shtml > > -- > Johan I agree in not adding more new logs to the fire. I also agree we should attack the already-in-prod deps on SHA1 as well in the background-
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On Tue, May 9, 2017 at 1:21 PM, Johan Corveleynwrote: > On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling wrote: >> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote: >>> This code is still in trunk without any of the discussed improvements, so >>> this change is currently part of 1.10.0-alpha1. >>> >>> If we don't implement the improvements I think we should check if we want >>> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. >>> >>> See discussion below >>> >>> Bert >> >> I think the proposed approach as implemented on trunk can no longer be >> considered viable, unfortunately, because of this step: >> >>> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file >>> > >>>and compare it with pristine's checksum stored in wc.db. >> >> Given that the SHA1 collision problem is real, we are now trying to stop >> relying on hashes to compare content. So it does not make sense to add >> new code which relies on hashes in this way, in my opinion. >> >> It seems that using SHA1 to compare content is key to the proposed approach. >> If that is correct, then I don't agree with releasing 1.10 with this feature >> and I would be in favour of reverting this change. >> >> Ivan, do you have any further comments on this thread? You have remained >> silent for quite some time now :( > > Where are we with this? Seems the consensus is to revert r1759233 to > not further increase our reliance on sha1? Or is there still a way to > keep r1759233 in some way, and improve it to make the sha1 test > "sensitive but not specific", like danielsh proposed? > > Ivan? Hmmm, I'm wondering, now we decided to disallow sha1 collisions to enter the repository, how does this reflect on this discussion? Okay, it's another "collision-sensitive-dependency" on sha1, but there are others in the working copy (pristines) and ra_serf, and since we're assuming a non-sha1-collision world (by blocking them from the repository), it makes little sense to me to simply revert this, and not fix other sha1-dependencies. OTOH, if we want to gradually reduce our dependence on sha1, this is an easy one to remove, since it's not in production yet ... Dunno. However, there were also other, performance-related, objections to the new implementation. Ivan proposed improvements [1] but they were never implemented. So, what to do? Unless someone objects in the coming couple of days, I'd say we revert r1759233. [1] https://svn.haxx.se/dev/archive-2016-09/0016.shtml -- Johan
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperlingwrote: > On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote: >> This code is still in trunk without any of the discussed improvements, so >> this change is currently part of 1.10.0-alpha1. >> >> If we don't implement the improvements I think we should check if we want >> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. >> >> See discussion below >> >> Bert > > I think the proposed approach as implemented on trunk can no longer be > considered viable, unfortunately, because of this step: > >> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file >> > >>>and compare it with pristine's checksum stored in wc.db. > > Given that the SHA1 collision problem is real, we are now trying to stop > relying on hashes to compare content. So it does not make sense to add > new code which relies on hashes in this way, in my opinion. > > It seems that using SHA1 to compare content is key to the proposed approach. > If that is correct, then I don't agree with releasing 1.10 with this feature > and I would be in favour of reverting this change. > > Ivan, do you have any further comments on this thread? You have remained > silent for quite some time now :( Where are we with this? Seems the consensus is to revert r1759233 to not further increase our reliance on sha1? Or is there still a way to keep r1759233 in some way, and improve it to make the sha1 test "sensitive but not specific", like danielsh proposed? Ivan? -- Johan
RE: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
Hi, > -Original Message- > From: Stefan Sperling [mailto:s...@elego.de] > Sent: Tuesday, April 04, 2017 12:19 PM > To: Daniel Shahaf > Cc: dev@subversion.apache.org > Subject: Re: Check SHA vs Content (was: RE: svn commit: r1759233 - > /subversion/trunk/subversion/libsvn_wc/questions.c) > > On Tue, Apr 04, 2017 at 10:00:31AM +, Daniel Shahaf wrote: > > Note that we can still do what rep-cache now does: treat equality of > > checksums as a sensitive but not specific test for bit-for-bit > > equality; that is: checksum the file, and if the resulting value is > > equal to the stored value, do a full-text comparison too. > > Yeah, that sounds OK. INHO, it's also still sensible to use SHA1 (or even MD5) as a "technical" checksum to protect against problems like transmission errors, storage failure, random bit flips or accidental changing of pristine copies. It's just that we must not use SHA1 as a sole means of identifying content or content identity. Best regards Markus Schaber CODESYS® a trademark of 3S-Smart Software Solutions GmbH Inspiring Automation Solutions 3S-Smart Software Solutions GmbH Dipl.-Inf. Markus Schaber | Product Development Core Technology Memminger Str. 151 | 87439 Kempten | Germany Tel. +49-831-54031-979 | Fax +49-831-54031-50 E-Mail: m.scha...@codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com CODESYS forum: http://forum.codesys.com Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915 This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure or distribution of the material in this e-mail is strictly forbidden.
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On Tue, Apr 04, 2017 at 10:00:31AM +, Daniel Shahaf wrote: > Note that we can still do what rep-cache now does: treat equality of > checksums as a sensitive but not specific test for bit-for-bit equality; > that is: checksum the file, and if the resulting value is equal to the > stored value, do a full-text comparison too. Yeah, that sounds OK.
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
Stefan Sperling wrote on Tue, Apr 04, 2017 at 11:33:18 +0200: > On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote: > > This code is still in trunk without any of the discussed improvements, so > > this change is currently part of 1.10.0-alpha1. > > > > If we don't implement the improvements I think we should check if we want > > to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. > > > > See discussion below > > > > Bert > > I think the proposed approach as implemented on trunk can no longer be > considered viable, unfortunately, because of this step: > > > > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file > > > >>>and compare it with pristine's checksum stored in wc.db. > > Given that the SHA1 collision problem is real, we are now trying to stop > relying on hashes to compare content. So it does not make sense to add > new code which relies on hashes in this way, in my opinion. Note that we can still do what rep-cache now does: treat equality of checksums as a sensitive but not specific test for bit-for-bit equality; that is: checksum the file, and if the resulting value is equal to the stored value, do a full-text comparison too. I'm not saying that this is what libsvn_wc should do; I'm just pointing out that it is still possible to use sha1 safely. Cheers, Daniel > It seems that using SHA1 to compare content is key to the proposed approach. > If that is correct, then I don't agree with releasing 1.10 with this feature > and I would be in favour of reverting this change.
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote: > This code is still in trunk without any of the discussed improvements, so > this change is currently part of 1.10.0-alpha1. > > If we don't implement the improvements I think we should check if we want > to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. > > See discussion below > > Bert I think the proposed approach as implemented on trunk can no longer be considered viable, unfortunately, because of this step: > > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file > > >>>and compare it with pristine's checksum stored in wc.db. Given that the SHA1 collision problem is real, we are now trying to stop relying on hashes to compare content. So it does not make sense to add new code which relies on hashes in this way, in my opinion. It seems that using SHA1 to compare content is key to the proposed approach. If that is correct, then I don't agree with releasing 1.10 with this feature and I would be in favour of reverting this change. Ivan, do you have any further comments on this thread? You have remained silent for quite some time now :(
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
This code is still in trunk without any of the discussed improvements, so this change is currently part of 1.10.0-alpha1. If we don't implement the improvements I think we should check if we want to revert to the 1.0-1.9 behavior before we really look at releasing 1.10. See discussion below Bert On Thu, Sep 8, 2016 at 5:42 PM, Ivan Zhakovwrote: > On 5 September 2016 at 19:23, Ivan Zhakov wrote: > > On 5 September 2016 at 14:46, Bert Huijben wrote: > >>> -Original Message- > >>> From: i...@apache.org [mailto:i...@apache.org] > >>> Sent: maandag 5 september 2016 13:33 > >>> To: comm...@subversion.apache.org > >>> Subject: svn commit: r1759233 - > >>> /subversion/trunk/subversion/libsvn_wc/questions.c > >>> > >>> Author: ivan > >>> Date: Mon Sep 5 11:32:54 2016 > >>> New Revision: 1759233 > >>> > >>> URL: http://svn.apache.org/viewvc?rev=1759233=rev > >>> Log: > >>> Use SHA-1 checksum to find whether files are actually modified in > working > >>> copy if timestamps don't match. > >>> > >>> Before this change we were doing this: > >>> 1. Compare file timestamps: if they match, assume that files didn't > change. > >>> 2. Open pristine file. > >>> 3. Read properties from wc.db and find whether translation is required. > >>> 4. Compare filesize with pristine filesize for files that do not > >>>require translation. Assume that file is modified if the sizes > differ. > >>> 5. Compare detranslated contents of working file with pristine. > >>> > >>> Now behavior is the following: > >>> 1. Compare file timestamps: if they match, assume that files didn't > change. > >>> 3. Read properties from wc.db and find whether translation is required. > >>> 3. Compare filesize with pristine filesize for files that do not > >>>require translation. Assume that file is modified if the sizes > differ. > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file > >>>and compare it with pristine's checksum stored in wc.db. > >> > > Hi Bert, > > > >> We looked at this before, and this change has pro-s and con-s, > depending on specific use cases. > >> > > Thanks for bringing this to dev@ list, I was not aware that this topic > > was discussed before. > > > [...] > > >> If the file happens to be a database file or something similar > >> there is quite commonly a change in the first 'block', when > >> there are changes somewhere later on. (Checksum, change > >> counter, etc.). File formats like sqlite were explicitly designed > >> for this (and other cheap checks), with a change counter at the start. > > > >> I don't think we should 'just change behavior' here, if we don't > >> have actual usage numbers for our users. Perhaps we should make > >> this feature configurable... or depending on filesize. > >> > > > > Let me summarize all possible cases that I considered before my > > change. First of all some definitions: > > * Text file (T) -- text file that require translation, due to eol > > style or keywords expansion > > * Text file (N) -- text file that doesn't require translation > > * Binary file -- some kind of binary file (database, pdf, zip, docx). > > Let's assume that user doesn't configure svn:eol-style and > > svn:keywords for them. > > * WS -- size of working file > > * PS -- size of pristine file > > > > * Old=xxx -- average required read size for old behavior in terms of > > working and pristine file sizes > > * New=xxx -- average required read size for new behavior in terms of > > working and pristine file sizes > > > > 1. Text file (T), not modified: Old = WS + PS, New = WS > > 2. Text file (N), not modified: Old = WS + PS, New = WS > > 3. Binary file, not modified: Old = WS + PS, New = WS > > 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = > WS > > 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = > WS > > 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS > > 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS, > New = WS > > 8. Text file (N), modified, different size: Old = 0, New = 0 > > 9. Binary file, modified, different size: Old = 0, New = 0 > > > Hi Bert, > > I tested several different binary file formats for no-op/minimal change: > 1. Microsoft Word (docx): change single character at the end of document: >- filesize changes (case 9) >- first change at offset 2,295 of 233,323 > 2. Microsoft Word (doc): change single character at the end of document: >- filesize didn't change (case 6) >- first change at offset 540 of 479,232 > 3. sqlite database: insert one row to wc.db (2.5mb) >- filesize didn't change (case 6) >- first change at offset 27 > 4. zip archive: change single character in one of many text files (43 > mb uncompressed) >- filesize changes (case 9) >- first change at offset 7,182,933 of 10,352,080 > 5. pdf file: no-op change of 800kb file >- filesize changes (case 9) >-
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On 5 September 2016 at 19:23, Ivan Zhakovwrote: > On 5 September 2016 at 14:46, Bert Huijben wrote: >>> -Original Message- >>> From: i...@apache.org [mailto:i...@apache.org] >>> Sent: maandag 5 september 2016 13:33 >>> To: comm...@subversion.apache.org >>> Subject: svn commit: r1759233 - >>> /subversion/trunk/subversion/libsvn_wc/questions.c >>> >>> Author: ivan >>> Date: Mon Sep 5 11:32:54 2016 >>> New Revision: 1759233 >>> >>> URL: http://svn.apache.org/viewvc?rev=1759233=rev >>> Log: >>> Use SHA-1 checksum to find whether files are actually modified in working >>> copy if timestamps don't match. >>> >>> Before this change we were doing this: >>> 1. Compare file timestamps: if they match, assume that files didn't change. >>> 2. Open pristine file. >>> 3. Read properties from wc.db and find whether translation is required. >>> 4. Compare filesize with pristine filesize for files that do not >>>require translation. Assume that file is modified if the sizes differ. >>> 5. Compare detranslated contents of working file with pristine. >>> >>> Now behavior is the following: >>> 1. Compare file timestamps: if they match, assume that files didn't change. >>> 3. Read properties from wc.db and find whether translation is required. >>> 3. Compare filesize with pristine filesize for files that do not >>>require translation. Assume that file is modified if the sizes differ. >>> 4. Calculate SHA-1 checksum of detranslated contents of working file >>>and compare it with pristine's checksum stored in wc.db. >> > Hi Bert, > >> We looked at this before, and this change has pro-s and con-s, depending on >> specific use cases. >> > Thanks for bringing this to dev@ list, I was not aware that this topic > was discussed before. > [...] >> If the file happens to be a database file or something similar >> there is quite commonly a change in the first 'block', when >> there are changes somewhere later on. (Checksum, change >> counter, etc.). File formats like sqlite were explicitly designed >> for this (and other cheap checks), with a change counter at the start. > >> I don't think we should 'just change behavior' here, if we don't >> have actual usage numbers for our users. Perhaps we should make >> this feature configurable... or depending on filesize. >> > > Let me summarize all possible cases that I considered before my > change. First of all some definitions: > * Text file (T) -- text file that require translation, due to eol > style or keywords expansion > * Text file (N) -- text file that doesn't require translation > * Binary file -- some kind of binary file (database, pdf, zip, docx). > Let's assume that user doesn't configure svn:eol-style and > svn:keywords for them. > * WS -- size of working file > * PS -- size of pristine file > > * Old=xxx -- average required read size for old behavior in terms of > working and pristine file sizes > * New=xxx -- average required read size for new behavior in terms of > working and pristine file sizes > > 1. Text file (T), not modified: Old = WS + PS, New = WS > 2. Text file (N), not modified: Old = WS + PS, New = WS > 3. Binary file, not modified: Old = WS + PS, New = WS > 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS > 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS > 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS > 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS, New = > WS > 8. Text file (N), modified, different size: Old = 0, New = 0 > 9. Binary file, modified, different size: Old = 0, New = 0 > Hi Bert, I tested several different binary file formats for no-op/minimal change: 1. Microsoft Word (docx): change single character at the end of document: - filesize changes (case 9) - first change at offset 2,295 of 233,323 2. Microsoft Word (doc): change single character at the end of document: - filesize didn't change (case 6) - first change at offset 540 of 479,232 3. sqlite database: insert one row to wc.db (2.5mb) - filesize didn't change (case 6) - first change at offset 27 4. zip archive: change single character in one of many text files (43 mb uncompressed) - filesize changes (case 9) - first change at offset 7,182,933 of 10,352,080 5. pdf file: no-op change of 800kb file - filesize changes (case 9) - first change at offset 47 of 854,971 6. Photoshop image (psd): change one pixel in the middle - filesize changes (case 9) - first change at offset 32 of 69,615,507 With this in mind, I think we can improve the current approach so that in would be better in all possible cases. We could do this: 1. Open pristine file 2. Read first 4-16kb from pristine and normalized working file 3. Compare them: if they are equal then close pristine file, calculate SHA1 of normalized working file and compare it with checksum in wc.db. This behavior would only apply for only for files
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On 05.09.2016 19:09, Stefan Hett wrote: > On 9/5/2016 6:23 PM, Ivan Zhakov wrote: >> With all above the new behavior should be working better or the same >> in all cases. I agree that 50% approximation may be incorrect for some >> specific binary formats (case 6) like sqlite db. > To be fair, I'd argue that in case of binary file modifications the > approximation is quite off. Most binary formats (if not all) in our > repository differ in the first couple of bytes (if they were changed) > and therefore it's quite a significant difference whether we read the > full file contents of a single file (which might be >100MB) or just > the first few bytes of two files. > > As Bert already suggested, I totally support the statement that it's > quite a common design pattern for binary formats to have some > checksum, time stamp, counter value, filesize record, etc. at the > beginning of the file contents which is likely to differ, if the file > has changed. If you then take the file sizes differences between text > files and binary files into account (aka: text files usually being > quite small, while binary files usually being quite large) it > certainly has the potential to matter quite much that there's a > difference expected for the binary file comparison case. > > FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k > block and another for the full file) sounds therefore as a reasonable > suggestion. > > Last but not least the throughput of calculating the SHA-1 is also > restricted by the I/O throughput in practice. For working directories > I'd assume it's not too unlikely to still reside on some HDD (rather > than some faster cache or an SSD) so it'd be limited to around 20 MB/s > in practice. Given large binary files this might pose a significant > difference in certain (not uncommon) use-cases. > > Don't get this wrong: IMHO I agree that the SHA-1 approach is superior > (especially on Windows machines since it will reduce the cases where > two files have to be opened - pointer: anti virus scanner impacts). I > just share Bert's opinion here that the approach should be a bit > improved especially in light of binary file support. > > If it would be of any help, I could do some performance measurements > with the two approaches on our repository to get some real world > numbers to work with. We discussed the two approaches to death years ago and decided to keep the behaviour prior to r1759233, partly because we had no evidence one way or the other. Changing the behaviour without having performance measurements in hand doesn't seem like a really smart move to me ... so yes, every little bit of extra data would surely help. BTW, I strongly suggest not using the Subversion tree as a test case, it's mostly small text files and so not very representative. -- Brane
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On 9/5/2016 6:23 PM, Ivan Zhakov wrote: With all above the new behavior should be working better or the same in all cases. I agree that 50% approximation may be incorrect for some specific binary formats (case 6) like sqlite db. To be fair, I'd argue that in case of binary file modifications the approximation is quite off. Most binary formats (if not all) in our repository differ in the first couple of bytes (if they were changed) and therefore it's quite a significant difference whether we read the full file contents of a single file (which might be >100MB) or just the first few bytes of two files. As Bert already suggested, I totally support the statement that it's quite a common design pattern for binary formats to have some checksum, time stamp, counter value, filesize record, etc. at the beginning of the file contents which is likely to differ, if the file has changed. If you then take the file sizes differences between text files and binary files into account (aka: text files usually being quite small, while binary files usually being quite large) it certainly has the potential to matter quite much that there's a difference expected for the binary file comparison case. FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k block and another for the full file) sounds therefore as a reasonable suggestion. Last but not least the throughput of calculating the SHA-1 is also restricted by the I/O throughput in practice. For working directories I'd assume it's not too unlikely to still reside on some HDD (rather than some faster cache or an SSD) so it'd be limited to around 20 MB/s in practice. Given large binary files this might pose a significant difference in certain (not uncommon) use-cases. Don't get this wrong: IMHO I agree that the SHA-1 approach is superior (especially on Windows machines since it will reduce the cases where two files have to be opened - pointer: anti virus scanner impacts). I just share Bert's opinion here that the approach should be a bit improved especially in light of binary file support. If it would be of any help, I could do some performance measurements with the two approaches on our repository to get some real world numbers to work with. -- Regards, Stefan Hett
Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
On 5 September 2016 at 14:46, Bert Huijbenwrote: >> -Original Message- >> From: i...@apache.org [mailto:i...@apache.org] >> Sent: maandag 5 september 2016 13:33 >> To: comm...@subversion.apache.org >> Subject: svn commit: r1759233 - >> /subversion/trunk/subversion/libsvn_wc/questions.c >> >> Author: ivan >> Date: Mon Sep 5 11:32:54 2016 >> New Revision: 1759233 >> >> URL: http://svn.apache.org/viewvc?rev=1759233=rev >> Log: >> Use SHA-1 checksum to find whether files are actually modified in working >> copy if timestamps don't match. >> >> Before this change we were doing this: >> 1. Compare file timestamps: if they match, assume that files didn't change. >> 2. Open pristine file. >> 3. Read properties from wc.db and find whether translation is required. >> 4. Compare filesize with pristine filesize for files that do not >>require translation. Assume that file is modified if the sizes differ. >> 5. Compare detranslated contents of working file with pristine. >> >> Now behavior is the following: >> 1. Compare file timestamps: if they match, assume that files didn't change. >> 3. Read properties from wc.db and find whether translation is required. >> 3. Compare filesize with pristine filesize for files that do not >>require translation. Assume that file is modified if the sizes differ. >> 4. Calculate SHA-1 checksum of detranslated contents of working file >>and compare it with pristine's checksum stored in wc.db. > Hi Bert, > We looked at this before, and this change has pro-s and con-s, depending on > specific use cases. > Thanks for bringing this to dev@ list, I was not aware that this topic was discussed before. > With the compare to SHA we only have to read the new file, but we > always have to read the file 100%. > > With the older system we could bail on the first detected change. > I considered this trade off. See below. > If there is a change somewhere both systems read on average > 100% of the filesize... only if there is no actual change except > for the timestamp, the new system is less expensive. > As far I understand the average characteristics are: 1. Files are equal: a) old behavior: 100% read of working file + 100% read of pristine file = 200% of working file size. b) new behavior: 100% read of working file = 100% of working file size. 2. Files modified (but has the same size or require translation (!)): a) old behavior: 50% (average) read of working file + 50% (average) read of pristine file = 100% of working file size. b) new behavior: 100% read of working file = 100% of working file size. (Strictly speaking, average read size would also depend on the number of modifications, and it could be less than 50%.) Also libsvn_wc checks working file size for files that doesn't require translation, before comparing contents. And keyword expansion/newline translation doesn't make sense for binary files (like database, pdf, docx). And for most binary files format modification involves changing its size. (There were problem in old behavior because pristine file was opened *before* comparing working file size. Fixing that would require additional SQLite operation.) > If the file happens to be a database file or something similar > there is quite commonly a change in the first 'block', when > there are changes somewhere later on. (Checksum, change > counter, etc.). File formats like sqlite were explicitly designed > for this (and other cheap checks), with a change counter at the start. > I don't think we should 'just change behavior' here, if we don't > have actual usage numbers for our users. Perhaps we should make > this feature configurable... or depending on filesize. > Let me summarize all possible cases that I considered before my change. First of all some definitions: * Text file (T) -- text file that require translation, due to eol style or keywords expansion * Text file (N) -- text file that doesn't require translation * Binary file -- some kind of binary file (database, pdf, zip, docx). Let's assume that user doesn't configure svn:eol-style and svn:keywords for them. * WS -- size of working file * PS -- size of pristine file * Old=xxx -- average required read size for old behavior in terms of working and pristine file sizes * New=xxx -- average required read size for new behavior in terms of working and pristine file sizes 1. Text file (T), not modified: Old = WS + PS, New = WS 2. Text file (N), not modified: Old = WS + PS, New = WS 3. Binary file, not modified: Old = WS + PS, New = WS 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS, New = WS 8. Text file (N), modified, different size: Old = 0, New = 0 9. Binary file, modified, different size: Old = 0, New = 0 (There is some
RE: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
Hi, From: Bert Huijben [mailto:b...@qqmail.nl] > From: i...@apache.org [mailto:i...@apache.org] > > Sent: maandag 5 september 2016 13:33 > > To: comm...@subversion.apache.org > > Subject: svn commit: r1759233 - > > /subversion/trunk/subversion/libsvn_wc/questions.c > > > > Author: ivan > > Date: Mon Sep 5 11:32:54 2016 > > New Revision: 1759233 > > > > URL: http://svn.apache.org/viewvc?rev=1759233=rev > > Log: > > Use SHA-1 checksum to find whether files are actually modified in > > working copy if timestamps don't match. > > > > Before this change we were doing this: > > 1. Compare file timestamps: if they match, assume that files didn't change. > > 2. Open pristine file. > > 3. Read properties from wc.db and find whether translation is required. > > 4. Compare filesize with pristine filesize for files that do not > >require translation. Assume that file is modified if the sizes differ. > > 5. Compare detranslated contents of working file with pristine. > > > > Now behavior is the following: > > 1. Compare file timestamps: if they match, assume that files didn't change. > > 3. Read properties from wc.db and find whether translation is required. > > 3. Compare filesize with pristine filesize for files that do not > >require translation. Assume that file is modified if the sizes differ. > > 4. Calculate SHA-1 checksum of detranslated contents of working file > >and compare it with pristine's checksum stored in wc.db. > > We looked at this before, and this change has pro-s and con-s, depending on > specific use cases. > > With the compare to SHA we only have to read the new file, but we always have > to read the file 100%. > > With the older system we could bail on the first detected change. > > If there is a change somewhere both systems read on average 100% of the > filesize... only if there is no actual change except for the timestamp, the > new system is less expensive.> > > If the file happens to be a database file or something similar there is quite > commonly a change in the first 'block', when there are changes somewhere > later on. (Checksum, change counter, etc.). File formats like sqlite were > explicitly designed for this (and other cheap checks), with a change counter > at the start. Maybe we could cache a checksum of the first block (4k) of each file in the wc.db, to profit from those changes. Thus, we only need to read the whole file if the first block checksum is equal. > I don't think we should 'just change behavior' here, if we don't have actual > usage numbers for our users. Perhaps we should make this feature > configurable... or depending on filesize. > > We certainly want the new behavior for non-pristine working copies (on the > IDEA list for years), but I'm not sure if we always want this behavior as > only option. > > This mail is partially, to just discuss this topic on the list, to make sure > everybody knows what happened here and why. > > > > Bert > > (Note that it is labor day in the USA today... so I don't expect many > responses until later this week) Best regards Markus Schaber CODESYS® a trademark of 3S-Smart Software Solutions GmbH Inspiring Automation Solutions 3S-Smart Software Solutions GmbH Dipl.-Inf. Markus Schaber | Product Development Core Technology Memminger Str. 151 | 87439 Kempten | Germany Tel. +49-831-54031-979 | Fax +49-831-54031-50 E-Mail: m.scha...@codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com CODESYS forum: http://forum.codesys.com Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915 This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure or distribution of the material in this e-mail is strictly forbidden.
Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)
> -Original Message- > From: i...@apache.org [mailto:i...@apache.org] > Sent: maandag 5 september 2016 13:33 > To: comm...@subversion.apache.org > Subject: svn commit: r1759233 - > /subversion/trunk/subversion/libsvn_wc/questions.c > > Author: ivan > Date: Mon Sep 5 11:32:54 2016 > New Revision: 1759233 > > URL: http://svn.apache.org/viewvc?rev=1759233=rev > Log: > Use SHA-1 checksum to find whether files are actually modified in working > copy if timestamps don't match. > > Before this change we were doing this: > 1. Compare file timestamps: if they match, assume that files didn't change. > 2. Open pristine file. > 3. Read properties from wc.db and find whether translation is required. > 4. Compare filesize with pristine filesize for files that do not >require translation. Assume that file is modified if the sizes differ. > 5. Compare detranslated contents of working file with pristine. > > Now behavior is the following: > 1. Compare file timestamps: if they match, assume that files didn't change. > 3. Read properties from wc.db and find whether translation is required. > 3. Compare filesize with pristine filesize for files that do not >require translation. Assume that file is modified if the sizes differ. > 4. Calculate SHA-1 checksum of detranslated contents of working file >and compare it with pristine's checksum stored in wc.db. We looked at this before, and this change has pro-s and con-s, depending on specific use cases. With the compare to SHA we only have to read the new file, but we always have to read the file 100%. With the older system we could bail on the first detected change. If there is a change somewhere both systems read on average 100% of the filesize... only if there is no actual change except for the timestamp, the new system is less expensive. If the file happens to be a database file or something similar there is quite commonly a change in the first 'block', when there are changes somewhere later on. (Checksum, change counter, etc.). File formats like sqlite were explicitly designed for this (and other cheap checks), with a change counter at the start. I don't think we should 'just change behavior' here, if we don't have actual usage numbers for our users. Perhaps we should make this feature configurable... or depending on filesize. We certainly want the new behavior for non-pristine working copies (on the IDEA list for years), but I'm not sure if we always want this behavior as only option. This mail is partially, to just discuss this topic on the list, to make sure everybody knows what happened here and why. Bert (Note that it is labor day in the USA today... so I don't expect many responses until later this week)