Re: [fossil-users] Cloning repository with large files very slow
On 06/21/2018 10:00 PM, E Cruz wrote:
> On 06/21/2018 08:38 PM, Richard Hipp wrote:
>> Please rebuild your Fossil using the latest trunk check-in, then try your clone using the new --nocompress option. Report back whether or not this solves your problem.
>
> Using the new option cuts the cloning time of one of the repositories from 8 minutes down to 4 minutes. For another repository it went from 2 minutes down to 1 minute, so "--nocompress" is basically cutting the time in half. Most of the remaining time is in the "rebuilding repository meta-data" phase, but this change is a significant improvement already. Thanks

On 06/22/2018 09:57 PM, E Cruz wrote:
> Delta encoding is still taking a significant amount of time, in particular the call to content_deltify() by add_one_mlink() in manifest.c. Commenting out this call to content_deltify() allows cloning my smaller repository with "--nocompress" to go from 1 minute down to 4 seconds. I am not familiar enough with fossil to know all the implications of commenting this call out, but the resulting cloned repository seems to be fine. If the noCompress flag could be propagated down so that this particular call to content_deltify() is skipped when cloning with the flag enabled, the clone operation time could be reduced to a small fraction of what it is now.

Based on my previous findings, I would like to propose a way to reduce the cloning time of repositories that contain large files with very large "deltas" between revisions. The change involves saving the "--nocompress" flag in fossil's global state and using it to skip the call to content_deltify() from add_one_mlink() when the flag is set.

I do not yet understand all the internals of fossil, so I have checked the proposed changes by cloning fossil's repository with and without --nocompress, then comparing the outputs of "fossil export --git". The outputs from both clones were identical. I also checked individual tables in the two clones, and the only differences found were:

1. the timestamp of the last-sync-url entry in the config table
2. the timestamps for the cluster entries in the tagxref table
3. the pw field in the user table

My understanding is that these differences are expected to be present between clones. The way I implemented the proposed changes is shown in the included patch file. Could you take a look to see whether they have any unintended side effects? Thanks.

Index: src/clone.c
==================================================================
--- src/clone.c
+++ src/clone.c
@@ -128,10 +128,12 @@
   const char *zHttpAuth;          /* HTTP Authorization user:pass information */
   int nErr = 0;
   int urlFlags = URL_PROMPT_PW | URL_REMEMBER;
   int syncFlags = SYNC_CLONE;
   int noCompress = find_option("nocompress",0,0)!=0;
+
+  g.noCompressClone = noCompress;
 
   /* Also clone private branches */
   if( find_option("private",0,0)!=0 ) syncFlags |= SYNC_PRIVATE;
   if( find_option("once",0,0)!=0) urlFlags &= ~URL_REMEMBER;
   if( find_option("verbose","v",0)!=0) syncFlags |= SYNC_VERBOSE;

Index: src/main.c
==================================================================
--- src/main.c
+++ src/main.c
@@ -283,10 +283,11 @@
     } reqPayload;          /* request payload object (if any) */
     cson_array *warnings;  /* response warnings */
     int timerId;           /* fetched from fossil_timer_start() */
   } json;
 #endif /* FOSSIL_ENABLE_JSON */
+  int noCompressClone;     /* True if cloning with --nocompress */
 };

 /*
 ** Macro for debugging:
 */

Index: src/manifest.c
==================================================================
--- src/manifest.c
+++ src/manifest.c
@@ -1245,11 +1245,11 @@
     db_bind_int(&s1, ":pfn", pfnid);
     db_bind_int(&s1, ":mp", mperm);
     db_bind_int(&s1, ":isaux", isPrimary==0);
     db_exec(&s1);
   }
-  if( pid && fid ){
+  if( !g.noCompressClone && pid && fid ){
     content_deltify(pid, &fid, 1, 0);
   }
 }

 /*

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
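Fossil repositories are ordinary SQLite database files, so the table-by-table comparison of two clones described above can be scripted. Here is a minimal sketch; the helper name `diff_tables` and the toy schema are illustrative assumptions, not part of the proposed patch or of fossil's actual schema:

```python
import os
import sqlite3
import tempfile

def diff_tables(db_a, db_b):
    """Return names of tables present in both SQLite databases
    whose full contents differ."""
    ca, cb = sqlite3.connect(db_a), sqlite3.connect(db_b)
    def tables(c):
        return {r[0] for r in c.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")}
    def rows(c, t):
        return sorted(c.execute('SELECT * FROM "%s"' % t).fetchall())
    differing = [t for t in sorted(tables(ca) & tables(cb))
                 if rows(ca, t) != rows(cb, t)]
    ca.close()
    cb.close()
    return differing

# Tiny demonstration with two fake "clones": identical except for user.pw.
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, n) for n in ("a.fossil", "b.fossil")]
for path, pw in zip(paths, ("hash-one", "hash-two")):
    c = sqlite3.connect(path)
    c.execute("CREATE TABLE blob(content TEXT)")
    c.execute("INSERT INTO blob VALUES('shared artifact')")
    c.execute("CREATE TABLE user(login TEXT, pw TEXT)")
    c.execute("INSERT INTO user VALUES('ec', ?)", (pw,))
    c.commit()
    c.close()
print(diff_tables(*paths))  # ['user']
```

Run against two real clones, the expectation per the findings above would be that only the config, tagxref, and user tables show up.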
Re: [fossil-users] Cloning repository with large files very slow
Delta encoding is still taking a significant amount of time, in particular the call to content_deltify() by add_one_mlink() in manifest.c. Commenting out this call to content_deltify() allows cloning my smaller repository with "--nocompress" to go from 1 minute down to 4 seconds. I am not familiar enough with fossil to know all the implications of commenting this call out, but the resulting cloned repository seems to be fine. If the noCompress flag could be propagated down so that this particular call to content_deltify() is skipped when cloning with the flag enabled, the clone operation time could be reduced to a small fraction of what it is now.
Re: [fossil-users] Cloning repository with large files very slow
On 06/21/2018 08:38 PM, Jungle Boogie wrote:
> Just curious, how often are you cloning and opening the repo? With git, I've had it take a little while to expand the repo.

After cloning, opening the repository only takes a few seconds. Updating to any revision where the large file has changed also takes only a few seconds.

> For a comparison, clone the sqlite repo and see if it's quicker/slower/about the same as your repo clone. I believe the sqlite repo is larger than yours.

Cloning sqlite took 55 seconds, and the repository file is about 5MB larger than the one I am having problems with.
Re: [fossil-users] Cloning repository with large files very slow
On 06/21/2018 08:38 PM, Richard Hipp wrote:
> Please rebuild your Fossil using the latest trunk check-in, then try your clone using the new --nocompress option. Report back whether or not this solves your problem.

Using the new option cuts the cloning time of one of the repositories from 8 minutes down to 4 minutes. For another repository it went from 2 minutes down to 1 minute, so "--nocompress" is basically cutting the time in half. Most of the remaining time is in the "rebuilding repository meta-data" phase, but this change is a significant improvement already. Thanks!
Re: [fossil-users] Cloning repository with large files very slow
On 6/21/18, E Cruz wrote:
> Is there a way to prevent fossil from re-applying delta encoding when cloning?

Please rebuild your Fossil using the latest trunk check-in, then try your clone using the new --nocompress option. Report back whether or not this solves your problem.

-- 
D. Richard Hipp
d...@sqlite.org
Re: [fossil-users] Cloning repository with large files very slow
On Thu 21 Jun 2018 7:47 PM, E Cruz wrote:
> On 06/21/2018 05:06 PM, Warren Young wrote:
>
> As mentioned in the original post, the majority of the time taken by the clone operation seems to be spent re-calculating the delta encoding of the large table definition files. I do not mind much the time it takes to commit changes to the large tables, although if that can be improved it would be welcomed. But once that is done, we have to pay for the delta encoding on every future clone operation. That is the part I would like to avoid if possible.

Just curious, how often are you cloning and opening the repo? With git, I've had it take a little while to expand the repo.

For a comparison, clone the sqlite repo and see if it's quicker/slower/about the same as your repo clone. I believe the sqlite repo is larger than yours.
Re: [fossil-users] Cloning repository with large files very slow
On 06/21/2018 05:06 PM, Warren Young wrote:
> Are the differences merely at the binary level or is the semantic content also changing?

Thanks for your reply. The files are not binary. They are C source files that define large arrays of floating point values. These arrays are rarely updated, but when they are, most of the values in the array change.

> Try unsetting repo-cksum on this repository, if it's enabled:
> https://fossil-scm.org/index.html/help?cmd=repo-cksum

The total size of the fossil repository file is about 65MB, not a huge repository. Following your suggestion, I checked the repo-cksum setting and it was not set. I still tried cloning after running "fossil unset repo-cksum", "fossil setting repo-cksum 0", and "fossil setting repo-cksum 1" on the source repository, and all cases took about the same time to complete the clone. That seems to be in line with the documentation you pointed to, which mentions that the setting applies to checkouts; there is no mention of cloning.

As mentioned in the original post, the majority of the time taken by the clone operation seems to be spent re-calculating the delta encoding of the large table definition files. I do not mind much the time it takes to commit changes to the large tables, although if that can be improved it would be welcomed. But once that is done, we have to pay for the delta encoding on every future clone operation. That is the part I would like to avoid if possible.
Re: [fossil-users] Cloning repository with large files very slow
On Jun 21, 2018, at 1:25 PM, E Cruz wrote:
> some of the source files define a few very large tables. These tables do not change often, but when they do most of their content is replaced with something completely different from the previous version.

Are the differences merely at the binary level or is the semantic content also changing? For instance, these two commands give entirely different output, but with identical semantic content:

    $ echo "Hello, world!" | bzip2 | od -t x1
    $ echo "Hello, world!" | lz4 | od -c

The point is, there may be an encoding for your data that reduces the size of the diffs to include only the semantic differences.

As an example, storing two different versions of a PNG in Fossil is probably less efficient than storing the same logical image data in Windows BMP form plus a conversion command to PNG during the build process: the BMP is uncompressed and has very little metadata, so only the actual pixel differences are stored in Fossil. (If you doubt this, I actually tested it and reported on the differences some years ago here.) Many binary data formats have this same property. They're optimized for the size of individual files on disk, not for the total size of a delta-and-gzip-compressed version control repository. Yet another example is all of the binary data formats based on ZIP: ODF, JAR, APK… They'd store more efficiently in Fossil if unpacked into a tree before being checked in between changes.

> When changes to these files are committed, fossil takes a long time to process the commit (a couple of minutes for a 20MB table, over 10min for a 60MB table).

That's superlinear, which is bad. Try unsetting repo-cksum on this repository, if it's enabled:

    https://fossil-scm.org/index.html/help?cmd=repo-cksum

With that unset, you become responsible for local file integrity. Some of the more modern filesystems obviate the need for manual integrity checks or secondary checks like the one Fossil does: ZFS, APFS, etc.
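The "identical semantic content, entirely different bytes" point above can be reproduced with Python's standard library, using zlib and bz2 as stand-ins for the lz4/bzip2 pair in the shell commands:

```python
import bz2
import zlib

data = b"Hello, world!\n"

# Two different lossless encodings of the same payload.
z = zlib.compress(data)
b = bz2.compress(data)

# The encoded byte streams look nothing alike...
print(z.hex())
print(b.hex())

# ...yet both decode to identical semantic content.
assert zlib.decompress(z) == bz2.decompress(b) == data
```

This is also why a delta between two compressed files tends to be nearly as large as the files themselves: the compressor spreads even a small semantic change across the rest of the encoded stream, which is the motivation for storing data in its uncompressed form when delta efficiency matters.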
[fossil-users] Cloning repository with large files very slow
I am testing fossil with a repository where some of the source files define a few very large tables. These tables do not change often, but when they do, most of their content is replaced with something completely different from the previous version. When changes to these files are committed, fossil takes a long time to process the commit (a couple of minutes for a 20MB table, over 10 minutes for a 60MB table). It would be nice if fossil could handle these commits much faster, but that in itself is not a big problem for us because the tables rarely change.

What is more problematic is that once a change to the tables has been committed, cloning the repository also takes a very long time. It seems the fossil clone command is attempting to re-apply delta encoding to all files in the repository, and that is causing the slowdown. Is re-encoding necessary when performing a clone operation? If it is not necessary, is there a way to prevent fossil from re-applying delta encoding when cloning?

-- 
Edgardo M. Cruz | edgardo.c...@genkey.com
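The asymmetry described in this post — delta encoding paying off for small edits but buying nothing when most of a file is rewritten — can be illustrated with a rough stand-in for a delta encoder. This sketch uses difflib, not fossil's actual delta algorithm, and the name `delta_cost` is made up for illustration:

```python
from difflib import SequenceMatcher

def delta_cost(old: str, new: str) -> int:
    """Bytes of `new` with no match in `old`, i.e. bytes a delta
    must store verbatim (a stand-in metric, not fossil's format)."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return sum(j2 - j1 for op, _, _, j1, j2 in sm.get_opcodes()
               if op in ("insert", "replace"))

base     = "abcd" * 250     # a 1000-byte "table definition" file
tweaked  = base[:-1] + "X"  # one character changed
replaced = "wxyz" * 250     # content replaced wholesale

print(delta_cost(base, tweaked))   # 1: a small edit deltas well
print(delta_cost(base, replaced))  # 1000: the delta is the whole file
```

When successive versions share almost nothing, the time spent searching for matches is pure overhead, which is consistent with the clone slowdown reported above.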