Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
On Oct 30, 2016, at 1:57 AM, Karel Gardas wrote:
>
> On Fri, Oct 28, 2016 at 7:02 PM, Warren Young wrote:
>> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
>>>
>>> make it more scale-able and allow its real usage also for projects of
>>> bigger size.
>>
>> How many projects are there bigger than SQLite, percentage-wise?
>
> And does it really matter?

Sure it does. Fossil is fast enough for SQLite, so if SQLite is “very large” compared to most other projects that could usefully use it, then speeding up Fossil amounts to spending effort on a tiny minority of users.

All of that is predicated on that first “if,” however.

>>> $ time /opt/fossil-head/bin/fossil clone
>>> http://netbsd.sonnenberger.org/ netbsd-src.fossil
>>>
>>> It takes:
>>>
>>> real    323m2.323s
>>> user    42m0.262s
>>> sys     13m18.003s
>>
>> Okay, but compared to what?
>
> For example Git, on the same source tree:
>
> $ time git clone https://github.com/jsonn/src.git
> Cloning into 'src'...
> remote: Counting objects: 3725278, done.
> remote: Compressing objects: 100% (111/111), done.
> remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
> Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
> Resolving deltas: 100% (2782525/2782525), done.
> Checking connectivity... done.
> Checking out files: 100% (176388/176388), done.
>
> real    55m20.926s
> user    9m30.362s
> sys     4m50.320s

So given your other report, that rebuild takes 250 minutes of that time, then Fossil is within about 25% of the speed of Git, if you don’t rebuild.

>>> takes around 250 minutes on the same hardware and with the same
>>> fossil.
>>
>> Would a --skip-rebuild option for fossil clone solve your major problem,
>> then?
>
> There is no such option in current fossil. Some commands (import)
> supports --no-rebuild, but clone is not among them.

I didn’t tell you to use that option, I asked if you would like that option to exist.

>> Rebuilding is strictly optional. It just makes Fossil operations run faster
>> post-clone.
>
> That's news to me, I've thought rebuild is strictly necessary to have
> other fossil commands working

Nope. The only reason Fossil rebuilds by default is that the clone operation results in a sub-optimal DB, because each cloned artifact is checked into the new DB separately. You end up with a series of incremental states, none of which are equal to the final DB state once the clone is finished.

Rebuilding forces the SQLite instance inside Fossil to take a new look at all the cloned artifacts as a whole and optimize the DB for that completed post-clone state, rather than the series of incremental states that exist at each point during the clone.

> repo chksuming switching off as suggested by Nikita Borodikhin helps
> here a lot. The question is if to leave it to file-system or to fossil
> itself.

If your filesystem has strong data checksumming (as opposed to just metadata checksumming) then I see no reason to leave repo-cksum turned on.

Keep in mind that the vast majority of filesystems in common use do *not* have strong data checksumming, so letting repo-cksum default on is a good idea.

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
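The "within about 25%" comparison in the message above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original thread: it takes the wall-clock figures quoted above and assumes that the roughly 250 minute rebuild time (which the original report measured on the OpenBSD tree) also applies to this NetBSD clone. The helper name `time_to_seconds` is made up for the illustration.

```python
import re


def time_to_seconds(stamp):
    """Convert a `time`-style value such as '323m2.323s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ("real") figures quoted in the thread.
fossil_clone = time_to_seconds("323m2.323s")  # fossil clone, rebuild included
git_clone = time_to_seconds("55m20.926s")     # git clone of the same tree

# Assumption: the ~250 minute rebuild figure, reported for the OpenBSD tree,
# is also representative of the rebuild inside this NetBSD clone.
rebuild = 250 * 60

fossil_without_rebuild = fossil_clone - rebuild

print(f"fossil clone minus rebuild: {fossil_without_rebuild / 60:.1f} min")
print(f"git clone:                  {git_clone / 60:.1f} min")
print(f"git time / fossil time:     {git_clone / fossil_without_rebuild:.2f}")
print(f"fossil time / git time:     {fossil_without_rebuild / git_clone:.2f}")
```

Under that assumption the non-rebuild part of the Fossil clone comes to roughly 73 minutes against roughly 55 minutes for Git, so Git needs about three quarters of the time. Both figures include network transfer from different servers, so the comparison is only indicative.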
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
On Fri, Oct 28, 2016 at 7:02 PM, Warren Young wrote:
> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
>>
>> make it more scale-able and allow its real usage also for projects of
>> bigger size.
>
> How many projects are there bigger than SQLite, percentage-wise?

And does it really matter? The question was just a question: is anybody working on making Fossil scale better? I thought Fossil was just another open-source DVCS where people are free to hack on their ideas, at least to test some new directions.

>> $ time /opt/fossil-head/bin/fossil clone
>> http://netbsd.sonnenberger.org/ netbsd-src.fossil
>>
>> It takes:
>>
>> real    323m2.323s
>> user    42m0.262s
>> sys     13m18.003s
>
> Okay, but compared to what?

For example Git, on the same source tree:

$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.

real    55m20.926s
user    9m30.362s
sys     4m50.320s

>> rebuild alone
>> takes around 250 minutes on the same hardware and with the same
>> fossil.
>
> Would a --skip-rebuild option for fossil clone solve your major problem, then?

There is no such option in current Fossil. Some commands (import) support --no-rebuild, but clone is not among them.

> Rebuilding is strictly optional. It just makes Fossil operations run faster
> post-clone.

That's news to me. I thought a rebuild was strictly necessary for the other fossil commands to work. If not, hmm, letting the rebuild run overnight may indeed be an option.

> We’ve already discussed shallow clones, which would make Fossil more like CVS
> in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.

Indeed, I've seen this, and it is very much appreciated. But originally I was thinking more along the lines of optimizing the rebuild implementation in current Fossil, without needing to wait for a future Fossil version. I was tempted to think in this direction by seeing the excessive amount of data Fossil writes on rebuild relative to the resulting repo size.

>> - commit, this is a little bit harder. One file modified and commit takes:
>> real    4m0.765s
>> user    1m55.442s
>> sys     1m11.892s
>
> That seems like a much more important problem to solve. 4 minutes per commit
> is simply *painful*, and it may happen multiple times per day, rather than
> once per development box.
>
> Here, I occasionally see commit times of 10 seconds or so, and that’s painful
> enough already.

Switching off repo checksumming, as suggested by Nikita Borodikhin, helps a lot here. The question is whether to leave the checking to the file system or to Fossil itself.

Thanks,
Karel
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
Hi Karel,

As I understand this option, it is indeed for extra integrity checking. It checks all the files in your checkout, not only those involved in the commit, which makes it possible to detect data corruption on the file system. As other SCMs (svn and git) don't do that check, I think it is safe to leave it off.

Nikita.

On Sat, Oct 29, 2016, 13:34 Karel Gardas wrote:
> Hi Nikita,
>
> your advice indeed helped a lot and brings commit to 20 seconds here.
> Now, the question is if I may really leave file integrity to
> file-system or if even ZFS/Btrfs is not enough here and fossil does
> some other magic?
>
> Thanks!
> Karel
>
> On Fri, Oct 28, 2016 at 7:33 PM, Nikita Borodikhin wrote:
> > Hi Karel,
> >
> > I have quite a big repository (3.4G) imported from svn by a custom tool. It
> > also took several minutes to commit, and most of the time was spent in md5
> > hash computation. It is extra precaution to ensure checkout file integrity,
> > which can be turned off with repo-cksum setting.
> >
> > With that setting off, it takes 4 to 6 second to commit.
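The message above describes the repo-cksum check as hashing every file in the checkout, not only the files touched by the commit. The sketch below illustrates why such a check costs time proportional to the size of the whole tree. It is an illustration only, assuming a simple "MD5 over path names and contents in sorted order" scheme; it is not Fossil's actual checksum format, and `tree_md5` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def tree_md5(root):
    """Return (hex digest, bytes read) over every file under root.

    Illustration only: this is NOT Fossil's actual checksum format.
    """
    digest = hashlib.md5()
    bytes_read = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is stable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8"))
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
                    bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("a.c", b"x" * 1000), ("b.c", b"y" * 2000)]:
            with open(os.path.join(root, name), "wb") as handle:
                handle.write(data)
        before, cost = tree_md5(root)
        # Change one byte in one file: the whole tree is still re-read.
        with open(os.path.join(root, "a.c"), "r+b") as handle:
            handle.write(b"z")
        after, cost_again = tree_md5(root)
        print(before != after, cost, cost_again)
```

A one-byte change in a single file changes the digest, but detecting it this way means reading all 3000 bytes of the toy tree again. On a 2.7 GB checkout the same pattern means reading 2.7 GB per commit, which is consistent with the multi-minute commit times reported in the thread.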
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
Hi Nikita,

your advice indeed helped a lot and brings commit down to 20 seconds here. Now, the question is whether I can really leave file integrity to the file system, or whether even ZFS/Btrfs is not enough here and Fossil does some other magic?

Thanks!
Karel

On Fri, Oct 28, 2016 at 7:33 PM, Nikita Borodikhin wrote:
> Hi Karel,
>
> I have quite a big repository (3.4G) imported from svn by a custom tool. It
> also took several minutes to commit, and most of the time was spent in md5
> hash computation. It is extra precaution to ensure checkout file integrity,
> which can be turned off with repo-cksum setting.
>
> With that setting off, it takes 4 to 6 second to commit.
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
Hi Karel,

I have quite a big repository (3.4G) imported from svn by a custom tool. It also took several minutes to commit, and most of the time was spent in MD5 hash computation. It is an extra precaution to ensure checkout file integrity, which can be turned off with the repo-cksum setting.

With that setting off, it takes 4 to 6 seconds to commit.

My hardware is ext4 on a Samsung 850 Pro 512 SSD, i7-3770.

Nikita

On Fri, Oct 28, 2016, 10:03 Warren Young wrote:
> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
> >
> > make it more scale-able and allow its real usage also for projects of
> > bigger size.
>
> How many projects are there bigger than SQLite, percentage-wise?
>
> Has anyone done something like produce a SLOC histogram for all projects
> on GitHub or Sourceforge, so that we can say something like, “SQLite is in
> the top 2nd percentile for open source C projects based on SLOCCount’s line
> counting algorithm”?
>
> I’m intrigued enough to want to do the project, but I don’t think I really
> want to clone the entirety of GitHub onto my HDD in order to find out, even
> if it’s just one project at a time. That sounds like a great way to blow
> through my Comcast data cap.
>
> > Let's talk about some real numbers to illustrate the situation.
>
> Yes, let’s. :)
>
> > $ time /opt/fossil-head/bin/fossil clone
> > http://netbsd.sonnenberger.org/ netbsd-src.fossil
> >
> > It takes:
> >
> > real    323m2.323s
> > user    42m0.262s
> > sys     13m18.003s
>
> Okay, but compared to what?
>
> If you compare to the checkout time from NetBSD’s main CVS repository, you
> aren’t comparing apples to oranges, since you’re transferring only the tip
> of the trunk. You have to go back to the CVS server for any history. I
> suspect if you checked out each CVS revision one at a time, it would take a
> lot longer than pulling the whole project history with Fossil.
>
> If you want to compare with some other DVCS, post those numbers.
>
> > rebuild alone
> > takes around 250 minutes on the same hardware and with the same
> > fossil.
>
> Would a --skip-rebuild option for fossil clone solve your major problem,
> then?
>
> Rebuilding is strictly optional. It just makes Fossil operations run
> faster post-clone.
>
> Also realize that cloning is a one-time activity per development machine,
> for anyone active enough in the project to maintain their local clone.
>
> A cute option would be if --skip-rebuild would look for a local at(1)
> command, then offer to schedule the rebuild for a later time, after you’ve
> left off work for the day.
>
> Although there may be casual clients who clone, do something with the
> source, throw the clone repo away when done, then clone again a year later
> when they need the source again, we should not optimize Fossil for that
> case.
>
> We’ve already discussed shallow clones, which would make Fossil more like
> CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie
> linked to.
>
> > - commit, this is a little bit harder. One file modified and commit takes:
> > real    4m0.765s
> > user    1m55.442s
> > sys     1m11.892s
>
> That seems like a much more important problem to solve. 4 minutes per
> commit is simply *painful*, and it may happen multiple times per day,
> rather than once per development box.
>
> Here, I occasionally see commit times of 10 seconds or so, and that’s
> painful enough already.
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
>
> make it more scale-able and allow its real usage also for projects of
> bigger size.

How many projects are there bigger than SQLite, percentage-wise?

Has anyone done something like produce a SLOC histogram for all projects on GitHub or Sourceforge, so that we can say something like, “SQLite is in the top 2nd percentile for open source C projects based on SLOCCount’s line counting algorithm”?

I’m intrigued enough to want to do the project, but I don’t think I really want to clone the entirety of GitHub onto my HDD in order to find out, even if it’s just one project at a time. That sounds like a great way to blow through my Comcast data cap.

> Let's talk about some real numbers to illustrate the situation.

Yes, let’s. :)

> $ time /opt/fossil-head/bin/fossil clone
> http://netbsd.sonnenberger.org/ netbsd-src.fossil
>
> It takes:
>
> real    323m2.323s
> user    42m0.262s
> sys     13m18.003s

Okay, but compared to what?

If you compare to the checkout time from NetBSD’s main CVS repository, you aren’t comparing apples to apples, since with CVS you’re transferring only the tip of the trunk. You have to go back to the CVS server for any history. I suspect if you checked out each CVS revision one at a time, it would take a lot longer than pulling the whole project history with Fossil.

If you want to compare with some other DVCS, post those numbers.

> rebuild alone
> takes around 250 minutes on the same hardware and with the same
> fossil.

Would a --skip-rebuild option for fossil clone solve your major problem, then?

Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.

Also realize that cloning is a one-time activity per development machine, for anyone active enough in the project to maintain their local clone.

A cute option would be if --skip-rebuild would look for a local at(1) command, then offer to schedule the rebuild for a later time, after you’ve left off work for the day.

Although there may be casual clients who clone, do something with the source, throw the clone repo away when done, then clone again a year later when they need the source again, we should not optimize Fossil for that case.

We’ve already discussed shallow clones, which would make Fossil more like CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.

> - commit, this is a little bit harder. One file modified and commit takes:
> real    4m0.765s
> user    1m55.442s
> sys     1m11.892s

That seems like a much more important problem to solve. 4 minutes per commit is simply *painful*, and it may happen multiple times per day, rather than once per development box.

Here, I occasionally see commit times of 10 seconds or so, and that’s painful enough already.
Re: [fossil-users] rebuild scale-ability/data written/repo size ratio
On 28 October 2016 at 02:45, Karel Gardas wrote:
> I'm just curious if there are people here tinkering with the idea to
> make it more scale-able and allow its real usage also for projects of
> bigger size.

There has been this discussion. I have an email with the subject of "Fossil 2.1: Scaling" from March 2015. The http://www.mail-archive.com/ archive doesn't go back that far, though.

Ah, just found it on marc:
http://marc.info/?l=fossil-users=144565850643439=2

--
---
inum: 883510009027723
sip: jungleboo...@sip2sip.info
[fossil-users] rebuild scale-ability/data written/repo size ratio
Hello,

first of all, I know that Fossil was written with the idea of serving the SQLite project and projects of similar size well, and that it does a great job at this task. I'm just curious if there are people here tinkering with the idea to make it more scale-able and allow its real usage also for projects of bigger size. Now that the git -> fossil (incremental) mirror functionality seems to be working, this may be even more interesting or tempting, IMHO.

Let's talk about some real numbers to illustrate the situation. Let's clone the NetBSD src tree, kindly provided by Jörg Sonnenberger, with the following command:

$ time /opt/fossil-head/bin/fossil clone http://netbsd.sonnenberger.org/ netbsd-src.fossil

It takes:

real    323m2.323s
user    42m0.262s
sys     13m18.003s

on my E5-2620 Sandy Bridge workstation. Of course, part of this time is perhaps spent on not-so-efficient network send/receive, but the majority of the time, at least as observed from the output of the command, is spent on the DB rebuild. I know that from the example of the OpenBSD src tree, which is comparable in size to NetBSD and where rebuild alone takes around 250 minutes on the same hardware and with the same fossil.

So much for the time spent on rebuild. What may be even more important is how much data the rebuild writes. Here I do not have exact numbers, but this is on my workstation, so I can see what's going on by keeping my eyes on the drive meters. Let's assume I'm not far off in claiming that the rebuild was writing data at ~40 MB/s for 2 or even more hours. In sum, this may be around 300 GB of data written during this rebuild (rounded up). This is for a repository whose final file size is:

$ ls -lha netbsd-src.fossil
-rw-r--r-- 1 karel karel 2.6G Oct 28 01:11 netbsd-src.fossil

and which results in a source tree of 2.7 GB.
Now, just to show that this rebuild may be the biggest scalability obstacle, I'd like to compare it with the open/status/diff/commit operations:

- open: results in 2.7 GB of data written to disk in the resulting NetBSD source tree. It takes:
real    4m38.843s
user    1m44.221s
sys     1m58.553s
IMHO a very nice result for a source tree of this size.

- status/diff -- one random file modified: both run for 4-5 seconds. Also very nice results for a source tree of this size.

- commit, this is a little bit harder. One file modified and commit takes:
real    4m0.765s
user    1m55.442s
sys     1m11.892s
IMHO not so nice, but still kind of acceptable even for development on a source tree of this size. But certainly commit may be another target for speedup hacking.

So that's it. Fossil used in those tests is:

This is fossil version 1.37 [0fa60142eb] 2016-10-26 21:45:52 UTC

and the tests were performed on a ZFS mirror of two SSDs (1TB Crucial MX200 and 1TB Samsung 850 Evo) on Solaris 11.2 running on an E5-2620 with 32GB RAM -- if anybody is interested in this info for verifying the numbers.

Cheers,
Karel
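The rebuild write estimate in this message (about 40 MB/s sustained for two hours or more, against a 2.6G repository file) can be turned into a rough ratio of data written to final repository size, which is the ratio the subject line refers to. This is a minimal sketch under stated assumptions: it uses the lower bound of exactly two hours, decimal units (1 GB = 1000 MB), and treats the `ls -lha` figure of 2.6G as 2.6 GB. The function name is hypothetical.

```python
def rebuild_write_estimate(rate_mb_per_s, hours, repo_size_gb):
    """Return (GB written, ratio of GB written to final repository size)."""
    # Decimal units are assumed here: 1 GB = 1000 MB.
    written_gb = rate_mb_per_s * hours * 3600 / 1000
    return written_gb, written_gb / repo_size_gb


# Figures from the message: ~40 MB/s for at least 2 hours, 2.6G final repo.
written, ratio = rebuild_write_estimate(rate_mb_per_s=40, hours=2, repo_size_gb=2.6)
print(f"{written:.0f} GB written, about {ratio:.0f}x the final repository size")
```

With these inputs the estimate is 288 GB, which matches the "around 300 GB (rounded up)" figure, and corresponds to writing on the order of a hundred times the size of the finished repository. Since the write rate was read off drive meters by eye, the result is an order-of-magnitude figure rather than a measurement.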