Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-31 Thread Warren Young
On Oct 30, 2016, at 1:57 AM, Karel Gardas wrote:
> 
> On Fri, Oct 28, 2016 at 7:02 PM, Warren Young wrote:
>> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
>>> 
>>> making it more scalable, so that it is really usable for bigger
>>> projects as well.
>> 
>> How many projects are there bigger than SQLite, percentage-wise?
> 
> And does it really matter?

Sure it does.  Fossil is fast enough for SQLite, so if SQLite is “very large” 
compared to most other projects that could usefully use it, then speeding up 
Fossil amounts to spending effort on a tiny minority of users.

All of that is predicated on that first “if,” however.

>>> $ time /opt/fossil-head/bin/fossil clone
>>> http://netbsd.sonnenberger.org/ netbsd-src.fossil
>>> 
>>> It takes:
>>> 
>>> real    323m2.323s
>>> user    42m0.262s
>>> sys     13m18.003s
>> 
>> Okay, but compared to what?
> 
> For example Git, on the same source tree:
> 
> $ time git clone https://github.com/jsonn/src.git
> Cloning into 'src'...
> remote: Counting objects: 3725278, done.
> remote: Compressing objects: 100% (111/111), done.
> remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
> Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
> Resolving deltas: 100% (2782525/2782525), done.
> Checking connectivity... done.
> Checking out files: 100% (176388/176388), done.
> 
> real    55m20.926s
> user    9m30.362s
> sys     4m50.320s

So, given your other report that the rebuild accounts for about 250 minutes of 
that time, Fossil is within about 25% of the speed of Git if you don’t rebuild.
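
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a rough check using the figures quoted in this thread; it assumes the rebuild phase of the NetBSD clone costs about the same 250 minutes reported for the similarly sized OpenBSD tree, which was not measured directly.

```python
# Figures quoted in this thread, in minutes (rounded).
fossil_clone_total = 323.0   # "real 323m2.323s" for fossil clone
fossil_rebuild = 250.0       # rebuild time reported for the OpenBSD tree
git_clone_total = 55.3       # "real 55m20.926s" for git clone

# Assumption: the rebuild phase of this clone also costs about 250 minutes.
fossil_without_rebuild = fossil_clone_total - fossil_rebuild

# Git's wall time as a share of Fossil's, and Fossil's overhead over Git.
git_share = git_clone_total / fossil_without_rebuild
fossil_overhead = fossil_without_rebuild / git_clone_total - 1.0

print(f"Fossil clone without rebuild: {fossil_without_rebuild:.0f} min")
print(f"Git takes {git_share:.0%} of that time")
print(f"Fossil takes {fossil_overhead:.0%} longer than Git")
```

Read one way, Git needs roughly three quarters of Fossil's non-rebuild time; read the other way, Fossil takes about a third longer. Both clones also ran over different networks and servers, so the comparison is indicative at best.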

>>> takes around 250 minutes on the same hardware and with the same
>>> fossil.
>> 
>> Would a --skip-rebuild option for fossil clone solve your major problem, 
>> then?
> 
> There is no such option in current fossil. Some commands (import)
> support --no-rebuild, but clone is not among them.

I didn’t tell you to use that option, I asked if you would like that option to 
exist.

>> Rebuilding is strictly optional.  It just makes Fossil operations run faster 
>> post-clone.
> 
> That's news to me. I thought rebuild was strictly necessary for
> other fossil commands to work

Nope.  The only reason Fossil rebuilds by default is that the clone operation 
results in a sub-optimal DB, because each cloned artifact is checked into the 
new DB separately.  You end up with a series of incremental states, none of 
which are equal to the final DB state once the clone is finished.

Rebuilding forces the SQLite instance inside Fossil to take a new look at all 
the cloned artifacts as a whole and optimize the DB for that completed 
post-clone state, rather than the series of incremental states that exist at 
each point during the clone.

> Switching off repo checksumming, as suggested by Nikita Borodikhin, helps
> a lot here. The question is whether to leave it to the file system or to
> fossil itself.

If your filesystem has strong data checksumming (as opposed to just metadata 
checksumming) then I see no reason to leave repo-cksum turned on.  Keep in mind 
that the vast majority of filesystems in common use do *not* have strong data 
checksumming, so letting repo-cksum default on is a good idea.
___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-30 Thread Karel Gardas
On Fri, Oct 28, 2016 at 7:02 PM, Warren Young wrote:
> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
>>
>> making it more scalable, so that it is really usable for bigger
>> projects as well.
>
> How many projects are there bigger than SQLite, percentage-wise?

And does it really matter? The question was just a question: is anybody
working on making fossil scale better? I thought fossil was just another
open-source DVCS where people are free to hack on their ideas, at least
to test some new directions.

>> $ time /opt/fossil-head/bin/fossil clone
>> http://netbsd.sonnenberger.org/ netbsd-src.fossil
>>
>> It takes:
>>
>> real    323m2.323s
>> user    42m0.262s
>> sys     13m18.003s
>
> Okay, but compared to what?

For example Git, on the same source tree:

$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.

real    55m20.926s
user    9m30.362s
sys     4m50.320s


>> rebuild alone
>> takes around 250 minutes on the same hardware and with the same
>> fossil.
>
> Would a --skip-rebuild option for fossil clone solve your major problem, then?

There is no such option in current fossil. Some commands (import)
support --no-rebuild, but clone is not among them.

> Rebuilding is strictly optional.  It just makes Fossil operations run faster 
> post-clone.

That's news to me. I thought rebuild was strictly necessary for
other fossil commands to work. If not, hmm, letting the rebuild run
overnight may indeed be an option.

> We’ve already discussed shallow clones, which would make Fossil more like CVS 
> in terms of clone size.  See the “Fossil 2.0” document Mr. Boogie linked to.

Indeed, I've seen this, and it is much appreciated. But originally I was
thinking along the lines of better optimizing the rebuild implementation
in current fossil, without needing to wait for a future fossil version.
I was tempted to think in this direction by seeing the excessive amount
of data fossil writes on rebuild compared to the resulting repo size.

>
>> - commit, this is a little bit harder. One file modified and commit takes:
>> real    4m0.765s
>> user    1m55.442s
>> sys     1m11.892s
>
> That seems like a much more important problem to solve.  4 minutes per commit 
> is simply *painful*, and it may happen multiple times per day, rather than 
> once per development box.
>
> Here, I occasionally see commit times of 10 seconds or so, and that’s painful 
> enough already.

Switching off repo checksumming, as suggested by Nikita Borodikhin, helps
a lot here. The question is whether to leave it to the file system or to
fossil itself.

Thanks,
Karel


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-29 Thread Nikita Borodikhin
Hi Karel,

as I understand this option, it is indeed for extra integrity checking.  It
checks all the files in your checkout, not only those involved in the
commit, and so allows you to find data corruption on the file system.

As other SCMs (svn and git) don't do that check, I think it is safe to
leave it off.
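
A minimal Python sketch of why a whole-checkout checksum costs time proportional to the size of the tree rather than the size of the commit. It is an illustration only, not Fossil's actual algorithm: the function name checkout_md5 and the choice to mix each file's relative path into the digest are assumptions made for this example.

```python
import hashlib
import os


def checkout_md5(root):
    """Hash every file under root, in sorted order, into one MD5 digest."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Mix in the relative path so that renames change the digest.
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks to keep memory use flat.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Every file is read in full on each call, so changing one file in a 2.7 GB tree still means reading and hashing 2.7 GB.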

Nikita.


On Sat, Oct 29, 2016, 13:34 Karel Gardas wrote:

> Hi Nikita,
>
> your advice indeed helped a lot and brings commit down to 20 seconds
> here. Now the question is whether I can really leave file integrity to
> the file system, or whether even ZFS/Btrfs is not enough here and fossil
> does some other magic?
>
> Thanks!
> Karel
>
> On Fri, Oct 28, 2016 at 7:33 PM, Nikita Borodikhin wrote:
> > Hi Karel,
> >
> > I have quite a big repository (3.4G) imported from svn by a custom
> > tool.  It also took several minutes to commit, and most of the time
> > was spent in md5 hash computation.  It is an extra precaution to
> > ensure checkout file integrity, which can be turned off with the
> > repo-cksum setting.
> >
> > With that setting off, it takes 4 to 6 seconds to commit.


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-29 Thread Karel Gardas
Hi Nikita,

your advice indeed helped a lot and brings commit down to 20 seconds
here. Now the question is whether I can really leave file integrity to
the file system, or whether even ZFS/Btrfs is not enough here and fossil
does some other magic?

Thanks!
Karel

On Fri, Oct 28, 2016 at 7:33 PM, Nikita Borodikhin wrote:
> Hi Karel,
>
> I have quite a big repository (3.4G) imported from svn by a custom tool.  It
> also took several minutes to commit, and most of the time was spent in md5
> hash computation.  It is an extra precaution to ensure checkout file
> integrity, which can be turned off with the repo-cksum setting.
>
> With that setting off, it takes 4 to 6 seconds to commit.


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-28 Thread Nikita Borodikhin
Hi Karel,

I have quite a big repository (3.4G) imported from svn by a custom tool.
It also took several minutes to commit, and most of the time was spent in
md5 hash computation.  It is an extra precaution to ensure checkout file
integrity, which can be turned off with the repo-cksum setting.

With that setting off, it takes 4 to 6 seconds to commit.

My hardware is a Samsung 850 Pro 512 GB SSD with ext4 and an i7-3770.

Nikita

On Fri, Oct 28, 2016, 10:03 Warren Young wrote:

> On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
> >
> > making it more scalable, so that it is really usable for bigger
> > projects as well.
>
> How many projects are there bigger than SQLite, percentage-wise?
>
> Has anyone done something like produce a SLOC histogram for all projects
> on GitHub or Sourceforge, so that we can say something like, “SQLite is in
> the top 2nd percentile for open source C projects based on SLOCCount’s line
> counting algorithm”?
>
> I’m intrigued enough to want to do the project, but I don’t think I really
> want to clone the entirety of GitHub onto my HDD in order to find out, even
> if it’s just one project at a time.  That sounds like a great way to blow
> through my Comcast data cap.
>
> > Let's talk about some real numbers to illustrate the situation.
>
> Yes, let’s. :)
>
> > $ time /opt/fossil-head/bin/fossil clone
> > http://netbsd.sonnenberger.org/ netbsd-src.fossil
> >
> > It takes:
> >
> > real    323m2.323s
> > user    42m0.262s
> > sys     13m18.003s
>
> Okay, but compared to what?
>
> If you compare to the checkout time from NetBSD’s main CVS repository, you
> aren’t comparing apples to apples, since with CVS you’re transferring only
> the tip of the trunk.  You have to go back to the CVS server for any
> history.  I suspect if you checked out each CVS revision one at a time, it
> would take a lot longer than pulling the whole project history with Fossil.
>
> If you want to compare with some other DVCS, post those numbers.
>
> > rebuild alone
> > takes around 250 minutes on the same hardware and with the same
> > fossil.
>
> Would a --skip-rebuild option for fossil clone solve your major problem,
> then?
>
> Rebuilding is strictly optional.  It just makes Fossil operations run
> faster post-clone.
>
> Also realize that cloning is a one-time activity per development machine,
> for anyone active enough in the project to maintain their local clone.
>
> A cute option would be for --skip-rebuild to look for a local at(1)
> command, then offer to schedule the rebuild for a later time, after you’ve
> left work for the day.
>
> Although there may be casual clients who clone, do something with the
> source, throw the clone repo away when done, then clone again a year later
> when they need the source again, we should not optimize Fossil for that
> case.
>
> We’ve already discussed shallow clones, which would make Fossil more like
> CVS in terms of clone size.  See the “Fossil 2.0” document Mr. Boogie
> linked to.
>
> > - commit, this is a little bit harder. One file modified and commit takes:
> > real    4m0.765s
> > user    1m55.442s
> > sys     1m11.892s
>
> That seems like a much more important problem to solve.  4 minutes per
> commit is simply *painful*, and it may happen multiple times per day,
> rather than once per development box.
>
> Here, I occasionally see commit times of 10 seconds or so, and that’s
> painful enough already.


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-28 Thread Warren Young
On Oct 28, 2016, at 3:45 AM, Karel Gardas wrote:
> 
> making it more scalable, so that it is really usable for bigger
> projects as well.

How many projects are there bigger than SQLite, percentage-wise?

Has anyone done something like produce a SLOC histogram for all projects on 
GitHub or Sourceforge, so that we can say something like, “SQLite is in the top 
2nd percentile for open source C projects based on SLOCCount’s line counting 
algorithm”?

I’m intrigued enough to want to do the project, but I don’t think I really want 
to clone the entirety of GitHub onto my HDD in order to find out, even if it’s 
just one project at a time.  That sounds like a great way to blow through my 
Comcast data cap.

> Let's talk about some real numbers to illustrate the situation.

Yes, let’s. :)

> $ time /opt/fossil-head/bin/fossil clone
> http://netbsd.sonnenberger.org/ netbsd-src.fossil
> 
> It takes:
> 
> real    323m2.323s
> user    42m0.262s
> sys     13m18.003s

Okay, but compared to what?

If you compare to the checkout time from NetBSD’s main CVS repository, you 
aren’t comparing apples to apples, since with CVS you’re transferring only the 
tip of the trunk.  You have to go back to the CVS server for any history.  I 
suspect if you checked out each CVS revision one at a time, it would take a lot 
longer than pulling the whole project history with Fossil.

If you want to compare with some other DVCS, post those numbers.

> rebuild alone
> takes around 250 minutes on the same hardware and with the same
> fossil.

Would a --skip-rebuild option for fossil clone solve your major problem, then?

Rebuilding is strictly optional.  It just makes Fossil operations run faster 
post-clone.

Also realize that cloning is a one-time activity per development machine, for 
anyone active enough in the project to maintain their local clone.

A cute option would be for --skip-rebuild to look for a local at(1) command, 
then offer to schedule the rebuild for a later time, after you’ve left work 
for the day.
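
As a rough sketch of that idea, the Python below queues a later "fossil rebuild" through at(1). It is purely illustrative: --skip-rebuild does not exist in fossil clone, so this assumes you already have an un-rebuilt repository file by some other means. The helper names build_rebuild_job and schedule_rebuild, and the default time of 22:00, are made up for this example.

```python
import shlex
import subprocess


def build_rebuild_job(repo_path, when="22:00"):
    """Return the at(1) argv and the job text that runs `fossil rebuild`."""
    argv = ["at", when]
    # at(1) reads the commands to run from standard input.
    job = f"fossil rebuild {shlex.quote(repo_path)}\n"
    return argv, job


def schedule_rebuild(repo_path, when="22:00"):
    """Submit the job; requires at(1) and fossil to be installed."""
    argv, job = build_rebuild_job(repo_path, when)
    subprocess.run(argv, input=job, text=True, check=True)


if __name__ == "__main__":
    # Show what would be submitted without actually scheduling anything.
    argv, job = build_rebuild_job("netbsd-src.fossil")
    print(" ".join(argv))
    print(job, end="")
```

Keeping the job construction separate from the submission makes it easy to inspect what would run before handing it to at(1).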

Although there may be casual clients who clone, do something with the source, 
throw the clone repo away when done, then clone again a year later when they 
need the source again, we should not optimize Fossil for that case.

We’ve already discussed shallow clones, which would make Fossil more like CVS 
in terms of clone size.  See the “Fossil 2.0” document Mr. Boogie linked to.

> - commit, this is a little bit harder. One file modified and commit takes:
> real    4m0.765s
> user    1m55.442s
> sys     1m11.892s

That seems like a much more important problem to solve.  4 minutes per commit 
is simply *painful*, and it may happen multiple times per day, rather than once 
per development box.

Here, I occasionally see commit times of 10 seconds or so, and that’s painful 
enough already.


Re: [fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-28 Thread jungle Boogie
On 28 October 2016 at 02:45, Karel Gardas wrote:
> I'm just curious whether people here are tinkering with the idea of
> making it more scalable, so that it is really usable for bigger
> projects as well.


This has been discussed before. I have an email with the subject
"Fossil 2.1: Scaling" from March 2015. The archive at
http://www.mail-archive.com/ doesn't go back that far, though.


Ah, just found it on MARC:
http://marc.info/?l=fossil-users&m=144565850643439&w=2

-- 
---
inum: 883510009027723
sip: jungleboo...@sip2sip.info


[fossil-users] rebuild scale-ability/data written/repo size ratio

2016-10-28 Thread Karel Gardas
Hello,

first of all, I know that Fossil was written to serve the SQLite
project and projects of similar size well, and that it does a great
job at this.

I'm just curious whether people here are tinkering with the idea of
making it more scalable, so that it is really usable for bigger
projects as well.

Now that the git -> fossil (incremental) mirror functionality seems to
be working, this may be even more interesting or tempting, IMHO.

Let's talk about some real numbers to illustrate the situation. Let's
clone the NetBSD src tree, kindly provided by Jörg Sonnenberger, with
the following command:

$ time /opt/fossil-head/bin/fossil clone http://netbsd.sonnenberger.org/ netbsd-src.fossil

It takes:

real    323m2.323s
user    42m0.262s
sys     13m18.003s

on my E5-2620 Sandy Bridge workstation. Of course, part of this time is
probably spent on not-so-efficient network transfer, but the majority
of the time, at least judging from the output of the command, is spent
on the DB rebuild. I know that from the example of the OpenBSD src
tree, which is comparable in size to NetBSD and where rebuild alone
takes around 250 minutes on the same hardware and with the same
fossil.

So much for the time spent on rebuild. What may be even more
important is how much data rebuild writes. I do not have exact
numbers here, but this is my own workstation and I was keeping an eye
on the drive meters, so let's assume I'm not far off in claiming that
rebuild was writing data at ~40 MB/s for 2 hours or even more. In
total this may be around 300 GB of data written during this rebuild
(rounded up). This is for a repository whose final file size is:

$ ls -lha netbsd-src.fossil
-rw-r--r--   1 karel    karel       2.6G Oct 28 01:11 netbsd-src.fossil

and which results in a source tree of 2.7 GB.
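
The estimate above works out as follows. This Python sketch only restates the rough figures from this message; it takes the lower bound of 2 hours, and it treats the 2.6G reported by ls as decimal gigabytes, which is an approximation.

```python
MB = 10**6
GB = 10**9

write_rate = 40 * MB          # bytes per second, read off the drive meters
duration_s = 2 * 60 * 60      # "2 hours or even more", lower bound
repo_size = 2.6 * GB          # final size of netbsd-src.fossil (approximate)

bytes_written = write_rate * duration_s
amplification = bytes_written / repo_size

print(f"Data written during rebuild: {bytes_written / GB:.0f} GB")
print(f"Written bytes per byte of final repo: {amplification:.0f}x")
```

At the lower bound this gives 288 GB, which rounds up to the 300 GB quoted, or on the order of a hundred bytes written for every byte of the final repository.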

Now, just to show that this rebuild may be the biggest scalability
obstacle, I'd like to compare it with the open/status/diff/commit
operations:

- open: results in 2.7 GB of data written to disk in the resulting
NetBSD source tree. It takes:
real    4m38.843s
user    1m44.221s
sys     1m58.553s

IMHO a very nice result for a source tree of this size.

- status/diff -- one random file modified: both run for 4-5 seconds.
Also very nice results for a source tree of this size.

- commit, this is a little bit harder. One file modified and commit takes:
real    4m0.765s
user    1m55.442s
sys     1m11.892s

IMHO not so nice, but still kind of acceptable even for development on
a source tree of this size. But commit may certainly be another target
for speedup hacking.
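
One way to read the time(1) figures above is to compare CPU time (user + sys) with wall-clock time (real). The Python sketch below does that for the clone and the commit; the helper names parse_time and cpu_fraction are made up for this example, and the interpretation is a rough heuristic rather than a profile.

```python
import re


def parse_time(text):
    """Convert a time(1) value such as '4m0.765s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_fraction(real, user, system):
    """Share of wall-clock time that was spent on the CPU."""
    return (parse_time(user) + parse_time(system)) / parse_time(real)


clone = cpu_fraction("323m2.323s", "42m0.262s", "13m18.003s")
commit = cpu_fraction("4m0.765s", "1m55.442s", "1m11.892s")

print(f"clone:  {clone:.0%} of wall time on CPU")
print(f"commit: {commit:.0%} of wall time on CPU")
```

The clone spends under a fifth of its wall time on the CPU, which suggests it is mostly waiting on network or disk, while the commit spends more than three quarters of it there, which would fit a compute-heavy step such as hashing.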

So that's it. The Fossil used in those tests is:

This is fossil version 1.37 [0fa60142eb] 2016-10-26 21:45:52 UTC

and the tests were performed on a ZFS mirror of two SSDs (1TB Crucial
MX200 and 1TB Samsung 850 Evo) on Solaris 11.2 running on an E5-2620
with 32GB RAM, in case anybody is interested in this info for
verifying the numbers.

Cheers,
Karel