Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Oliver Fromme

Yoshihiro Ota wrote:
  Oliver Fromme wrote:
   It would be much better to generate two lists:
- The list of hashes, as already done (filelist)
- A list of gzipped files present, stripped to the hash:
   
  (cd files; echo *.gz) |
  tr ' ' '\n' |
  sed 's/\.gz$//' > filespresent
   
   Note we use echo instead of ls, in order to avoid the
   kern.argmax limit.  64000 files would certainly exceed that
   limit.  Also note that the output is already sorted because
   the shell sorts wildcard expansions.
   
   Now that we have those two files, we can use comm(1) to
   find out whether there are any hashes in filelist that are
   not in filespresent:
   
  if [ -n "$(comm -23 filelist filespresent)" ]; then
  echo -n "Update files missing -- "
  ...
  fi
   
   That solution scales much better because no shell loop is
   required at all.
  
  This will probably be the fastest.

Are you sure?  I'm not.
Only a benchmark can answer that.  See below.

  awk -F "|" '
  $2 ~ /^f/{required[$7]=$7; count++}
  END{FS="[/.]";
   while("find files -name *.gz" | getline > 0)
    if($2 in required)
     if(--count<=0)
      exit(0);
   exit(count)}' "$@"

I think this awk solution is more difficult to read and
understand, which means that it is also more prone to
introduce errors.  In fact, there are several problems
already:

First, you didn't quote the *.gz wildcard, so it will
fail if there happens to be a file matching *.gz in the
current directory.

Second, the script makes assumptions about the format of
the hash strings, e.g. that they don't contain dots.
(Currently they only contain hex digits, AFAICT, but what
if the format changes in the future?)

Third, exit(count) is a bad idea, because exit codes are
truncated to 8 bits.  If 256 files happen to be missing,
the script will seem to exit successfully.
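
That is easy to demonstrate from the shell, because the exit status
is reported modulo 256:

awk 'BEGIN { exit(256) }'; echo $?    # prints 0, i.e. apparent success
awk 'BEGIN { exit(255) }'; echo $?    # prints 255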

All these flaws could be fixed, of course, but it will
introduce more complexity.  The shell code using sort(1)
and comm(1) doesn't have those flaws and appears more
robust.
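
For reference, a rough end-to-end sketch of that approach (the awk
line is the cat|cut|grep|cut replacement suggested earlier in the
thread; the sort -u is added here only because comm(1) needs sorted
input):

awk -F "|" '$2 ~ /^f/ {print $7}' "$@" | sort -u > filelist
(cd files; echo *.gz) | tr ' ' '\n' | sed 's/\.gz$//' > filespresent
if [ -n "$(comm -23 filelist filespresent)" ]; then
        echo -n "Update files missing -- "
        # ... re-fetch the missing files or bail out here ...
fi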

  It verifies entries using hashtable instead of sort.
  Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
  I am not well aware how good awk's associate array is, though.

It is pretty simple.  It's a hash list that starts with
50 buckets.  The number of buckets is doubled (and all
entries re-hashed!) when the average number of elements
per bucket exceeds 2.  The hash function is very simple,
it's just hash = hash * 31 + c for every character c
in the string, the result is modulo the current number
of buckets.  Note that freebsd-update uses SHA256 hashes
which are fairly long (64 characters).
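
If you want a feeling for how such strings spread over the buckets,
here is a rough sketch of the arithmetic described above (this is
*not* awk's actual code: index() into a fixed alphabet merely stands
in for the character code, since POSIX awk has no ord(), and the
modulo is applied at every step to keep the numbers small):

awk 'BEGIN {
        alphabet = "0123456789abcdef"
        hash = "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
        h = 0
        for (i = 1; i <= length(hash); i++)
                h = (h * 31 + index(alphabet, substr(hash, i, 1))) % 50
        print "bucket", h, "of 50"
}'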

BTW, you can easily make benchmarks.  The following command
will create 64000 files of the format %64x.gz.  Be sure
to have enough free inodes on your file system (df -i).

jot -rw %08x 64000 0 20 | sed 's/.*/&.gz/' | xargs touch

This took 2 minutes on my notebook (3 years old).  I also
noticed that the first 47000 inodes were created very
quickly (about 5 seconds), but then the speed dropped down
suddenly to about 150 inodes per second for the rest of
the two minutes.  CPU was 100% system according to top.
Removing the files is equally slow.

So there seems to be a limit of about 47000 inodes per
directory -- any more than that and it gets horribly
inefficient.  Therefore it would probably be a good idea
to change freebsd-update to create subdirectories and
distribute the files among them.  For example, make 16
subdirectories [0-9a-f] and put the files into them
according to the first digit of the hash.  This will
probably boost performance noticeably.
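
A rough sketch of what that layout could look like (the directory
names and the mv batching are just an illustration, not a patch to
freebsd-update):

cd files
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        mkdir -p "$d"
        # echo instead of ls for the same kern.argmax reason as above;
        # xargs -J batches the moves.  A real script would also have to
        # handle the case where no file starts with $d.
        echo "$d"*.gz | xargs -J % mv % "$d"/
done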

Sorting those 64000 files is really *not* an issue.  The
difference between ls and ls -f is only 0.2 seconds on
my notebook.  When using ls -f | sort, sort(1) uses only
0.12 seconds of CPU time.  This is negligible.
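
In case anybody wants to reproduce that, the measurements come down
to something like the following, run inside the directory with the
64000 files (using /usr/bin/time here is just one way of timing it):

/usr/bin/time ls > /dev/null
/usr/bin/time ls -f > /dev/null
ls -f | /usr/bin/time sort > /dev/null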

Next I made a simple test with awk within that directory:

awk 'BEGIN{while("find . -name \\*.gz" | getline > 0)required[$1]=$1}'

That script (which does only half of the required work)
takes 4 seconds, reproducible.  So I didn't bother to
write and test the full awk solution.

Bottom line:  The awk solution is less robust, less readable,
and much slower.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

The most important decision in [programming] language design
concerns what is to be left out.  --  Niklaus Wirth


doubts regarding System Initialization working (SYSINIT)

2009-01-23 Thread Mehul Chadha
Hello all,
I have been browsing through the FreeBSD kernel's
source code, trying to understand how it works.

In the mi_startup() in /sys/kern/init_main.c all the SYSINIT objects
are sorted using bubble sort and then they are executed in order.

My doubt is that we have declared the pointer to the struct sysinit as
a const pointer to a const in the macro definition of SYSINIT, i.e. when
the macro

SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_FIRST, kmeminit, NULL)

is expanded completely we get the following:

static struct sysinit kmem_sys_init = { SI_SUB_KMEM, SI_ORDER_FIRST,
    (sysinit_cfunc_t)(sysinit_nfunc_t)kmeminit, ((void *)(((void *)0))) };
static void const * const __set_sysinit_set_sym_kmem_sys_init
    __attribute__((__section__("set_sysinit_set")))
    __attribute__((__used__)) = &kmem_sys_init;

Here we see that the pointer is of type const and to a const but when we sort
and swap using
  *sipp=*xipp;

We are trying to change the address stored in the const pointer to a
new address, in which case it should segfault, but it works fine.

Why does it not segfault?  It seems I have not understood the concept
behind using const * const...  I will be very thankful if you can help
me with it.


Regards,
Mehul


Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Doug Barton
Oliver Fromme wrote:
 Yoshihiro Ota wrote:
   Oliver Fromme wrote:
It would be much better to generate two lists:
 - The list of hashes, as already done (filelist)
 - A list of gzipped files present, stripped to the hash:

   (cd files; echo *.gz) |
   tr ' ' '\n' |
   sed 's/\.gz$//' > filespresent

Note we use echo instead of ls, in order to avoid the
kern.argmax limit.  64000 files would certainly exceed that
limit.  Also note that the output is already sorted because
the shell sorts wildcard expansions.

Now that we have those two files, we can use comm(1) to
find out whether there are any hashes in filelist that are
not in filespresent:

   if [ -n "$(comm -23 filelist filespresent)" ]; then
   echo -n "Update files missing -- "
   ...
   fi

That solution scales much better because no shell loop is
required at all.
   
   This will probably be the fastest.
 
 Are you sure?  I'm not.

I'd put money on this being faster for a lot of reasons. test is a
builtin in our /bin/sh, so there is no exec involved for 'test -f',
but going out to disk for 64k files on an individual basis should
definitely be slower than getting the file list in one shot.

There's no doubt that the current routine is not efficient. The cat
should be eliminated, the following is equivalent:

cut -f 2,7 -d '|' $@ |

(quoting the $@ won't make a difference here).

I haven't seen the files we're talking about, but I can't help
thinking that cut | grep | cut could be streamlined.

 Only a benchmark can answer that. 

Agreed, when making changes like this you should always benchmark
them. I did a lot of that when working on portmaster 2.0 which is why
I have some familiarity with this issue.

   awk -F "|" '
   $2 ~ /^f/{required[$7]=$7; count++}
   END{FS="[/.]";
    while("find files -name *.gz" | getline > 0)
     if($2 in required)
      if(--count<=0)
       exit(0);
    exit(count)}' "$@"
 
 I think this awk solution is more difficult to read and
 understand, which means that it is also more prone to
 introduce errors. 

I agree, but I have only passing familiarity with awk, so to someone
who knows awk this might look like hello world. :)

Doug

-- 

This .signature sanitized for your protection


Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Oliver Fromme

Doug Barton wrote:
  Oliver Fromme wrote:
   Yoshihiro Ota wrote:
Oliver Fromme wrote:
 It would be much better to generate two lists:
  - The list of hashes, as already done (filelist)
  - A list of gzipped files present, stripped to the hash:
 
(cd files; echo *.gz) |
tr ' ' '\n' |
sed 's/\.gz$//' > filespresent
 
 Note we use echo instead of ls, in order to avoid the
 kern.argmax limit.  64000 files would certainly exceed that
 limit.  Also note that the output is already sorted because
 the shell sorts wildcard expansions.
 
 Now that we have those two files, we can use comm(1) to
 find out whether there are any hashes in filelist that are
 not in filespresent:
 
if [ -n "$(comm -23 filelist filespresent)" ]; then
echo -n "Update files missing -- "
...
fi
 
 That solution scales much better because no shell loop is
 required at all.

This will probably be the fastest.
   
   Are you sure?  I'm not.
  
  I'd put money on this being faster for a lot of reasons.

I assume, with this you mean my solution to the slow
shell loop problem (not quoted above), not Yoshihiro Ota's
awk proposal?

  test is a
  builtin in our /bin/sh, so there is no exec involved for 'test -f',
  but going out to disk for 64k files on an individual basis should
  definitely be slower than getting the file list in one shot.

Correct.

  There's no doubt that the current routine is not efficient. The cat
  should be eliminated, the following is equivalent:
  
  cut -f 2,7 -d '|' $@ |
  
  (quoting the $@ won't make a difference here).

Right, technically it doesn't make a difference because the
file names are fixed and don't contain spaces.  *But* then
it is better to use $*.  Every time I see $@ without double
quotes I wonder if the author forgot to add them.  It always
smells like a bug.  Using $@ without quotes is pointless
because then it behaves exactly the same as $*.
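
A small illustration of the difference (the arguments are made up):

set -- "one two" three
printf '[%s]\n' "$@"    # [one two] [three]    -- arguments preserved
printf '[%s]\n' $@      # [one] [two] [three]  -- re-split, same as unquoted $*
printf '[%s]\n' "$*"    # [one two three]      -- joined into a single string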

  I haven't seen the files we're talking about, but I can't help
  thinking that cut | grep | cut could be streamlined.

Yes, it can.  I already explained pretty much all of that
(useless cat etc.) in my first post in this thread.  Did
you read it?  My suggestion (after a small correction by
Christoph Mallon) was to replace the cat|cut|grep|cut
sequence with this single awk command:

awk -F "|" '$2 ~ /^f/ {print $7}' "$@"

For those not fluent with awk, it means this:
 - Treat | as field separator.
 - Search for lines where the second field matches ^f
   (i.e. it starts with an f).
 - Print the 7th field of those matching lines.
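
To illustrate with a made-up input line (I'm inventing the other
fields here; only the positions discussed above -- type in $2,
hash in $7 -- matter):

echo '/bin/ls|f|0|0|0555|0|0123456789abcdef|' |
    awk -F "|" '$2 ~ /^f/ {print $7}'
# prints: 0123456789abcdef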

Best regards
Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

In my experience the term transparent proxy is an oxymoron (like jumbo
shrimp).  Transparent proxies seem to vary from the distortions of a
funhouse mirror to barely translucent.  I really, really dislike them
when trying to figure out the corrective lenses needed with each of them.
-- R. Kevin Oberman, Network Engineer


Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Doug Barton
Oliver Fromme wrote:
 I assume, with this you mean my solution to the slow
 shell loop problem (not quoted above), not Yoshihiro Ota's
 awk proposal?

I meant the solution using comm, sorry. (I forgot to mention that I
would probably use cmp here, but that's a personal preference.)

 Yes, it can.  I already explained pretty much all of that
 (useless cat etc.) in my first post in this thread.  Did
 you read it? 

Yes, I was attempting to agree with you. :)

 My suggestion (after a small correction by
 Christoph Mallon) was to replace the cat|cut|grep|cut
 sequence with this single awk command:
 
 awk -F "|" '$2 ~ /^f/ {print $7}' "$@"

 For those not fluent with awk, it means this:
  - Treat | as field separator.
  - Search for lines where the second field matches ^f
(i.e. it starts with an f).
  - Print the 7th field of those matching lines.

Like I said, I haven't seen the files, but this looks good at first
blush. That said, the generation of the hash list file is just a drop
in the bucket. The real inefficiency in this function is the test -f
for 64k files, one at a time.


Doug

-- 

This .signature sanitized for your protection


Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Oliver Fromme

Doug Barton wrote:
  Oliver Fromme wrote:
   I assume, with this you mean my solution to the slow
   shell loop problem (not quoted above), not Yoshihiro Ota's
   awk proposal?
  
  I meant the solution using comm, sorry. (I forgot to mention that I
  would probably use cmp here, but that's a personal preference.)

I see.  No problem.

However, I think cmp wouldn't work here, because cmp only
detects whether there is a difference between two files.

In this case we need to know if one file is a subset of
the other:  For every hash there must be a .gz file, but
it doesn't hurt if there are more files.  So the list of
hashes can be a subset of the list of .gz files, they
don't have to be equal.
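
A tiny example of how comm(1) expresses that subset test (the file
contents are invented):

printf 'aaa\nbbb\n' > filelist              # hashes we need
printf 'aaa\nbbb\nccc\n' > filespresent     # hashes we have as .gz files
comm -23 filelist filespresent              # no output: nothing is missing
printf 'aaa\nbbb\nddd\n' > filelist
comm -23 filelist filespresent              # prints ddd: that file is missing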

While I was at it, I skimmed through the cmp source and
found a bug (or inefficiency):  When the -s option is
specified (i.e. silent, exit code only), it would be
sufficient to terminate when the first difference is
encountered.  But it always compares the whole files.
I'll try to make a patch to improve this.

   Yes, it can.  I already explained pretty much all of that
   (useless cat etc.) in my first post in this thread.  Did
   you read it? 
  
  Yes, I was attempting to agree with you. :)

OK, sorry.  I misunderstood.  :)

   My suggestion (after a small correction by
   Christoph Mallon) was to replace the cat|cut|grep|cut
   sequence with this single awk command:
   
   awk -F "|" '$2 ~ /^f/ {print $7}' "$@"
   
   For those not fluent with awk, it means this:
- Treat | as field separator.
- Search for lines where the second field matches ^f
  (i.e. it starts with an f).
- Print the 7th field of those matching lines.
  
  Like I said, I haven't seen the files, but this looks good at first
  blush. That said, the generation of the hash list file is just a drop
  in the bucket. The real inefficiency in this function is the test -f
  for 64k files, one at a time.

Yes, definitely.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

We will perhaps eventually be writing only small modules which are identi-
fied by name as they are used to build larger ones, so that devices like
indentation, rather than delimiters, might become feasible for expressing
local structure in the source language. -- Donald E. Knuth, 1974


Re: doubts regarding System Initialization working (SYSINIT)

2009-01-23 Thread John Baldwin
On Friday 23 January 2009 10:55:32 am Mehul Chadha wrote:
 Hello all,
 I have been browsing through the FreeBSD kernel's
 source code trying to understand its working .
 
 In the mi_startup() in /sys/kern/init_main.c all the SYSINIT objects
 are sorted using bubble sort and then they are executed in order.
 
 My doubt is that we have declared the pointer to the struct sysinit as
 const pointer to a const in the macro definition of SYSINIT ie  when
 the macro
 
 SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_FIRST, kmeminit, NULL)  is
 expanded  completely we get the following
 
 static struct sysinit kmem_sys_init = { SI_SUB_KMEM, SI_ORDER_FIRST,
     (sysinit_cfunc_t)(sysinit_nfunc_t)kmeminit, ((void *)(((void *)0))) };
 static void const * const __set_sysinit_set_sym_kmem_sys_init
     __attribute__((__section__("set_sysinit_set")))
     __attribute__((__used__)) = &kmem_sys_init;
 
 Here we see that the pointer is of type const and to a const but when we 
sort
 and swap using
   *sipp=*xipp;
 
 We are trying to change the address of const pointer to a new address
 in which case it should segfault but it works fine.
 
 Why does it not segfault it seems I have not understood the concept
 behind using const *const... I will be very thankful if you can help
 me with it.

I'm guessing the startup code doesn't map the SYSINIT pages read only because 
it is not smart enough to honor that request perhaps.  That is, I wouldn't be 
surprised if all of .rodata in the kernel was mapped as R/W instead of R/O.

-- 
John Baldwin


Re: doubts regarding System Initialization working (SYSINIT)

2009-01-23 Thread Sam Leffler

John Baldwin wrote:

On Friday 23 January 2009 10:55:32 am Mehul Chadha wrote:
  

Hello all,
I have been browsing through the FreeBSD kernel's
source code trying to understand its working .

In the mi_startup() in /sys/kern/init_main.c all the SYSINIT objects
are sorted using bubble sort and then they are executed in order.

My doubt is that we have declared the pointer to the struct sysinit as
const pointer to a const in the macro definition of SYSINIT ie  when
the macro

SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_FIRST, kmeminit, NULL)  is
expanded  completely we get the following

static struct sysinit kmem_sys_init = { SI_SUB_KMEM, SI_ORDER_FIRST,
    (sysinit_cfunc_t)(sysinit_nfunc_t)kmeminit, ((void *)(((void *)0))) };
static void const * const __set_sysinit_set_sym_kmem_sys_init
    __attribute__((__section__("set_sysinit_set")))
    __attribute__((__used__)) = &kmem_sys_init;

Here we see that the pointer is of type const and to a const but when we sort
and swap using
  *sipp=*xipp;

We are trying to change the address of const pointer to a new address
in which case it should segfault but it works fine.

Why does it not segfault it seems I have not understood the concept
behind using const *const... I will be very thankful if you can help
me with it.



I'm guessing the startup code doesn't map the SYSINIT pages read only because 
it is not smart enough to honor that request perhaps.  That is, I wouldn't be 
surprised if all of .rodata in the kernel was mapped as R/W instead of R/O.


  


I think I have an ancient patch from someone to fix the code to not do 
this; let me dig for it.


   Sam



Re: freebsd-update's install_verify routine excessive stating

2009-01-23 Thread Yoshihiro Ota
Oliver,

You are making a good point but connecting two different subjects.

The algorithm with awk is still the fastest in theory.
It uses a hash table, which costs O(c) per lookup, ASSUMING you have
a good hash function that yields such a result.  The total number of
comparisons is O(n), or O(C1 * n) where C1 is some constant based on
the hash function's performance.

On the other hand, comparison sorts need O(n * log(n)) comparisons.
The second step, comparing the two files, is O(n).  So the total is
still O(n * log(n)), or O(C2 * n * log(n)) where C2 is some constant
based on the algorithm and implementation.

That is the theory.  In practice it may not hold, because you could
have a poor C1 such that the comparison-sort solution becomes better,
or because n is small enough that it doesn't matter which solution you
take.  In fact, the awk implementation was poor, resulting in a large
C1, and it turned out slower than sort(1).

Experimentation is still important so that you can determine the best
algorithm in your test environment.  This is where you are making a
point.  If you have control over these conditions, you may choose not
to use the theoretically best algorithm.  I agree, and had indeed
expected your outcome, as awk is known for inefficiency in its
implementations.

In that regard, from a practical point of view, the original implementation
by cperciva@ was good enough that searching for faster algorithms
was not necessary for anyone except Bert.

In the end, it is up to the implementer to pick the solution.

Regards,
Hiro


On Fri, 23 Jan 2009 12:09:03 +0100 (CET)
Oliver Fromme o...@lurza.secnetix.de wrote:

 
 Yoshihiro Ota wrote:
   Oliver Fromme wrote:
It would be much better to generate two lists:
 - The list of hashes, as already done (filelist)
 - A list of gzipped files present, stripped to the hash:

   (cd files; echo *.gz) |
   tr ' ' '\n' |
   sed 's/\.gz$//' > filespresent

Note we use echo instead of ls, in order to avoid the
kern.argmax limit.  64000 files would certainly exceed that
limit.  Also note that the output is already sorted because
the shell sorts wildcard expansions.

Now that we have those two files, we can use comm(1) to
find out whether there are any hashes in filelist that are
not in filespresent:

   if [ -n "$(comm -23 filelist filespresent)" ]; then
   echo -n "Update files missing -- "
   ...
   fi

That solution scales much better because no shell loop is
required at all.
   
   This will probably be the fastest.
 
 Are you sure?  I'm not.
 Only a benchmark can answer that.  See below.
 
   awk -F "|" '
   $2 ~ /^f/{required[$7]=$7; count++}
   END{FS="[/.]";
    while("find files -name *.gz" | getline > 0)
     if($2 in required)
      if(--count<=0)
       exit(0);
    exit(count)}' "$@"
 
 I think this awk solution is more difficult to read and
 understand, which means that it is also more prone to
 introduce errors.  In fact, there are several problems
 already:
 
 First, you didn't quote the *.gz wildcard, so it will
 fail if there happens to be a file *.gz in the current
 directory.
 
 Second, the script makes assumptions about the format of
 the hash strings, e.g. that they don't contain dots.
 (Currently they only contain hex digits, AFAICT, but what
 if the format changes in the future.)
 
 Third, exit(count) is a bad idea, because exit codes are
 truncated to 8 bits.  If 256 files happen to be missing,
 the script will seem to exit successfully.
 
 All these flaws could be fixed, of course, but it will
 introduce more complexity.  The shell code using sort(1)
 and comm(1) doesn't have those flaws and appears more
 robust.
 
   It verifies entries using hashtable instead of sort.
   Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
   I am not well aware how good awk's associate array is, though.
 
 It is pretty simple.  It's a hash list that starts with
 50 buckets.  The number of buckets is doubled (and all
 entries re-hashed!) when the average number of elements
 per bucket exceeds 2.  The hash function is very simple,
 it's just hash = hash * 31 + c for every character c
 in the string, the result is modulo the current number
 of buckets.  Note that freebsd-update uses SHA256 hashes
 which are fairly long (64 characters).
 
 BTW, you can easily make benchmarks.  The following command
 will create 64000 files of the format %64x.gz.  Be sure
 to have enough free inodes on your file system (df -i).
 
 jot -rw %08x 64000 0 20 | sed 's/.*/&.gz/' | xargs touch
 
 This took 2 minutes on my notebook (3 years old).  I also
 noticed that the first 47000 inodes were created very
 quickly (about 5 seconds), but then the speed dropped down
 suddenly to about 150 inodes per second for the rest of
 the two minutes.  CPU was 100% system according to top.
 Removing the files is equally slow.
 
 So there seems to be a