Re: freebsd-update's install_verify routine excessive stating
Yoshihiro Ota wrote: Oliver Fromme wrote:

It would be much better to generate two lists:
 - The list of hashes, as already done (filelist)
 - A list of gzipped files present, stripped to the hash:

   (cd files; echo *.gz) | tr ' ' '\n' | sed 's/\.gz$//' > filespresent

Note we use echo instead of ls, in order to avoid the kern.argmax limit. 64000 files would certainly exceed that limit. Also note that the output is already sorted, because the shell sorts wildcard expansions. Now that we have those two files, we can use comm(1) to find out whether there are any hashes in filelist that are not in filespresent:

   if [ -n "$(comm -23 filelist filespresent)" ]; then
           echo -n "Update files missing -- "
           ...
   fi

That solution scales much better because no shell loop is required at all.

This will probably be the fastest.

Are you sure? I'm not. Only a benchmark can answer that. See below.

   awk -F'|' '
       $2 ~ /^f/ {required[$7] = $7; count++}
       END {
           FS = "[/.]"
           while (("find files -name *.gz" | getline) > 0)
               if ($2 in required)
                   if (--count == 0)
                       exit(0)
           exit(count)
       }' $@

I think this awk solution is more difficult to read and understand, which means that it is also more prone to introduce errors. In fact, there are several problems already: First, you didn't quote the *.gz wildcard, so it will fail if there happens to be a file *.gz in the current directory. Second, the script makes assumptions about the format of the hash strings, e.g. that they don't contain dots. (Currently they only contain hex digits, AFAICT, but what if the format changes in the future?) Third, exit(count) is a bad idea, because exit codes are truncated to 8 bits. If 256 files happen to be missing, the script will seem to exit successfully. All these flaws could be fixed, of course, but it would introduce more complexity. The shell code using sort(1) and comm(1) doesn't have those flaws and appears more robust.

It verifies entries using a hash table instead of sorting. Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
I am not well aware how good awk's associative array is, though.

It is pretty simple. It's a hash list that starts with 50 buckets. The number of buckets is doubled (and all entries re-hashed!) when the average number of elements per bucket exceeds 2. The hash function is very simple; it's just hash = hash * 31 + c for every character c in the string, and the result is taken modulo the current number of buckets. Note that freebsd-update uses SHA256 hashes, which are fairly long (64 characters).

BTW, you can easily make benchmarks. The following command will create 64000 files of the format %64x.gz. Be sure to have enough free inodes on your file system (df -i).

   jot -rw '%08x' 64000 0 20 | sed 's/.*/&.gz/' | xargs touch

This took 2 minutes on my notebook (3 years old). I also noticed that the first 47000 inodes were created very quickly (about 5 seconds), but then the speed suddenly dropped to about 150 inodes per second for the rest of the two minutes. CPU was 100% system according to top. Removing the files is equally slow. So there seems to be a limit of about 47000 inodes per directory -- any more than that and it gets horribly inefficient. Therefore it would probably be a good idea to change freebsd-update to create subdirectories and distribute the files among them. For example, make 16 subdirectories [0-9a-f] and put the files into them according to the first digit of the hash. This will probably boost performance noticeably.

Sorting those 64000 files is really *not* an issue. The difference between ls and ls -f is only 0.2 seconds on my notebook. When using ls -f | sort, sort(1) uses only 0.12 seconds of CPU time. This is negligible.

Next I made a simple test with awk within that directory:

   awk 'BEGIN{while(("find . -name \\*.gz" | getline) > 0) required[$1]=$1}'

That script (which does only half of the required work) takes 4 seconds, reproducibly. So I didn't bother to write and test the full awk solution.
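For readers who want to try the two-list approach end to end, here is a small self-contained sketch. The names (files/, filelist, filespresent) follow the thread's examples; the toy four-character "hashes" and the demo directory are made up for illustration.

```shell
# Build both sorted lists, then let comm(1) report required hashes
# that have no matching .gz file.  Real hashes are 64 hex characters.
mkdir -p demo/files && cd demo
printf '%s\n' 1111 2222 3333 | sort > filelist        # hashes required
touch files/1111.gz files/2222.gz files/9999.gz       # files present
# One directory read instead of one stat(2) per hash; the shell sorts
# the wildcard expansion, so filespresent is already sorted.
(cd files; echo *.gz) | tr ' ' '\n' | sed 's/\.gz$//' > filespresent
missing=$(comm -23 filelist filespresent)
if [ -n "$missing" ]; then
        echo "Update files missing -- $missing"
fi
```

Note that the extra files/9999.gz is harmless: comm -23 only prints lines unique to the first file, so surplus .gz files never trigger the "missing" branch.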
Bottom line: The awk solution is less robust, less readable, and much slower. Best regards Oliver -- Oliver Fromme, secnetix GmbH Co. KG, Marktplatz 29, 85567 Grafing b. M. Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung: secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün- chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd The most important decision in [programming] language design concerns what is to be left out. -- Niklaus Wirth ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
doubts regarding System Initialization working (SYSINIT)
Hello all, I have been browsing through the FreeBSD kernel's source code, trying to understand how it works. In mi_startup() in /sys/kern/init_main.c, all the SYSINIT objects are sorted using bubble sort and then executed in order.

My doubt is that we have declared the pointer to the struct sysinit as a const pointer to a const in the macro definition of SYSINIT, i.e. when the macro SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_FIRST, kmeminit, NULL) is expanded completely, we get the following:

   static struct sysinit kmem_sys_init = {
           SI_SUB_KMEM,
           SI_ORDER_FIRST,
           (sysinit_cfunc_t)(sysinit_nfunc_t)kmeminit,
           ((void *)(((void *)0)))
   };
   static void const * const __set_sysinit_set_sym_kmem_sys_init
       __attribute__((__section__("set_sysinit_set")))
       __attribute__((__used__)) = &kmem_sys_init;

Here we see that the pointer is const and points to a const, but when we sort and swap using *sipp = *xipp; we are trying to change the address stored in a const pointer, in which case it should segfault -- but it works fine. Why does it not segfault? It seems I have not understood the concept behind using const * const... I will be very thankful if you can help me with it. Regards, Mehul
Re: freebsd-update's install_verify routine excessive stating
Oliver Fromme wrote: Yoshihiro Ota wrote: Oliver Fromme wrote: [...]

That solution scales much better because no shell loop is required at all.

This will probably be the fastest.

Are you sure? I'm not.

I'd put money on this being faster for a lot of reasons. test is a builtin in our /bin/sh, so there is no exec involved for 'test -f', but going out to disk for 64k files on an individual basis should definitely be slower than getting the file list in one shot. There's no doubt that the current routine is not efficient. The cat should be eliminated; the following is equivalent:

   cut -f 2,7 -d '|' $@ |

(quoting the $@ won't make a difference here). I haven't seen the files we're talking about, but I can't help thinking that cut | grep | cut could be streamlined.

Only a benchmark can answer that.

Agreed; when making changes like this you should always benchmark them. I did a lot of that when working on portmaster 2.0, which is why I have some familiarity with this issue.

   awk -F'|' '
       $2 ~ /^f/ {required[$7] = $7; count++}
       END {
           FS = "[/.]"
           while (("find files -name *.gz" | getline) > 0)
               if ($2 in required)
                   if (--count == 0)
                       exit(0)
           exit(count)
       }' $@

I think this awk solution is more difficult to read and understand, which means that it is also more prone to introduce errors.
I agree, but I have only passing familiarity with awk, so to someone who knows awk this might look like hello world. :) Doug -- This .signature sanitized for your protection
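Doug's "64k individual stats versus one listing" point can be sketched like this. The file names and the two-entry hash list are hypothetical; the real routine checks on the order of 64000 hashes, which is where the per-file loop hurts.

```shell
# Slow pattern: one stat(2) per hash via the test builtin.
mkdir -p files2
printf 'aa\ncc\n' > filelist2                 # sorted list of required hashes
touch files2/aa.gz files2/bb.gz               # files actually present
while read hash; do
        [ -f "files2/$hash.gz" ] || echo "missing $hash"
done < filelist2

# Fast pattern: read the directory once, then compare the two sets.
(cd files2; echo *.gz) | tr ' ' '\n' | sed 's/\.gz$//' > present2
```

Both patterns find that hash "cc" has no .gz file, but the first one costs a filesystem round trip per hash while the second costs one directory read plus a pipeline.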
Re: freebsd-update's install_verify routine excessive stating
Doug Barton wrote: Oliver Fromme wrote: Yoshihiro Ota wrote: Oliver Fromme wrote: [...]

That solution scales much better because no shell loop is required at all. This will probably be the fastest. Are you sure? I'm not.

I'd put money on this being faster for a lot of reasons.

I assume with this you mean my solution to the slow shell loop problem (not quoted above), not Yoshihiro Ota's awk proposal?

test is a builtin in our /bin/sh, so there is no exec involved for 'test -f', but going out to disk for 64k files on an individual basis should definitely be slower than getting the file list in one shot.

Correct.

There's no doubt that the current routine is not efficient. The cat should be eliminated; the following is equivalent:

   cut -f 2,7 -d '|' $@ |

(quoting the $@ won't make a difference here).

Right, technically it doesn't make a difference because the file names are fixed and don't contain spaces. *But* then it is better to use $*. Every time I see $@ without double quotes, I wonder if the author forgot to add them. It always smells like a bug. Using $@ without quotes is pointless, because then it behaves exactly the same as $*.

I haven't seen the files we're talking about, but I can't help thinking that cut | grep | cut could be streamlined.

Yes, it can. I already explained pretty much all of that (useless cat etc.)
in my first post in this thread. Did you read it? My suggestion (after a small correction by Christoph Mallon) was to replace the cat|cut|grep|cut sequence with this single awk command:

   awk -F'|' '$2 ~ /^f/ {print $7}' $@

For those not fluent in awk, it means this:
 - Treat | as the field separator.
 - Search for lines where the second field matches ^f (i.e. it starts with an f).
 - Print the 7th field of those matching lines.

Best regards, Oliver
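To see the one-liner in action, here is a sketch on made-up index lines. The path|type|...|hash field layout below is only assumed from the thread's description (field 2 is a type letter, field 7 a hash), not copied from a real freebsd-update index file.

```shell
# Only 'f' (regular file) lines carry a hash in field 7; the directory
# line is skipped by the $2 ~ /^f/ test, and its empty field 7 never
# pollutes the output.
cat > sample.idx <<'EOF'
/bin/ls|f|0|0|0555|0|aabbcc|
/bin|d|0|0|0755|0||
/bin/cat|f|0|0|0555|0|ddeeff|
EOF
awk -F'|' '$2 ~ /^f/ {print $7}' sample.idx
```

This replaces the cat|cut|grep|cut pipeline with a single process: the field split, the type filter, and the hash extraction all happen in one pass.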
Re: freebsd-update's install_verify routine excessive stating
Oliver Fromme wrote: I assume with this you mean my solution to the slow shell loop problem (not quoted above), not Yoshihiro Ota's awk proposal?

I meant the solution using comm, sorry. (I forgot to mention that I would probably use cmp here, but that's a personal preference.)

Yes, it can. I already explained pretty much all of that (useless cat etc.) in my first post in this thread. Did you read it?

Yes, I was attempting to agree with you. :)

My suggestion (after a small correction by Christoph Mallon) was to replace the cat|cut|grep|cut sequence with this single awk command:

   awk -F'|' '$2 ~ /^f/ {print $7}' $@

For those not fluent in awk, it means this:
 - Treat | as the field separator.
 - Search for lines where the second field matches ^f (i.e. it starts with an f).
 - Print the 7th field of those matching lines.

Like I said, I haven't seen the files, but this looks good at first blush. That said, the generation of the hash list file is just a drop in the bucket. The real inefficiency in this function is the test -f for 64k files, one at a time. Doug
Re: freebsd-update's install_verify routine excessive stating
Doug Barton wrote: Oliver Fromme wrote: I assume with this you mean my solution to the slow shell loop problem (not quoted above), not Yoshihiro Ota's awk proposal?

I meant the solution using comm, sorry. (I forgot to mention that I would probably use cmp here, but that's a personal preference.)

I see. No problem. However, I think cmp wouldn't work here, because cmp only detects whether there is a difference between two files. In this case we need to know whether one file is a subset of the other: for every hash there must be a .gz file, but it doesn't hurt if there are more files. So the list of hashes can be a subset of the list of .gz files; they don't have to be equal.

While I was at it, I skimmed through the cmp source and found a bug (or inefficiency): when the -s option is specified (i.e. silent, exit code only), it would be sufficient to terminate when the first difference is encountered, but it always compares the whole files. I'll try to make a patch to improve this.

Yes, it can. I already explained pretty much all of that (useless cat etc.) in my first post in this thread. Did you read it?

Yes, I was attempting to agree with you. :)

OK, sorry. I misunderstood. :)

My suggestion (after a small correction by Christoph Mallon) was to replace the cat|cut|grep|cut sequence with this single awk command:

   awk -F'|' '$2 ~ /^f/ {print $7}' $@

For those not fluent in awk, it means this:
 - Treat | as the field separator.
 - Search for lines where the second field matches ^f (i.e. it starts with an f).
 - Print the 7th field of those matching lines.

Like I said, I haven't seen the files, but this looks good at first blush. That said, the generation of the hash list file is just a drop in the bucket. The real inefficiency in this function is the test -f for 64k files, one at a time.

Yes, definitely. Best regards, Oliver
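Oliver's subset argument is easy to demonstrate. A minimal sketch with made-up hash lists (the file names hashes and present are invented for the demo):

```shell
# hashes is a proper subset of present: no required hash is missing,
# yet the two files are not byte-identical.
printf '1111\n2222\n' > hashes            # required
printf '1111\n2222\n3333\n' > present     # present, one extra file
cmp -s hashes present
echo "cmp exit code: $?"                  # nonzero: the files differ
comm -23 hashes present                   # prints nothing: nothing missing
```

cmp answers "are these files identical?", which is the wrong question here; comm -23 answers "which required hashes have no counterpart?", which is exactly the check install_verify needs.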
Re: doubts regarding System Initialization working (SYSINIT)
On Friday 23 January 2009 10:55:32 am Mehul Chadha wrote: [...] Here we see that the pointer is const and points to a const, but when we sort and swap using *sipp = *xipp; we are trying to change the address stored in a const pointer, in which case it should segfault -- but it works fine. Why does it not segfault?

I'm guessing the startup code doesn't map the SYSINIT pages read-only because it is not smart enough to honor that request, perhaps. That is, I wouldn't be surprised if all of .rodata in the kernel was mapped as R/W instead of R/O. -- John Baldwin
Re: doubts regarding System Initialization working (SYSINIT)
John Baldwin wrote: On Friday 23 January 2009 10:55:32 am Mehul Chadha wrote: [...] Why does it not segfault?

I'm guessing the startup code doesn't map the SYSINIT pages read-only because it is not smart enough to honor that request, perhaps. That is, I wouldn't be surprised if all of .rodata in the kernel was mapped as R/W instead of R/O.

I think I have an ancient patch from someone to fix the code to not do this; let me dig for it. Sam
Re: freebsd-update's install_verify routine excessive stating
Oliver, You are making a good point, but connecting two different subjects.

The algorithm with awk is still the fastest in theory. It uses a hash table, which runs in O(1) for each comparison, ASSUMING you have a good hash function that yields such a result. The total number of comparisons is O(n), or O(C1 * n) where C1 is some constant based on the hash function's performance. On the other hand, comparison sorts take O(n * log(n)) comparisons, and the subsequent comparison of the two files is O(n). So the total is still O(n * log(n)), or O(C2 * n * log(n)) where C2 is some constant based on the algorithm and implementation.

That is in theory. In reality or practice, it may not be true, because you could have a poor C1 such that the sort-based solution becomes better, or because n is small enough that it doesn't matter which solution you take. In fact, the awk implementation was poor and resulted in a large C1, as well as being slower than sort(1). Experimentation is still important so that you can determine the best algorithm in your TEST environment. This is where you are making a point. If you have control over these conditions, you may choose not to use the theoretically best algorithm. I agree, and had indeed expected your outcome, as awk is actually known for inefficiency in its implementations.

In that regard, from the practical point of view, the original implementation by cperciva@ was good enough such that searching for faster algorithms was not necessary for anyone except Bert. At the end, it is up to the implementer to pick the solution. Regards, Hiro

On Fri, 23 Jan 2009 12:09:03 +0100 (CET) Oliver Fromme o...@lurza.secnetix.de wrote: [...]