Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Chet Ramey wrote:
> I'm sure there are efficiency improvements possible in the bash indexed array implementation, but sequentially accessing a data structure optimized for space and sparse arrays is never going to be as fast as a read-process loop, and that difference becomes more and more apparent the larger the array.

Maybe bash should remember the last position to optimize accessing the next element?

There are also always hash tables, which are a bit more expensive in memory use, but would provide faster lookups (and I /really/ hope you're using a hash - or at least some kind of tree - and not a list for named-element arrays!).

--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
Anyone who is capable of getting themselves made President should on no account be allowed to do the job. -- The Hitchhiker's Guide to the Galaxy (Douglas Adams)
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Matthew Woehlke wrote:
> Chet Ramey wrote:
>> I'm sure there are efficiency improvements possible in the bash indexed array implementation, but sequentially accessing a data structure optimized for space and sparse arrays is never going to be as fast as a read-process loop, and that difference becomes more and more apparent the larger the array.
>
> Maybe bash should remember the last position to optimize accessing the next element?

I already took a couple of hours and implemented something like this. It will be in the next version. Sequential access performance is dramatically improved.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
Chet Ramey, ITS, CWRU c...@case.edu http://cnswww.cns.cwru.edu/~chet/
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
It seems that mapfile is OK for small numbers, but for bigger numbers it starts to consume time.

I made a little test:

    rm Xyz; unset MAPFILE # clear
    max= # set limit

    time for i in $(seq 0 $max); do echo 'Xyz' >> Xyz; done
    real 0m0.490s  user 0m0.304s  sys 0m0.124s

    time mapfile < Xyz
    real 0m0.005s  user 0m0.008s  sys 0m0.000s

    time while read line; do echo $line > /dev/null; done < Xyz
    real 0m1.124s  user 0m0.456s  sys 0m0.108s

    time for i in $(seq 0 $max); do echo echo ${MAPFILE[$i]} > /dev/null; done
    real 0m2.184s  user 0m0.976s  sys 0m0.104s

    rm Xyz; unset MAPFILE
    max=9

    time for i in $(seq 0 $max); do echo 'Xyz' >> Xyz; done
    real 0m8.204s  user 0m3.264s  sys 0m1.188s

    time mapfile < Xyz
    real 0m0.062s  user 0m0.044s  sys 0m0.000s

    time while read line; do echo $line > /dev/null; done < Xyz
    real 0m11.328s  user 0m4.500s  sys 0m1.140s

    time for i in $(seq 0 $max); do echo echo ${MAPFILE[$i]} > /dev/null; done
    real 9m52.832s  user 5m38.305s  sys 0m3.636s

At the time of testing I had sufficient free memory, no swapping, and no other time-consuming programs running.

2009/3/28 Chris F.A. Johnson c...@freeshell.org

> On Fri, 27 Mar 2009, Lennart Schultz wrote:
>> Chris,
>> I agree with you to use the right tool at the right time, and mapfile seems not to be the right tool for my problem, but I will just give you some facts of my observations:
>>
>> Using a fast tool like egrep just to find a simple string in my datafile gives the following times:
>>
>>     time egrep 'pro' > /dev/null dr.xml
>>     real 0m54.628s  user 0m27.310s  sys 0m0.036s
>>
>> My original bash script:
>>
>>     time xml2e2-loadepg
>>     real 1m53.264s  user 1m22.145s  sys 0m30.674s
>>
>> Since the questions seem to center on spawning subshells and their cost, I have checked my script: the only external command it calls is date, which in total is called a little less than 2 times. Just for this test I have changed the call of date to an assignment of a constant, and now it looks like this:
>>
>>     time xml2e2-loadepg
>>     real 1m3.826s  user 1m2.700s  sys 0m1.004s
>>
>> I also made the same change to the version of the program using mapfile, and changed line=$(echo $i) to line=${i##+([[:space:]])} so the main loop is absolutely without any subshell spawns:
>>
>>     time xml2e2-loadepg.new
>>     real 65m2.378s  user 63m16.717s  sys 0m1.124s
>
> How much of that is taken by mapfile? Time the mapfile command and the loop separately:
>
>     time mapfile < file
>     time for i in ${mapfi...@]}
>
> --
> Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
> Author: Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Lennart Schultz wrote:
> It seems that mapfile is OK for small numbers but for bigger numbers it starts to consume time.

Not exactly. Your own timing tests show that mapfile itself is blindingly fast. The time is consumed sequentially traversing the (very large) array.

Bash indexed arrays are implemented as doubly-linked lists, so accessing a single element is (if I remember my combinatorics correctly, which is unlikely) O(N) instead of O(1), and accessing the entire array sequentially by index is O(N**2).

For instance, when I factor out the command substitution, with an array with 10 elements, I get the following times:

    create the file:
    real 0m47.317s  user 0m7.197s  sys 0m10.722s

    read the file sequentially using a while loop:
    real 0m9.609s  user 0m5.650s  sys 0m3.644s

    mapfile:
    real 0m0.062s  user 0m0.049s  sys 0m0.009s

    accessing $MAPFILE sequentially:
    real 1m36.880s  user 1m24.963s  sys 0m7.161s

I'm sure there are efficiency improvements possible in the bash indexed array implementation, but sequentially accessing a data structure optimized for space and sparse arrays is never going to be as fast as a read-process loop, and that difference becomes more and more apparent the larger the array.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
Chet Ramey, ITS, CWRU c...@case.edu http://cnswww.cns.cwru.edu/~chet/
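To make the access-pattern point concrete, here is a minimal sketch that walks the same array two ways: by looking up each index in turn, and by expanding the whole array once in the for statement. The file contents, the size of 1000 lines, and the variable names are assumptions chosen for illustration, not taken from the thread. Whether the two loops differ measurably depends on the bash version, so wrap each loop in `time` on your own system before drawing conclusions.

```shell
#!/usr/bin/env bash
# Build a small sample file (size and contents are arbitrary for this sketch).
tmpfile=$(mktemp)
for ((n = 0; n < 1000; n++)); do
    printf 'line %d\n' "$n"
done > "$tmpfile"

# Read the whole file into an array, dropping trailing newlines.
mapfile -t lines < "$tmpfile"

# Approach 1: look up every element by its index.
count_by_index=0
for ((i = 0; i < ${#lines[@]}; i++)); do
    : "${lines[i]}"                      # touch the element, do nothing with it
    count_by_index=$((count_by_index + 1))
done

# Approach 2: expand the array once and iterate over the resulting words.
count_by_expansion=0
for element in "${lines[@]}"; do
    count_by_expansion=$((count_by_expansion + 1))
done

rm -f "$tmpfile"
echo "$count_by_index $count_by_expansion"
```

Both loops visit every element; only the way the element is reached differs, which is the part Chet's explanation is about.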
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
2009-03-26, 21:22(-04), Chet Ramey:
> Chris F.A. Johnson wrote:
>> Chet, how about an option to mapfile that strips leading and/or trailing spaces? Another useful option would be to remove newlines.
>
> I'm disinclined to add one, since it's easy enough to use the ${line##[ ]} and ${line%%[]} constructs to remove leading and trailing whitespace. You can use the same expansions or pattern substitution to remove newlines (using $'\n' to denote a newline).
[...]

That removes only one blank. To strip all blanks, you'd need to enable ksh extended globbing (shopt -s extglob) and do

    ${line##+([[:blank:]])}

Or POSIXly:

    ${line#${line%%[![:blank:]]*}}

Not extremely legible.

Note that read does strip leading and trailing blanks (as long as those blank characters are in IFS and as long as a variable name is provided to it), so it's not completely unreasonable to ask that readarray (aka mapfile) have an option to do that as well.

--
Stéphane
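The two expansions above can be tried side by side. This is a small sketch; the sample string and the variable names are made up for illustration, and the inner expansion is quoted here as a precaution so it is taken literally as the prefix to remove.

```shell
#!/usr/bin/env bash
shopt -s extglob    # needed for the +(...) pattern below

# Sample input: leading space/tab/spaces and two trailing spaces.
line=$' \t  hello world  '

# extglob form: +([[:blank:]]) matches one or more blanks.
stripped_ext=${line##+([[:blank:]])}

# POSIX form: first capture the leading run of blanks, then remove it as a prefix.
leading=${line%%[![:blank:]]*}
stripped_posix=${line#"$leading"}

# The same extglob pattern with %% strips trailing blanks as well.
stripped_both=${stripped_ext%%+([[:blank:]])}

printf '[%s]\n' "$stripped_ext" "$stripped_posix" "$stripped_both"
```

The first two results keep the trailing blanks; only the third removes both ends.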
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
On Thu, Mar 26, 2009 at 05:59:14PM -0400, Chris F.A. Johnson wrote:
> Chet, how about an option to mapfile that strips leading and/or trailing spaces? Another useful option would be to remove newlines.

It already has the latter:

    -t    Remove a trailing newline from each line read.
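A quick way to see what -t changes is to read the same file with and without it and compare element lengths. The temporary file and the array names `raw` and `trimmed` are assumptions for this sketch.

```shell
#!/usr/bin/env bash
tmpfile=$(mktemp)
printf 'alpha\nbeta\n' > "$tmpfile"

mapfile raw < "$tmpfile"          # each element keeps its trailing newline
mapfile -t trimmed < "$tmpfile"   # -t removes the trailing newline

raw_len=${#raw[0]}          # length of "alpha" plus the newline
trimmed_len=${#trimmed[0]}  # length of "alpha" alone

rm -f "$tmpfile"
echo "$raw_len $trimmed_len"
```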
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Chris,
I agree with you to use the right tool at the right time, and mapfile seems not to be the right tool for my problem, but I will just give you some facts of my observations:

Using a fast tool like egrep just to find a simple string in my datafile gives the following times:

    time egrep 'pro' > /dev/null dr.xml
    real 0m54.628s  user 0m27.310s  sys 0m0.036s

My original bash script:

    time xml2e2-loadepg
    real 1m53.264s  user 1m22.145s  sys 0m30.674s

Since the questions seem to center on spawning subshells and their cost, I have checked my script: the only external command it calls is date, which in total is called a little less than 2 times. Just for this test I have changed the call of date to an assignment of a constant, and now it looks like this:

    time xml2e2-loadepg
    real 1m3.826s  user 1m2.700s  sys 0m1.004s

I also made the same change to the version of the program using mapfile, and changed line=$(echo $i) to line=${i##+([[:space:]])} so the main loop is absolutely without any subshell spawns:

    time xml2e2-loadepg.new
    real 65m2.378s  user 63m16.717s  sys 0m1.124s

Lennart

2009/3/26 Chris F.A. Johnson c...@freeshell.org

> On Thu, 26 Mar 2009, Lennart Schultz wrote:
>> I have a bash script which reads about 25 lines of xml code generating about 850 files with information extracted from the xml file. It uses the construct:
>>
>>     while read line
>>     do
>>     case $line in
>>     done < file
>>
>> and this takes a little less than 2 minutes.
>>
>> Trying to use mapfile I changed the above construct to:
>>
>>     mapfile < file
>>     for i in ${mapfi...@]}
>>     do
>>     line=$(echo $i) # strip leading blanks
>>     case $line in
>>     done
>>
>> With this change the job now takes more than 48 minutes. :(
>
> As has already been suggested, the time is almost certainly taken up in the command substitution which you perform on every line.
>
> If you want to remove leading spaces, it would be better to use a single command to do that before reading with mapfile, e.g.:
>
>     mapfile < <(sed 's/^ *//' file)
>
> If you want to remove trailing spaces as well:
>
>     mapfile < <(sed -e 's/^ *//' -e 's/ *$//' file)
>
> Chet, how about an option to mapfile that strips leading and/or trailing spaces? Another useful option would be to remove newlines.
>
> --
> Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
> Do not reply to the From: address; use Reply-To:
> Author: Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
On Fri, 27 Mar 2009, Lennart Schultz wrote:
> Chris,
> I agree with you to use the right tool at the right time, and mapfile seems not to be the right tool for my problem, but I will just give you some facts of my observations:
>
> Using a fast tool like egrep just to find a simple string in my datafile gives the following times:
>
>     time egrep 'pro' > /dev/null dr.xml
>     real 0m54.628s  user 0m27.310s  sys 0m0.036s
>
> My original bash script:
>
>     time xml2e2-loadepg
>     real 1m53.264s  user 1m22.145s  sys 0m30.674s
>
> Since the questions seem to center on spawning subshells and their cost, I have checked my script: the only external command it calls is date, which in total is called a little less than 2 times. Just for this test I have changed the call of date to an assignment of a constant, and now it looks like this:
>
>     time xml2e2-loadepg
>     real 1m3.826s  user 1m2.700s  sys 0m1.004s
>
> I also made the same change to the version of the program using mapfile, and changed line=$(echo $i) to line=${i##+([[:space:]])} so the main loop is absolutely without any subshell spawns:
>
>     time xml2e2-loadepg.new
>     real 65m2.378s  user 63m16.717s  sys 0m1.124s

How much of that is taken by mapfile? Time the mapfile command and the loop separately:

    time mapfile < file
    time for i in ${mapfi...@]}

--
Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
Author: Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
On Thu, Mar 26, 2009 at 08:53:50AM +0100, Lennart Schultz wrote:
> I have a bash script which reads about 25 lines of xml code generating ...
>
>     mapfile < file
>     for i in ${mapfi...@]}
>     do
>     line=$(echo $i) # strip leading blanks
>     case $line in
>     done
>
> With this change the job now takes more than 48 minutes. :(

Oh... new builtin. New to me anyway.

A quarter of a million subshells (the $(echo) part) are probably the reason for the slowness, not the array traversal (unless holding that much data in memory is causing your system to thrash).

> It may be that I am new to mapfiles, and there are more efficient ways to traverse a mapfile array, but if this is the case please document it.

    for element in ${arr...@]}
    for index in ${!array[*]}

are probably about the same. I haven't actually benchmarked them.

> please introduce an option to strip leading blanks so mapfile acts like readline so constructions like:
>
>     line=$(echo $i) # strip leading blanks
>
> above can be avoided.

Huh... most people go out of their way to get the opposite behavior when using read. Typically, we have to throw in IFS= and -r just to get read to act the way you *don't* want. Ironic.

If you want to strip leading blanks without a subshell, you can do it this way:

    shopt -s extglob
    line=${i##+([[:space:]])}

However, given the way you're stating your requirements, it seems you'd actually prefer just using read:

    unset array i
    while read -r line; do
        array[i++]=$line
    done

This will avoid the need to strip leading blanks yourself (read will do that), and it also doesn't use any subshells.
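The read-based loop can be checked against a small input. In this sketch the temporary file and its XML-like contents are invented for illustration; the loop itself follows the shape suggested above, with the array and counter initialised explicitly and the input redirected from the sample file.

```shell
#!/usr/bin/env bash
tmpfile=$(mktemp)
# Three lines with varying leading and trailing blanks.
printf '   <a>\n\t<b>  \n<c>\n' > "$tmpfile"

array=()
i=0
# With the default IFS, read strips leading and trailing blanks from each line;
# -r keeps backslashes literal.
while read -r line; do
    array[i++]=$line
done < "$tmpfile"

rm -f "$tmpfile"
printf '[%s]\n' "${array[@]}"
```

No command substitution is involved, so the loop runs entirely inside the current shell.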
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Lennart Schultz wrote:
> Bash Version: 4.0
> Patch Level: 10
> Release Status: release
>
> Description:
> I have a bash script which reads about 25 lines of xml code generating about 850 files with information extracted from the xml file. It uses the construct:
>
>     while read line
>     do
>     case $line in
>     done < file
>
> and this takes a little less than 2 minutes.
>
> Trying to use mapfile I changed the above construct to:
>
>     mapfile < file
>     for i in ${mapfi...@]}
>     do
>     line=$(echo $i) # strip leading blanks
>     case $line in
>     done
>
> With this change the job now takes more than 48 minutes. :(

The most important thing is using the right tool for the job. If you have to introduce a command substitution for each line read with mapfile, you probably don't have the problem mapfile is intended to solve: quickly reading exact copies of lines from a file descriptor into an array.

If another approach works better, you should use it.

If you're interested in why the mapfile solution is slower, you could run the loop using a version of bash built for profiling and check where the time goes. I believe you'd find that the command substitution is responsible for much of it, and the rest is due to the significant increase in memory usage resulting from the 25-line array (which also slows down fork and process creation).

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
Chet Ramey, ITS, CWRU c...@case.edu http://cnswww.cns.cwru.edu/~chet/
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
On Thu, 26 Mar 2009, Lennart Schultz wrote:
> I have a bash script which reads about 25 lines of xml code generating about 850 files with information extracted from the xml file. It uses the construct:
>
>     while read line
>     do
>     case $line in
>     done < file
>
> and this takes a little less than 2 minutes.
>
> Trying to use mapfile I changed the above construct to:
>
>     mapfile < file
>     for i in ${mapfi...@]}
>     do
>     line=$(echo $i) # strip leading blanks
>     case $line in
>     done
>
> With this change the job now takes more than 48 minutes. :(

As has already been suggested, the time is almost certainly taken up in the command substitution which you perform on every line.

If you want to remove leading spaces, it would be better to use a single command to do that before reading with mapfile, e.g.:

    mapfile < <(sed 's/^ *//' file)

If you want to remove trailing spaces as well:

    mapfile < <(sed -e 's/^ *//' -e 's/ *$//' file)

Chet, how about an option to mapfile that strips leading and/or trailing spaces? Another useful option would be to remove newlines.

--
Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
Do not reply to the From: address; use Reply-To:
Author: Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
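As a runnable illustration of the sed suggestion, the sketch below strips leading and trailing spaces in one sed process and feeds the result to mapfile through process substitution. The sample file contents and the array name `cleaned` are hypothetical; -t is added here so the elements do not carry their newlines.

```shell
#!/usr/bin/env bash
tmpfile=$(mktemp)
printf '  <title>News</title>  \n    <desc>Weather</desc>\n' > "$tmpfile"

# One sed process cleans every line; mapfile then reads the cleaned stream.
mapfile -t cleaned < <(sed -e 's/^ *//' -e 's/ *$//' "$tmpfile")

rm -f "$tmpfile"
printf '[%s]\n' "${cleaned[@]}"
```

The cost of the external command is paid once for the whole file rather than once per line, which is the point of the suggestion.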
Re: using mapfile is extreamly slow compared to oldfashinod ways to read files
Chris F.A. Johnson wrote:
> Chet, how about an option to mapfile that strips leading and/or trailing spaces? Another useful option would be to remove newlines.

I'm disinclined to add one, since it's easy enough to use the ${line##[ ]} and ${line%%[]} constructs to remove leading and trailing whitespace. You can use the same expansions or pattern substitution to remove newlines (using $'\n' to denote a newline).

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
Chet Ramey, ITS, CWRU c...@case.edu http://cnswww.cns.cwru.edu/~chet/
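For the newline part of that reply, a short sketch of both techniques mentioned: a suffix-removal expansion for a single trailing newline, and pattern substitution for every newline in a value. The sample strings and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# A value as mapfile stores it without -t: text plus a trailing newline.
line=$'some text\n'

# Suffix removal: drop one trailing newline, if present.
no_trailing_nl=${line%$'\n'}

# Pattern substitution: delete every newline in the value.
multi=$'a\nb\nc\n'
no_nl_at_all=${multi//$'\n'/}

printf '[%s]\n' "$no_trailing_nl" "$no_nl_at_all"
```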