Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-04-06 Thread Matthew Woehlke

Chet Ramey wrote:

I'm sure there are efficiency improvements possible in the bash indexed
array implementation, but sequentially accessing a data structure
optimized for space and sparse arrays is never going to be as fast as
a read-process loop, and that difference becomes more and more apparent
the larger the array.


Maybe bash should remember the last position to optimize accessing the 
next element?


There are also always hash tables, which are a bit more expensive in 
memory use, but would provide faster lookups (and I /really/ hope you're 
using a hash - or at least some kind of tree - and not a list for 
named-element arrays!).


--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
Anyone who is capable of getting themselves made President should on no 
account be allowed to do the job. -- The Hitchhiker's Guide to the 
Galaxy (Douglas Adams)






Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-04-06 Thread Chet Ramey
Matthew Woehlke wrote:
 Chet Ramey wrote:
 I'm sure there are efficiency improvements possible in the bash indexed
 array implementation, but sequentially accessing a data structure
 optimized for space and sparse arrays is never going to be as fast as
 a read-process loop, and that difference becomes more and more apparent
 the larger the array.
 
 Maybe bash should remember the last position to optimize accessing the
 next element?

I already took a couple of hours and implemented something like this.  It
will be in the next version.  Sequential access performance is
dramatically improved.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer

Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-28 Thread Lennart Schultz
It seems that mapfile is OK for small numbers, but for bigger numbers it
starts to consume time.

I made a little test:

rm Xyz; unset MAPFILE # clear
max=9999  # set limit
time for i in $(seq 0 $max); do echo 'Xyz' >> Xyz; done
real0m0.490s
user0m0.304s
sys 0m0.124s

time mapfile < Xyz

real0m0.005s
user0m0.008s
sys 0m0.000s

time while read line; do echo $line > /dev/null; done < Xyz
real0m1.124s
user0m0.456s
sys 0m0.108s

time for i in $(seq 0 $max); do echo ${MAPFILE[$i]} > /dev/null; done

real0m2.184s
user0m0.976s
sys 0m0.104s

rm Xyz; unset MAPFILE
max=99999

time for i in $(seq 0 $max); do echo 'Xyz' >> Xyz; done

real0m8.204s
user0m3.264s
sys 0m1.188s

time mapfile < Xyz

real0m0.062s
user0m0.044s
sys 0m0.000s

time while read line; do echo $line > /dev/null; done < Xyz
real0m11.328s
user0m4.500s
sys 0m1.140s

time for i in $(seq 0 $max); do echo ${MAPFILE[$i]} > /dev/null; done

real9m52.832s
user5m38.305s
sys 0m3.636s


At the time of testing I had plenty of free memory, no swapping, and no
other time-consuming programs running.


2009/3/28 Chris F.A. Johnson <c...@freeshell.org>

 On Fri, 27 Mar 2009, Lennart Schultz wrote:

 Chris,
 I agree with you that one should use the right tool for the job, and mapfile
 seems not to be the right tool for my problem, but I will just give you some
 facts from my observations:

 using a fast tool like egrep just to find a simple string in my datafile
 gives the following times:

 time egrep 'pro' > /dev/null < dr.xml

 real0m54.628s
 user0m27.310s
 sys 0m0.036s

 My original bash script :

 time xml2e2-loadepg

 real1m53.264s
 user1m22.145s
 sys 0m30.674s

 While the discussion seems to center on spawning subshells and their cost, I
 have checked my script: the only external command it calls is date, which in
 total is called a little less than 250000 times. I have just for this test
 changed the call of date to an assignment of a constant, and now it looks:

 time xml2e2-loadepg

 real1m3.826s
 user1m2.700s
 sys 0m1.004s

 I also made the same change to the version of the program using mapfile, and
 changed line=$(echo $i) to
 line=${i##+([[:space:]])}
 so the main loop is absolutely without any subshell spawns:

 time xml2e2-loadepg.new

 real65m2.378s
 user63m16.717s
 sys 0m1.124s


   How much of that is taken by mapfile? Time the mapfile command and
   the loop separately:

time mapfile < file
time for i in "${MAPFILE[@]}"

 --
   Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
   ===

   Author:
   Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)



Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-28 Thread Chet Ramey
Lennart Schultz wrote:
 It seems that mapfile is OK for small numbers, but for bigger numbers it
 starts to consume time.

Not exactly.  Your own timing tests show that mapfile itself is blindingly
fast.  The time is consumed sequentially traversing the (very large) array.

Bash indexed arrays are implemented as doubly-linked lists, so accessing
a single element is (if I remember my combinatorics correctly, which is
unlikely) O(N) instead of O(1), and accessing the entire array
sequentially is O(N**2).

For instance, when I factor out the command substitution, with an array
with 100000 elements, I get the following times:

create the file:
real0m47.317s
user0m7.197s
sys 0m10.722s

read the file sequentially using a while loop:
real0m9.609s
user0m5.650s
sys 0m3.644s

mapfile:
real0m0.062s
user0m0.049s
sys 0m0.009s

accessing $MAPFILE sequentially:
real1m36.880s
user1m24.963s
sys 0m7.161s
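That quadratic blowup is about how the array is walked, not what it holds. The sketch below (invented for illustration, with a size small enough to finish instantly) contrasts per-index lookups, which cost O(N) apiece on a linked-list implementation, with a single "${MAPFILE[@]}" expansion that traverses the list once:

```shell
#!/bin/bash
# Sketch: load a file with mapfile, then walk the array two ways.
# On bash versions with linked-list arrays, the indexed loop does
# an O(N) list traversal per lookup (O(N^2) total), while expanding
# "${MAPFILE[@]}" walks the list only once.
n=1000
tmp=$(mktemp)
for ((i = 0; i < n; i++)); do printf 'line %d\n' "$i"; done > "$tmp"

mapfile -t < "$tmp"              # -t drops the trailing newlines
echo "elements: ${#MAPFILE[@]}"

indexed=0
for ((i = 0; i < n; i++)); do    # each ${MAPFILE[i]} is a fresh lookup
  [[ ${MAPFILE[i]} == "line $i" ]] && ((indexed++))
done
echo "indexed matches: $indexed"

expanded=0
for line in "${MAPFILE[@]}"; do  # one traversal for the whole array
  ((expanded++))
done
echo "expanded count: $expanded"
rm -f -- "$tmp"
```

Timing the two loops with larger n makes the difference in growth rates visible.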


I'm sure there are efficiency improvements possible in the bash indexed
array implementation, but sequentially accessing a data structure
optimized for space and sparse arrays is never going to be as fast as
a read-process loop, and that difference becomes more and more apparent
the larger the array.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer

Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-27 Thread Stephane CHAZELAS
2009-03-26, 21:22(-04), Chet Ramey:
 Chris F.A. Johnson wrote:

Chet, how about an option to mapfile that strips leading and/or
trailing spaces?
 
Another useful option would be to remove newlines.

 I'm disinclined to add one, since it's easy enough to use the
 ${line##[$' \t']} and ${line%%[$' \t']} constructs to remove
 leading and trailing whitespace.  You can use the same expansions
 or pattern substitution to remove newlines (using $'\n' to denote
 a newline).
[...]

That removes only one blank; to strip all blanks, you'd need to
enable ksh extended globbing (shopt -s extglob) and do

${line##+([[:blank:]])}

Or POSIXly:

${line#${line%%[![:blank:]]*}}

Not extremely legible.
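Both spellings can be exercised side by side; in this sketch the sample line is invented, and the inner expansion is quoted for safety (the bare one-liner omits the quotes):

```shell
#!/bin/bash
line=$'  \t  hello world'

# ksh-style extended glob: one-or-more leading blanks in a single pattern.
shopt -s extglob
a=${line##+([[:blank:]])}

# POSIX form: ${line%%[![:blank:]]*} yields the leading-blank prefix
# itself (everything from the first non-blank onward is stripped from
# the end), so removing that prefix from the front trims the line.
b=${line#"${line%%[![:blank:]]*}"}

printf '[%s]\n' "$a" "$b"   # prints [hello world] twice
```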

Note that read does strip leading and trailing blanks (as long
as those blank characters are in IFS and as long as a variable
name is provided to it), so it's not completely unreasonable to
ask that readarray (aka mapfile) have an option to do that as
well.

-- 
Stéphane


Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-27 Thread Greg Wooledge
On Thu, Mar 26, 2009 at 05:59:14PM -0400, Chris F.A. Johnson wrote:
Chet, how about an option to mapfile that strips leading and/or
trailing spaces?
 
Another useful option would be to remove newlines.

It already has the latter:

  -t    Remove a trailing newline from each line read.
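Worth noting that -t only trims the line terminator; the leading blanks that started this thread survive it. A quick sketch with invented sample data:

```shell
#!/bin/bash
tmp=$(mktemp)
printf '  alpha\n  beta\n' > "$tmp"

mapfile -t arr < "$tmp"       # -t: strip the trailing newline per line
printf '[%s]\n' "${arr[@]}"   # prints [  alpha] then [  beta]

rm -f -- "$tmp"
```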




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-27 Thread Lennart Schultz
Chris,
I agree with you that one should use the right tool for the job, and mapfile seems
not to be the right tool for my problem, but I will just give you some facts
from my observations:

using a fast tool like egrep just to find a simple string in my datafile
gives the following times:

time egrep 'pro' > /dev/null < dr.xml

real0m54.628s
user0m27.310s
sys 0m0.036s

My original bash script :

time xml2e2-loadepg

real1m53.264s
user1m22.145s
sys 0m30.674s

While the discussion seems to center on spawning subshells and their cost, I
have checked my script: the only external command it calls is date, which in
total is called a little less than 250000 times. I have just for this test
changed the call of date to an assignment of a constant, and now it looks:

time xml2e2-loadepg

real1m3.826s
user1m2.700s
sys 0m1.004s

I also made the same change to the version of the program using mapfile, and
changed line=$(echo $i) to
line=${i##+([[:space:]])}
so the main loop is absolutely without any subshell spawns:

time xml2e2-loadepg.new

real65m2.378s
user63m16.717s
sys 0m1.124s



Lennart


2009/3/26 Chris F.A. Johnson <c...@freeshell.org>

 On Thu, 26 Mar 2009, Lennart Schultz wrote:

 I have a bash script which reads about 250000 lines of xml code generating
 about 850 files with information extracted from the xml file.
 It uses the construct:

 while read line
 do
  case $line in
  
 done < file

 and this takes a little less than 2 minutes

 Trying to use mapfile I changed the above construct to:

 mapfile < file
 for i in "${MAPFILE[@]}"
 do
  line=$(echo $i) # strip leading blanks
  case $line in
  
 done

 With this change the job now takes more than 48 minutes. :(


   As has already been suggested, the time is almost certainly taken
   up in the command substitution which you perform on every line.

   If you want to remove leading spaces, it would be better to use a
   single command to do that before reading with mapfile, e.g.:

 mapfile < <(sed 's/^ *//' file)

   If you want to remove trailing spaces as well:

 mapfile < <(sed -e 's/^ *//' -e 's/ *$//' file)

   Chet, how about an option to mapfile that strips leading and/or
   trailing spaces?

   Another useful option would be to remove newlines.

 --
   Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
   = Do not reply to the From: address; use Reply-To: 
   Author:
   Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)



Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-27 Thread Chris F.A. Johnson

On Fri, 27 Mar 2009, Lennart Schultz wrote:


Chris,
I agree with you that one should use the right tool for the job, and mapfile seems
not to be the right tool for my problem, but I will just give you some facts
from my observations:

using a fast tool like egrep just to find a simple string in my datafile
gives the following times:

time egrep 'pro' > /dev/null < dr.xml

real0m54.628s
user0m27.310s
sys 0m0.036s

My original bash script :

time xml2e2-loadepg

real1m53.264s
user1m22.145s
sys 0m30.674s

While the discussion seems to center on spawning subshells and their cost, I
have checked my script: the only external command it calls is date, which in
total is called a little less than 250000 times. I have just for this test
changed the call of date to an assignment of a constant, and now it looks:

time xml2e2-loadepg

real1m3.826s
user1m2.700s
sys 0m1.004s

I also made the same change to the version of the program using mapfile, and
changed line=$(echo $i) to
line=${i##+([[:space:]])}
so the main loop is absolutely without any subshell spawns:

time xml2e2-loadepg.new

real65m2.378s
user63m16.717s
sys 0m1.124s


   How much of that is taken by mapfile? Time the mapfile command and
   the loop separately:

time mapfile < file
time for i in "${MAPFILE[@]}"

--
   Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
   ===
   Author:
   Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-26 Thread Greg Wooledge
On Thu, Mar 26, 2009 at 08:53:50AM +0100, Lennart Schultz wrote:
 I have a bash script which reads about 250000 lines of xml code generating
...

 mapfile < file
 for i in "${MAPFILE[@]}"
 do
line=$(echo $i) # strip leading blanks
case $line in

 done
 
 With this change the job now takes more than 48 minutes. :(

Oh... new builtin.  New to me anyway.

A quarter of a million subshells (the $(echo) part) are probably the
reason for the slowness, not the array traversal (unless holding that
much data in memory is causing your system to thrash).

 It may be that I am new to mapfiles, and there are more efficient ways to
 traverse a mapfile array, but if this the case please document it.

 for element in "${array[@]}"
 for index in ${!array[*]}

are probably about the same.  I haven't actually benchmarked them.

 please introduce an option to strip leading blanks so mapfile acts like
 read, so constructions like:
 line=$(echo $i) # strip leading blanks
 above can be avoided.

Huh... most people go out of their way to get the opposite behavior
when using read.  Typically, we have to throw in IFS= and -r just
to get read to act the way you *don't* want.  Ironic.

If you want to strip leading blanks without a subshell, you can do it
this way:

 shopt -s extglob
 line=${i##+([[:space:]])}

However, given the way you're stating your requirements, it seems you'd
actually prefer just using read:

 unset array i
 while read -r line; do
   array[i++]=$line
 done < file

This will avoid the need to strip leading blanks yourself (read will
do that), and also doesn't use any subshells.
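Run against a small invented sample, that loop behaves as described: with the default IFS, read itself trims the leading and trailing blanks, and no subshell is ever forked:

```shell
#!/bin/bash
tmp=$(mktemp)
printf '   foo\n\tbar  \n' > "$tmp"

unset array i
while read -r line; do    # default IFS trims leading/trailing whitespace
  array[i++]=$line
done < "$tmp"

printf '[%s]\n' "${array[@]}"   # prints [foo] then [bar]
rm -f -- "$tmp"
```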




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-26 Thread Chet Ramey
Lennart Schultz wrote:

 Bash Version: 4.0
 Patch Level: 10
 Release Status: release
 
 Description:
 
 I have a bash script which reads about 250000 lines of xml code generating
 about 850 files with information extracted from the xml file.
 It uses the construct:
 
 while read line
 do
case $line in

 done < file
 
 and this takes a little less than 2 minutes
 
 Trying to use mapfile I changed the above construct to:
 
 mapfile < file
 for i in "${MAPFILE[@]}"
 do
line=$(echo $i) # strip leading blanks
case $line in

 done
 
 With this change the job now takes more than 48 minutes. :(

The most important thing is using the right tool for the job.  If you
have to introduce a command substitution for each line read with mapfile,
you probably don't have the problem mapfile is intended to solve:
quickly reading exact copies of lines from a file descriptor into an
array.

If another approach works better, you should use it.

If you're interested in why the mapfile solution is slower, you could
run the loop using a version of bash built for profiling and check
where the time goes.  I believe you'd find that the command substitution
is responsible for much of it, and the rest is due to the significant
increase in memory usage resulting from the 250000-line array (which
also slows down fork and process creation).

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer

Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-26 Thread Chris F.A. Johnson

On Thu, 26 Mar 2009, Lennart Schultz wrote:


I have a bash script which reads about 250000 lines of xml code generating
about 850 files with information extracted from the xml file.
It uses the construct:

while read line
do
  case $line in
  
done < file

and this takes a little less than 2 minutes

Trying to use mapfile I changed the above construct to:

mapfile < file
for i in "${MAPFILE[@]}"
do
  line=$(echo $i) # strip leading blanks
  case $line in
  
done

With this change the job now takes more than 48 minutes. :(


   As has already been suggested, the time is almost certainly taken
   up in the command substitution which you perform on every line.

   If you want to remove leading spaces, it would be better to use a
   single command to do that before reading with mapfile, e.g.:

mapfile < <(sed 's/^ *//' file)

   If you want to remove trailing spaces as well:

mapfile < <(sed -e 's/^ *//' -e 's/ *$//' file)
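Chris's suggestion can be written out as a runnable sketch; the redirection is process substitution, `< <(...)`. The sample data is invented for the demo, `[[:blank:]]` is used so tabs are covered too, and `-t` is added so the stored lines carry no trailing newline:

```shell
#!/bin/bash
tmp=$(mktemp)
printf '  one  \n\ttwo\t\n' > "$tmp"

# Strip blanks once, in a single external sed, before mapfile loads the
# lines: one process total, instead of a $(...) subshell on every line.
mapfile -t lines < <(sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//' "$tmp")

printf '[%s]\n' "${lines[@]}"   # prints [one] then [two]
rm -f -- "$tmp"
```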

   Chet, how about an option to mapfile that strips leading and/or
   trailing spaces?

   Another useful option would be to remove newlines.

--
   Chris F.A. Johnson, webmaster http://woodbine-gerrard.com
   = Do not reply to the From: address; use Reply-To: 
   Author:
   Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)




Re: using mapfile is extremely slow compared to old-fashioned ways to read files

2009-03-26 Thread Chet Ramey
Chris F.A. Johnson wrote:

Chet, how about an option to mapfile that strips leading and/or
trailing spaces?
 
Another useful option would be to remove newlines.

I'm disinclined to add one, since it's easy enough to use the
${line##[$' \t']} and ${line%%[$' \t']} constructs to remove
leading and trailing whitespace.  You can use the same expansions
or pattern substitution to remove newlines (using $'\n' to denote
a newline).

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer

Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/