Re: HEAD: Deterioration in ByteString I/O

2010-09-09 Thread Simon Marlow

On 09/09/2010 00:28, Daniel Fischer wrote:

On Wednesday 08 September 2010 23:55:35, Don Stewart wrote:

simonpj:

|  ghc-6.12.3:
|89,330,672 bytes allocated in the heap
|15,092 bytes copied during GC
|35,980 bytes maximum residency (1 sample(s))
|29,556 bytes maximum slop
| 2 MB total memory in use (0 MB lost due to
|  fragmentation)
|
|  ghc-6.13.20100831:
|   475,305,720 bytes allocated in the heap
|89,272 bytes copied during GC
|68,860 bytes maximum residency (1 sample(s))
|29,444 bytes maximum slop
| 2 MB total memory in use (0 MB lost due to
|  fragmentation)
|
|  Can you put your benchmark code somewhere?  Likely a GHC
| regression.

Indeed bad. If someone could characterise the regression more
precisely (e.g. fusion isn't happening here) that would be jolly
helpful.


Shouldn't be fusion. Is this a straight IO function. Something to do
with buffering/encoding?


Maybe the following observation helps:

ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.

If I understand correctly, that means (since defaultChunkSize = 32760)
- bytestring allocates a 32K buffer to be filled and asks ghc for 32760
bytes in that buffer
- ghc asks the OS for 8192 bytes (and usually gets them)
- upon receiving fewer bytes than requested, bytestring copies them to a
new smaller buffer
- since the number of bytes received is a multiple of ghc's allocation
block size (which I believe is 4K), there's no space for the bookkeeping
overhead, hence the new buffer takes up 12K instead of 8, resulting in 44K
allocation for 8K bytes

That factor of 5.5 corresponds pretty well with the allocation figures
above, and the extra copying explains the approximate doubling of I/O time.

Trying to find out why ghc asks the OS for only 8192 bytes instead of 32760
hasn't brought enlightenment yet.


I think I've found the problem, GHC.IO.Handle.Text:

bufReadNBEmpty :: Handle__ - Buffer Word8 - Ptr Word8 - Int - Int - 
IO Int

bufReadNBEmpty   h...@handle__{..}
 b...@buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
 ptr so_far count
  | count  sz, False,
Just fd - cast haDevice = do
   m - RawIO.readNonBlocking (fd::FD) ptr count
   case m of
 Nothing - return so_far
 Just n  - return (so_far + n)


See if you can spot it.  I must have made this change for debugging, and 
forgot to revert it. d'oh!


Cheers,
Simon




Cheers,
Daniel

Excerpt of strace log:

read(3, %!PS-Adobe-2.0\n%%Title: nbench\n%..., 8192) = 8192
open(/usr/lib/gconv/UTF-32.so, O_RDONLY) = 4
read(4,
\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\4\0\0004\0\0\0..., 512)
= 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=9672, ...}) = 0
mmap2(NULL, 12328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) =
0xb7852000
fadvise64(4, 0, 12328, POSIX_FADV_WILLNEED) = 0
mmap2(0xb7854000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|
MAP_DENYWRITE, 4, 0x1) = 0xb7854000
close(4)= 0
mprotect(0xb7854000, 4096, PROT_READ)   = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbff37cb0) = -1 ENOTTY
(Inappropriate ioctl for device)
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, %!PS-Adobe-2.0\n%%Title: nbench\n%..., 8192) = 8192
read(3,  20.00 lineto\n121.153524 20, 8192) = 8192
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1,  20.00 lineto\n121.153524 20, 8192) = 8192
read(3, 30.542315 21.394403 lineto\n125.3..., 8192) = 8192
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, 30.542315 21.394403 lineto\n125.3..., 8192) = 8192
read(3, neto\n308.929337 21.969871 lineto..., 8192) = 8192
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-09 Thread Daniel Fischer
On Thursday 09 September 2010 13:19:23, Simon Marlow wrote:
 I think I've found the problem, GHC.IO.Handle.Text:

 bufReadNBEmpty :: Handle__ - Buffer Word8 - Ptr Word8 - Int - Int -
 IO Int
 bufReadNBEmpty   h...@handle__{..}
                   b...@buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
                   ptr so_far count
    | count  sz, False,
      Just fd - cast haDevice = do
         m - RawIO.readNonBlocking (fd::FD) ptr count
         case m of
           Nothing - return so_far
           Just n  - return (so_far + n)


 See if you can spot it.

Yes, that's it. Removing the literal False to make that branch reachable 
more or less reinstates old behaviour.

For I/O of (lazy) ByteStrings only, the allocation figures of HEAD are 
consistently slightly higher than those of 6.12.3, but the difference is 
less than 1%, well within the normal fluctuation due to changes in 
implementation. Timings seem to be identical.

When performing work on the ByteStrings (finding/replacing substrings), 
however, things change a bit.
The allocation figures observed so far range from almost identical ( 1% 
difference) to above 15% higher (90,146,508 bytes allocated vs. 
106,237,456), most of the differences I observed so far are between 5% and 
10%.
The wall clock time (elapsed, per +RTS -s or time) seems to be identical 
(very stable timings for multiple runs of the same benchmark), but the MUT 
times reported by +RTS -s differ for some benchmarks (between 10% less for 
HEAD and 20% more observed, but identical for most).

That might be worthy of examination, though it's not alarming.

Cheers,
Daniel

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-09 Thread Simon Marlow

On 09/09/2010 15:08, Daniel Fischer wrote:

On Thursday 09 September 2010 13:19:23, Simon Marlow wrote:

I think I've found the problem, GHC.IO.Handle.Text:

bufReadNBEmpty :: Handle__ -  Buffer Word8 -  Ptr Word8 -  Int -  Int -
IO Int
bufReadNBEmpty   h...@handle__{..}
   b...@buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
   ptr so_far count
| count  sz, False,
  Just fd- cast haDevice = do
 m- RawIO.readNonBlocking (fd::FD) ptr count
 case m of
   Nothing -  return so_far
   Just n  -  return (so_far + n)


See if you can spot it.


Yes, that's it. Removing the literal False to make that branch reachable
more or less reinstates old behaviour.

For I/O of (lazy) ByteStrings only, the allocation figures of HEAD are
consistently slightly higher than those of 6.12.3, but the difference is
less than 1%, well within the normal fluctuation due to changes in
implementation. Timings seem to be identical.

When performing work on the ByteStrings (finding/replacing substrings),
however, things change a bit.
The allocation figures observed so far range from almost identical (  1%
difference) to above 15% higher (90,146,508 bytes allocated vs.
106,237,456), most of the differences I observed so far are between 5% and
10%.
The wall clock time (elapsed, per +RTS -s or time) seems to be identical
(very stable timings for multiple runs of the same benchmark), but the MUT
times reported by +RTS -s differ for some benchmarks (between 10% less for
HEAD and 20% more observed, but identical for most).

That might be worthy of examination, though it's not alarming.


If you can find an example that consistently allocates significantly 
more than with 6.12 (10%+), then I'd say it would be worth looking into.


Cheers,
Simon

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Don Stewart
daniel.is.fischer:
 Trying out HEAD (specifically, ghc-6.13.20100831-src.tar.bz2 built with 
 6.12.3) investigating an issue with the text package, I found that I/O of 
 ByteStrings has become significantly slower (on my machine at least:
 
 $ uname -a
 Linux linux-mkk1 2.6.27.48-0.2-pae #1 SMP 2010-07-29 20:06:52 +0200 i686 
 i686 i386 GNU/Linux
 
 Pentium 4, 3.06GHz).
 
 Timings for reading and outputting a 74.3MB file:
 
 cat:
 $ time cat bigfile  /dev/null
 0.00user 0.04system 0:00.06elapsed 83%CPU
 
 ghc-6.12.3:
 $ time ./nbench lazyBSNull bigfile a b  /dev/null
 0.01user 0.09system 0:00.10elapsed 100%CPU
 
 ghc-6.13.20100831:
 $ time ./hdbench lazyBSNull bigfile a b  /dev/null
 0.07user 0.10system 0:00.18elapsed 96%CPU
 
 In addition to the slowdown, the allocation behaviour has become quite bad:
 
 ghc-6.12.3:
   89,330,672 bytes allocated in the heap
   15,092 bytes copied during GC
   35,980 bytes maximum residency (1 sample(s))
   29,556 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
 
 ghc-6.13.20100831:
  475,305,720 bytes allocated in the heap
   89,272 bytes copied during GC
   68,860 bytes maximum residency (1 sample(s))
   29,444 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Can you put your benchmark code somewhere?  Likely a GHC regression.

-- Don
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Daniel Fischer
On Wednesday 08 September 2010 18:10:26, Don Stewart wrote:
 Can you put your benchmark code somewhere?

Boiled down to the bare minimum,

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
(file : _) - getArgs
L.readFile file = L.putStr


Then all you need is a file of nontrivial size (a few 10KB is enough to 
show it).

   Likely a GHC regression.

That's what I think.
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


RE: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Simon Peyton-Jones
|   ghc-6.12.3:
| 89,330,672 bytes allocated in the heap
| 15,092 bytes copied during GC
| 35,980 bytes maximum residency (1 sample(s))
| 29,556 bytes maximum slop
|  2 MB total memory in use (0 MB lost due to fragmentation)
|  
|   ghc-6.13.20100831:
|475,305,720 bytes allocated in the heap
| 89,272 bytes copied during GC
| 68,860 bytes maximum residency (1 sample(s))
| 29,444 bytes maximum slop
|  2 MB total memory in use (0 MB lost due to fragmentation)
|  
|  Can you put your benchmark code somewhere?  Likely a GHC regression.

Indeed bad. If someone could characterise the regression more precisely (e.g. 
fusion isn't happening here) that would be jolly helpful. 

Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Don Stewart
simonpj:
 |   ghc-6.12.3:
 | 89,330,672 bytes allocated in the heap
 | 15,092 bytes copied during GC
 | 35,980 bytes maximum residency (1 sample(s))
 | 29,556 bytes maximum slop
 |  2 MB total memory in use (0 MB lost due to fragmentation)
 |  
 |   ghc-6.13.20100831:
 |475,305,720 bytes allocated in the heap
 | 89,272 bytes copied during GC
 | 68,860 bytes maximum residency (1 sample(s))
 | 29,444 bytes maximum slop
 |  2 MB total memory in use (0 MB lost due to fragmentation)
 |  
 |  Can you put your benchmark code somewhere?  Likely a GHC regression.
 
 Indeed bad. If someone could characterise the regression more
 precisely (e.g. fusion isn't happening here) that would be jolly
 helpful. 

Shouldn't be fusion. Is this a straight IO function. Something to do
with buffering/encoding? 
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Daniel Fischer
On Wednesday 08 September 2010 23:55:35, Don Stewart wrote:
 simonpj:
  |   ghc-6.12.3:
  | 89,330,672 bytes allocated in the heap
  | 15,092 bytes copied during GC
  | 35,980 bytes maximum residency (1 sample(s))
  | 29,556 bytes maximum slop
  |  2 MB total memory in use (0 MB lost due to
  |   fragmentation)
  |  
  |   ghc-6.13.20100831:
  |475,305,720 bytes allocated in the heap
  | 89,272 bytes copied during GC
  | 68,860 bytes maximum residency (1 sample(s))
  | 29,444 bytes maximum slop
  |  2 MB total memory in use (0 MB lost due to
  |   fragmentation)
  |
  |  Can you put your benchmark code somewhere?  Likely a GHC
  | regression.
 
  Indeed bad. If someone could characterise the regression more
  precisely (e.g. fusion isn't happening here) that would be jolly
  helpful.

 Shouldn't be fusion. Is this a straight IO function. Something to do
 with buffering/encoding?

Maybe the following observation helps:

ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.

If I understand correctly, that means (since defaultChunkSize = 32760)
- bytestring allocates a 32K buffer to be filled and asks ghc for 32760 
bytes in that buffer
- ghc asks the OS for 8192 bytes (and usually gets them)
- upon receiving fewer bytes than requested, bytestring copies them to a 
new smaller buffer
- since the number of bytes received is a multiple of ghc's allocation 
block size (which I believe is 4K), there's no space for the bookkeeping 
overhead, hence the new buffer takes up 12K instead of 8, resulting in 44K 
allocation for 8K bytes

That factor of 5.5 corresponds pretty well with the allocation figures 
above, and the extra copying explains the approximate doubling of I/O time.

Trying to find out why ghc asks the OS for only 8192 bytes instead of 32760 
hasn't brought enlightenment yet.

Cheers,
Daniel

Excerpt of strace log:

read(3, %!PS-Adobe-2.0\n%%Title: nbench\n%..., 8192) = 8192
open(/usr/lib/gconv/UTF-32.so, O_RDONLY) = 4
read(4, 
\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\4\0\0004\0\0\0..., 512) 
= 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=9672, ...}) = 0
mmap2(NULL, 12328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 
0xb7852000
fadvise64(4, 0, 12328, POSIX_FADV_WILLNEED) = 0
mmap2(0xb7854000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|
MAP_DENYWRITE, 4, 0x1) = 0xb7854000
close(4)= 0
mprotect(0xb7854000, 4096, PROT_READ)   = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbff37cb0) = -1 ENOTTY 
(Inappropriate ioctl for device)
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, %!PS-Adobe-2.0\n%%Title: nbench\n%..., 8192) = 8192
read(3,  20.00 lineto\n121.153524 20, 8192) = 8192
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1,  20.00 lineto\n121.153524 20, 8192) = 8192
read(3, 30.542315 21.394403 lineto\n125.3..., 8192) = 8192
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, , 0) = 0
select(2, [], [1], NULL, {0, 0})= 1 (out [1], left {0, 0})
write(1, 30.542315 21.394403 lineto\n125.3..., 8192) = 8192
read(3, neto\n308.929337 21.969871 lineto..., 8192) = 8192
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: HEAD: Deterioration in ByteString I/O

2010-09-08 Thread Daniel Fischer
On Thursday 09 September 2010 01:28:04, Daniel Fischer wrote:
 Maybe the following observation helps:

 ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.

 If I understand correctly, that means (since defaultChunkSize = 32760)
 - bytestring allocates a 32K buffer to be filled and asks ghc for 32760
 bytes in that buffer
 - ghc asks the OS for 8192 bytes (and usually gets them)
 - upon receiving fewer bytes than requested, bytestring copies them to a
 new smaller buffer
 - since the number of bytes received is a multiple of ghc's allocation
 block size (which I believe is 4K), there's no space for the bookkeeping
 overhead, hence the new buffer takes up 12K instead of 8, resulting in
 44K allocation for 8K bytes

 That factor of 5.5 corresponds pretty well with the allocation figures
 above,

That seems to be correct, but probably not the whole story.
I've played with defaultChunkSize, setting it to (64K - overhead), ghc 
still reads in 8192 byte chunks, the allocation figures are nearly double 
those for (32K - overhead). Setting it to (8K - overhead), ghc reads in 
8184 byte chunks and the allocation figures go down to approximately 1.4 
times those of 6.12.3.
Can a factor of 1.4 be explained by the smaller chunk size or is something 
else going on?

 and the extra copying explains the approximate doubling of I/O time.

Apparently not. With the small chunk size which should avoid copying, the 
I/O didn't get faster.


 Trying to find out why ghc asks the OS for only 8192 bytes instead of
 32760 hasn't brought enlightenment yet.

No progress on that front.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users