Re: Reading a structured binary file?

2013-08-06 Thread H. S. Teoh
On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
[...]
 The only way I'm seeing to advance through the file is to keep an
 index on where you're currently reading from. This actually works
 perfect for the FileRange I mentioned in the previous post. Though
 I'm not familiar with how mmfile manages its memory, but hopefully
 there isn't buffer reuse or storing the slice could be overridden
 (not an issue for value data, but string data).

I don't know about D's MmFile, but AFAIK it maps directly to the OS
mmap(), which basically maps a portion of your program's address space
to the data on the disk. That means the memory is managed by the OS,
and addresses will not change from under you.

In the underlying physical memory, pages may get swapped out and reused,
but this is invisible to your program, since referencing them will cause
the OS to swap the pages back in, so you'll never end up with invalid
pointers. The worst that could happen is the I/O performance hit
associated with swapping. Such is the utility of virtual memory.
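
To make the point about stable addresses concrete, here is a minimal sketch that keeps slices of string data pointing into the mapping instead of copying them. The function name collectStrings and the NUL-terminated layout are assumptions made up for illustration. It also assumes the default MmFile constructor, which as far as I know maps the whole file at once; if you pass a nonzero window size, MmFile may remap and earlier slices should not be relied on.

```d
import std.mmfile : MmFile;

/// Collects every NUL-terminated string in the file as a slice into the
/// mapping. No bytes are copied: each result points into the mmap'ed region.
/// (Hypothetical helper; the file layout is an assumption for this sketch.)
const(char)[][] collectStrings(MmFile mm)
{
    auto bytes = cast(const(ubyte)[]) mm[]; // view over the whole mapping
    const(char)[][] found;
    size_t start = 0;
    foreach (i, b; bytes)
    {
        if (b == 0)
        {
            // Store a slice of the mapping, not a copy of the bytes
            found ~= cast(const(char)[]) bytes[start .. i];
            start = i + 1;
        }
    }
    return found;
}
```

The stored slices stay usable only while the MmFile object is alive, since its destructor unmaps the file.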


T



Re: Reading a structured binary file?

2013-08-06 Thread Jonathan M Davis
On Monday, August 05, 2013 23:04:58 H. S. Teoh wrote:
 On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
 [...]
 
  The only way I'm seeing to advance through the file is to keep an
  index on where you're currently reading from. This actually works
  perfect for the FileRange I mentioned in the previous post. Though
  I'm not familiar with how mmfile manages its memory, but hopefully
  there isn't buffer reuse or storing the slice could be overridden
  (not an issue for value data, but string data).
 
 I don't know about D's Mmfile, but AFAIK, it maps directly to the OS
 mmap(), which basically maps a portion of your program's address space
 to the data on the disk. Meaning that the memory is managed by the OS,
 and addresses will not change from under you.
 
 In the underlying physical memory, pages may get swapped out and reused,
 but this is invisible to your program, since referencing them will cause
 the OS to swap the pages back in, so you'll never end up with invalid
 pointers. The worst that could happen is the I/O performance hit
 associated with swapping. Such is the utility of virtual memory.

mmap is awesome. It makes handling large files _way_ easier, especially when 
you have to worry about performance. It was a huge performance boost for one 
of our video recorder programs where I work when we switched to using mmap on 
it (this device is recording multiple video streams from cameras 24/7, and 
performance is critical). Trying to do what mmap does on your own is 
incredibly bug-prone and bound to be worse for performance (since you're doing 
it instead of the kernel). One of our older products tries to do it on its own 
(probably because the developers didn't know about mmap), and it's a royal 
mess.

- Jonathan M Davis


Re: Reading a structured binary file?

2013-08-05 Thread Jesse Phillips

On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby 
wrote:
This sounds a great idea but once the file has been opened as 
a MmFile how to i convert this to a ubyte[] so the 
std.bitmanip functions work with it?


I'm currently doing this:

auto file = new MmFile("file.dat");
ubyte[] buffer = cast(ubyte[])file[];
buffer.read!uint(); //etc.

Is this how you would recommend?


You will need to slice only as much data as you want; otherwise
you're effectively doing std.file.read(). The slice doesn't need to be
for a single value (as in the example); it could be a block of
data which is then individually parsed for the pieces.


auto file = new MmFile("file.dat");
size_t indexInFile = 0;
ubyte[] buffer = cast(ubyte[])file[indexInFile .. indexInFile + uint.sizeof];
indexInFile += uint.sizeof;
buffer.read!uint(); //etc.

The only way I'm seeing to advance through the file is to keep an
index of where you're currently reading from. This actually works
perfectly for the FileRange I mentioned in the previous post.
I'm not familiar with how MmFile manages its memory, though;
hopefully there isn't buffer reuse, or a stored slice could be
overwritten (not an issue for value data, but it is for string data).
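
A minimal sketch of the index-keeping approach, using the peek overload from std.bitmanip that takes a pointer to an index and advances it. The Record type, its field layout, and the little-endian byte order are assumptions for illustration only.

```d
import std.bitmanip : peek;
import std.mmfile : MmFile;
import std.system : Endian;

/// Hypothetical fixed-size record: a 4-byte id followed by 2 bytes of flags.
struct Record
{
    uint id;
    ushort flags;
}

/// Reads one record starting at indexInFile and advances the index past it.
Record readRecordAt(MmFile mm, ref size_t indexInFile)
{
    enum recordSize = uint.sizeof + ushort.sizeof;

    // Slice exactly the bytes this record occupies
    auto buffer = cast(const(ubyte)[]) mm[indexInFile .. indexInFile + recordSize];

    size_t local = 0; // offset inside the slice, advanced by peek
    Record r;
    r.id = buffer.peek!(uint, Endian.littleEndian)(&local);
    r.flags = buffer.peek!(ushort, Endian.littleEndian)(&local);

    indexInFile += local; // move the file cursor past what was consumed
    return r;
}
```

Calling readRecordAt in a loop with the same index variable walks the file one record at a time.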


Re: Reading a structured binary file?

2013-08-03 Thread Gary Willoughby

On Friday, 2 August 2013 at 22:13:28 UTC, Jonathan M Davis wrote:
I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to
efficiently operate on the file as a ubyte[] in memory thanks to mmap, and
std.bitmanip's peek and read functions make it easy to convert multiple bytes
into integral values.

- Jonathan M Davis


This sounds like a great idea, but once the file has been opened as an
MmFile, how do I convert it to a ubyte[] so that the std.bitmanip
functions work with it?


Re: Reading a structured binary file?

2013-08-03 Thread Gary Willoughby

On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
This sounds a great idea but once the file has been opened as a 
MmFile how to i convert this to a ubyte[] so the std.bitmanip 
functions work with it?


I'm currently doing this:

import std.bitmanip, std.mmfile;

auto file = new MmFile("file.dat");
ubyte[] buffer = cast(ubyte[])file[];
buffer.read!uint(); //etc.

Is this how you would recommend?


Re: Reading a structured binary file?

2013-08-03 Thread Jonathan M Davis
On Saturday, August 03, 2013 20:23:55 Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
  This sounds a great idea but once the file has been opened as a
  MmFile how to i convert this to a ubyte[] so the std.bitmanip
  functions work with it?
 
 I'm currently doing this:
 
   auto file = new MmFile("file.dat");
   ubyte[] buffer = cast(ubyte[])file[];
   buffer.read!uint(); //etc.
 
 Is this how you would recommend?

Yeah. That's how you do it.
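
For completeness, a minimal sketch of that pattern as a whole function. The Header layout and the little-endian byte order are assumptions for illustration. Note that read advances the slice it is given, and that it defaults to big-endian unless told otherwise.

```d
import std.bitmanip : read;
import std.mmfile : MmFile;
import std.system : Endian;

/// Hypothetical file header: 1-byte version, 2-byte count, 4-byte length.
struct Header
{
    ubyte version_;
    ushort count;
    uint length;
}

Header parseHeader(string path)
{
    auto mm = new MmFile(path);
    scope (exit) destroy(mm); // unmap and close when we are done

    // Slice the whole mapping; read() consumes bytes from the front
    auto buffer = cast(const(ubyte)[]) mm[];

    Header h;
    h.version_ = buffer.read!ubyte();
    h.count = buffer.read!(ushort, Endian.littleEndian)();
    h.length = buffer.read!(uint, Endian.littleEndian)();
    return h;
}
```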

- Jonathan M Davis


Re: Reading a structured binary file?

2013-08-03 Thread John Colvin

On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby 
wrote:
This sounds a great idea but once the file has been opened as 
a MmFile how to i convert this to a ubyte[] so the 
std.bitmanip functions work with it?


I'm currently doing this:

auto file = new MmFile("file.dat");
ubyte[] buffer = cast(ubyte[])file[];
buffer.read!uint(); //etc.

Is this how you would recommend?


That defeats the object of memory mapping, as the [] at the end
of cast(ubyte[])file[] implies copying the whole file into
memory.


3 options I can think of:
1) copy read from std.bitmanip and modify it to work nicely with MmFile
2) write a wrapper for MmFile to let it work nicely with read
3) rewrite/modify MmFile

I would love to do 3) at some point, but I'm too busy at the 
moment.
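
A minimal sketch of what option 2 could look like: a small input range over an MmFile that std.bitmanip.read accepts. The name MmFileRange and its position property are made up for this example; it goes through MmFile's opIndex one byte at a time, so it is likely slower than reading from a slice.

```d
import std.bitmanip : read;
import std.mmfile : MmFile;

/// Hypothetical wrapper: presents an MmFile as an input range of ubyte.
struct MmFileRange
{
    private MmFile file;
    private ulong pos;

    this(MmFile file)
    {
        this.file = file;
    }

    /// True once every byte of the file has been consumed
    @property bool empty()
    {
        return pos >= file.length;
    }

    /// Current byte, fetched through MmFile's opIndex
    @property ubyte front()
    {
        return file[pos];
    }

    void popFront()
    {
        ++pos;
    }

    /// Offset of the next unread byte
    @property ulong position() const
    {
        return pos;
    }
}
```

With this in place, `auto r = MmFileRange(mm); auto v = r.read!uint();` reads a big-endian uint (the default for read) and leaves r positioned after it.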


Re: Reading a structured binary file?

2013-08-03 Thread Jonathan M Davis
On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
 On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
  On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
  
  wrote:
  This sounds a great idea but once the file has been opened as
  a MmFile how to i convert this to a ubyte[] so the
  std.bitmanip functions work with it?
  
  I'm currently doing this:
  auto file = new MmFile("file.dat");
  ubyte[] buffer = cast(ubyte[])file[];
  buffer.read!uint(); //etc.
  
  Is this how you would recommend?
 
 That defeats the object of memory mapping, as the [] at the end
 of cast(ubyte[])file[] implies copying the whole file in to
 memory.

Are you sure about that? Maybe I'm just not familiar enough with mmap, but I 
don't see anything in MmFile which would result in it copying the whole file 
into memory. I guess that I'll have to do some more reading up on mmap. 
Certainly, if slicing it like that copies it all into memory, that's a big 
problem.

- Jonathan M Davis


Re: Reading a structured binary file?

2013-08-03 Thread monarch_dodra

On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:

On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
[...]

FWIW
i have to deal with big data files that can be a few GB. for some data
analysis software i wrote in C a while back i did some testing with
caching and such. turns out that for Win7-64 the automatic caching
done by the OS is really good and any attempt to speed things up
actually slowed it down. no kidding, i have seen more than 2GB of data
being automatically cached. of course the system RAM must be larger
than the file size (if i remember my tests correctly by a factor of
~2, but this is maybe not a linear relationship, i did not actually
change the RAM just the size of the data file) and it will hold it in
the cache only as long as there are no concurrent applications
requiring RAM or caching. i guess my point is, if your target is Win7
and your files are 5x smaller than the installed RAM i would not
bother at all trying to optimize file access. i suppose -nix machine
will do a similar good job these days.

[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory
since the days of Win95. Of course, memory in those days was much
scarcer, but file sizes were smaller too. :) There's still a cost to
copy the kernel buffers into userspace, though, which should not be
disregarded. But if you use mmap, then you're essentially accessing that
memory cache directly, which is as good as it gets.

I don't know how well mmap works on windows, though, IIRC it doesn't
have the same semantics as Posix, so you could accidentally run into
performance issues by using it the wrong way on windows.


T


I did some benching a while back with user bioinfornatics. He had
to do some pretty large file reads, preferably in very little
time. Observations showed my algo was *much* faster under Windows
than Linux.

What we observed is that under Windows, as soon as you open a
file for reading, Windows starts buffering the file in a parallel
thread.

What we did was create two threads. The first did nothing but
read the file, store it into chunks of memory, and then pass it
to a worker thread. The worker thread did the parsing proper.

Doing this *halved* the Linux runtime, tying it with the
single-threaded Windows run time. Windows saw no change.
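
A rough sketch of that reader/worker split using std.concurrency. This is not the code from the benchmark; the function names are made up, the "parsing" is just a byte sum to keep it short, and the owner thread plays the worker role.

```d
import std.concurrency : ownerTid, receive, send, spawn;
import std.stdio : File;

/// Reader thread: does nothing but pull chunks off the disk and hand them on.
void readerThread(string path, size_t chunkSize)
{
    auto f = File(path, "rb");
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
        send(ownerTid, chunk.idup); // byChunk reuses its buffer, so copy it
    send(ownerTid, true);           // end-of-file marker
}

/// Worker side: receives chunks and does the parsing proper (here, a sum).
ulong sumBytes(string path, size_t chunkSize = 64 * 1024)
{
    spawn(&readerThread, path, chunkSize);

    ulong total = 0;
    bool done = false;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { foreach (b; chunk) total += b; },
            (bool eof) { done = true; }
        );
    }
    return total;
}
```

As far as I know the mailbox is unbounded by default, so a reader that is much faster than the worker could end up buffering a large part of the file in memory.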


FYI, the full thread is here:
forum.dlang.org/thread/gmfqwzgtjfnqiajgh...@forum.dlang.org


Re: Reading a structured binary file?

2013-08-03 Thread H. S. Teoh
On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
 On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
  On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
   On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
   
   wrote:
   This sounds a great idea but once the file has been opened as
   a MmFile how to i convert this to a ubyte[] so the
   std.bitmanip functions work with it?
   
   I'm currently doing this:
 auto file = new MmFile("file.dat");
 ubyte[] buffer = cast(ubyte[])file[];
 buffer.read!uint(); //etc.
   
   Is this how you would recommend?
  
  That defeats the object of memory mapping, as the [] at the end
  of cast(ubyte[])file[] implies copying the whole file in to
  memory.
 
 Are you sure about that? Maybe I'm just not familiar enough with mmap,
 but I don't see anything in MmFile which would result in it copying
 the whole file into memory. I guess that I'll have to do some more
 reading up on mmap.  Certainly, if slicing it like that copies it all
 into memory, that's a big problem.
[...]

I think he meant that the OS will have to load the entire file into
memory if you sliced the mmap'ed file, not that you'll copy all the
data.

I'm not certain this is true, though, because slicing as I understand it
only returns the address of the start of the mmap'ed addresses coupled
with its length. I don't think the OS will actually load anything into
memory until you reference an address within that mmap'ed range. And
even then, only those disk blocks that correspond with the referenced
addresses will actually be loaded -- this is the point of virtual
memory, after all.
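
A small sketch that checks the "pointer plus length" part of this, assuming the default MmFile constructor that maps the whole file. The function name is made up. It only shows that slicing hands back views onto the same mapping; it says nothing about when the OS actually pages the data in.

```d
import std.mmfile : MmFile;

/// Returns true if two whole-file slices of the same MmFile share storage,
/// i.e. slicing produced views of the mapping rather than copies.
bool slicesShareStorage(MmFile mm)
{
    auto a = cast(const(ubyte)[]) mm[];
    auto b = cast(const(ubyte)[]) mm[];
    return a.ptr is b.ptr && a.length == b.length;
}
```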


T



Re: Reading a structured binary file?

2013-08-03 Thread H. S. Teoh
On Sat, Aug 03, 2013 at 11:29:01PM +0200, monarch_dodra wrote:
 On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
 On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
 [...]
 FWIW
 i have to deal with big data files that can be a few GB. for some
 data analysis software i wrote in C a while back i did some testing
 with caching and such. turns out that for Win7-64 the automatic
 caching done by the OS is really good and any attempt to speed
 things up actually slowed it down. no kidding, i have seen more than
 2GB of data being automatically cached. of course the system RAM
 must be larger than the file size (if i remember my tests correctly
 by a factor of ~2, but this is maybe not a linear relationship, i
 did not actually change the RAM just the size of the data file) and
 it will hold it in the cache only as long as there are no concurrent
 applications requiring RAM or caching. i guess my point is, if your
 target is Win7 and your files are 5x smaller than the installed RAM
 i would not bother at all trying to optimize file access. i suppose
 -nix machine will do a similar good job these days.
 [...]
 
 IIRC, Linux has been caching files (or disk blocks, rather) in memory
 since the days of Win95. Of course, memory in those days was much
 scarcer, but file sizes were smaller too. :) There's still a cost to
 copy the kernel buffers into userspace, though, which should not be
 disregarded. But if you use mmap, then you're essentially accessing
 that memory cache directly, which is as good as it gets.
 
 I don't know how well mmap works on windows, though, IIRC it doesn't
 have the same semantics as Posix, so you could accidentally run into
 performance issues by using it the wrong way on windows.
[...]
 I did some benching a while back with user bioinfornatics. He had to
 do some pretty large file reads, preferably in very little time.
 Observations showed my algo was *much* faster under windows then
 linux.

Sorry, I lost the context of this discussion, what algo are you
referring to?


 What we observed is that under windows, as soon as you open a file
 for reading, windows starts buffering the file in a parallel thread.
 
 What we did was create two threads. The first did nothing but read
 the file, store it into chunks of memory, and then pass it to a
 worker thread. The worker thread did the parsing proper.
 
 Doing this *halved* the linux runtime, tying it with the
 monothreaded windows run time. Windows saw no change.

Interesting. I wonder if you could, under Linux, mmap a file then have
one thread access the first byte of each file block while another thread
does the real work with the data.
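
An untested-for-speed sketch of that idea, using std.parallelism to run the page-touching loop in its own thread. The function names and the 4096-byte page size are assumptions, and whether this helps at all will depend on the OS and the disk.

```d
import std.mmfile : MmFile;
import std.parallelism : task;

/// Touches one byte per page so the OS has to fault each page in.
/// The sum is returned only so the loads are not optimized away.
ulong touchPages(const(ubyte)[] data, size_t pageSize)
{
    ulong acc = 0;
    for (size_t i = 0; i < data.length; i += pageSize)
        acc += data[i];
    return acc;
}

/// Sums all bytes of the file while a second thread runs ahead touching pages.
ulong sumWithPrefetch(string path, size_t pageSize = 4096)
{
    auto mm = new MmFile(path);
    scope (exit) destroy(mm);
    auto data = cast(const(ubyte)[]) mm[];

    // Prefetch thread: touches the first byte of every page
    auto prefetch = task!touchPages(data, pageSize);
    prefetch.executeInNewThread();

    // The "real work" on the data, here just a byte sum
    ulong total = 0;
    foreach (b; data)
        total += b;

    // Wait for the prefetch thread before the mapping is destroyed
    const touched = prefetch.yieldForce;
    return total;
}
```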


 FYI, the full thread is here:
 forum.dlang.org/thread/gmfqwzgtjfnqiajgh...@forum.dlang.org

I'll take a look, thanks.


T



Re: Reading a structured binary file?

2013-08-03 Thread Jonathan M Davis
On Saturday, August 03, 2013 14:31:16 H. S. Teoh wrote:
 On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
  On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
   On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby

wrote:
This sounds a great idea but once the file has been opened as
a MmFile how to i convert this to a ubyte[] so the
std.bitmanip functions work with it?

I'm currently doing this:
auto file = new MmFile("file.dat");
ubyte[] buffer = cast(ubyte[])file[];
buffer.read!uint(); //etc.

Is this how you would recommend?
   
   That defeats the object of memory mapping, as the [] at the end
   of cast(ubyte[])file[] implies copying the whole file in to
   memory.
  
  Are you sure about that? Maybe I'm just not familiar enough with mmap,
  but I don't see anything in MmFile which would result in it copying
  the whole file into memory. I guess that I'll have to do some more
  reading up on mmap.  Certainly, if slicing it like that copies it all
  into memory, that's a big problem.
 
 [...]
 
 I think he meant that the OS will have to load the entire file into
 memory if you sliced the mmap'ed file, not that you'll copy all the
 data.
 
 I'm not certain this is true, though, because slicing as I understand it
 only returns the address of the start of the mmap'ed addresses coupled
 with its length. I don't think the OS will actually load anything into
 memory until you reference an address within that mmap'ed range. And
 even then, only those disk blocks that correspond with the referenced
 addresses will actually be loaded -- this is the point of virtual
 memory, after all.

That's what I thought that mmap did, but it's not something that I've studied 
in detail.

Aside from that though, my main complaint about MmFile is the fact that it's a 
class when it really should be a reference-counted struct. At some point, we 
should probably create MMFile or somesuch which _is_ a reference counted 
struct and then deprecate MmFile. But if we do that, then we should be sure of 
whatever other changes the implementation needs and do those with it.

- Jonathan M Davis


Reading a structured binary file?

2013-08-02 Thread Gary Willoughby
What library functions do I use to read from a structured binary
file? I want to read the byte stream 1, 2, or maybe 4 bytes at a time
and cast these to bytes, shorts, and ints respectively. I can't
seem to find anything like readByte().


Re: Reading a structured binary file?

2013-08-02 Thread Dicebot

On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
What library commands do i use to read from a structured binary 
file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
time and cast these to bytes, shorts and ints respectively. I 
can't seem to find anything like readByte().


http://dlang.org/phobos/std_file.html#.read
http://dlang.org/phobos/std_stdio.html#.File.rawRead

?


Re: Reading a structured binary file?

2013-08-02 Thread Justin Whear
On Fri, 02 Aug 2013 19:49:54 +0200, Gary Willoughby wrote:

 What library commands do i use to read from a structured binary file? I
 want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these
 to bytes, shorts and ints respectively. I can't seem to find anything
 like readByte().

You can use File.rawRead:

ushort[1] myShort;
file.rawRead(myShort);

Or if you have structures in the file:

struct Foo
{
align(1):
    int bar;
    short k;
    char[7] str;
}
Foo[1] foo;
file.rawRead(foo);
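
A self-contained sketch of the struct variant, wrapped in a hypothetical readFoo function. Two assumptions on my part: the outer align(1) is added because, as I read the spec, the inner `align(1):` only packs the fields and does not remove trailing padding from the struct itself; and the file is assumed to be in the machine's native byte order, since rawRead does no endian conversion.

```d
import std.exception : enforce;
import std.stdio : File;

/// Packed on-disk record, same fields as in the post above.
align(1) struct Foo
{
align(1):
    int bar;
    short k;
    char[7] str;
}

/// Reads one Foo from the start of the file (hypothetical helper).
Foo readFoo(string path)
{
    auto file = File(path, "rb");
    Foo[1] foo;

    // rawRead returns the part of the buffer that was actually filled
    auto got = file.rawRead(foo[]);
    enforce(got.length == 1, "file too short for one Foo");
    return foo[0];
}
```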


Re: Reading a structured binary file?

2013-08-02 Thread John Colvin

On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
What library commands do i use to read from a structured binary 
file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
time and cast these to bytes, shorts and ints respectively. I 
can't seem to find anything like readByte().


How big is the file?

If it's not too huge I'd just read it in with std.file.read and
then sort out splitting it up from there.
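
A minimal sketch of that approach, assuming a file that is nothing but big-endian uints (the default byte order for std.bitmanip.read). The function name is made up, and std.file.read is imported under the alias fileRead to avoid clashing with std.bitmanip.read.

```d
import std.bitmanip : read;
import std.file : fileRead = read;

/// Loads the whole file into memory, then splits it into uints.
uint[] readAllUints(string path)
{
    // One call pulls the entire file into a GC-allocated buffer
    auto bytes = cast(const(ubyte)[]) fileRead(path);

    uint[] values;
    while (bytes.length >= uint.sizeof)
        values ~= bytes.read!uint(); // consumes 4 bytes per call
    return values;
}
```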


Re: Reading a structured binary file?

2013-08-02 Thread Gary Willoughby

How big is the file?

If it's not too huge i'd just read it in with std.file.read and 
then sort out splitting it up from there.


Quite large, so I'll probably stream it. Thanks, guys.


Re: Reading a structured binary file?

2013-08-02 Thread Jesse Phillips

On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
What library commands do i use to read from a structured binary 
file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
time and cast these to bytes, shorts and ints respectively. I 
can't seem to find anything like readByte().


You've gotten some help already around functions D provides. But 
I thought I would mention I'd recently tried to do some large 
file parsing for binary data, and decided to try and blog about 
it.


http://he-the-great.livejournal.com/47550.html

I can't say this is the best solution, but it worked. I was 
parsing a 20 gig OpenStreetMap planet file.


Re: Reading a structured binary file?

2013-08-02 Thread Jonathan M Davis
On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:
 What library commands do i use to read from a structured binary
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a time
 and cast these to bytes, shorts and ints respectively. I can't
 seem to find anything like readByte().

I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to 
efficiently operate on the file as a ubyte[] in memory thanks to mmap, and 
std.bitmanip's peek and read functions make it easy to convert multiple bytes 
into integral values.
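
A small in-memory sketch of the difference between the two: peek looks at the bytes without consuming them, while read advances the slice. The Parsed struct and function name are just for illustration; both functions default to big-endian.

```d
import std.bitmanip : peek, read;
import std.system : Endian;

/// Results of one peek followed by two reads (hypothetical helper type).
struct Parsed
{
    uint peeked;
    uint first;
    ushort second;
    size_t left;
}

Parsed peekThenRead(const(ubyte)[] buffer)
{
    Parsed p;
    p.peeked = buffer.peek!uint();  // look at 4 bytes, consume nothing
    p.first = buffer.read!uint();   // same 4 bytes, now consumed
    p.second = buffer.read!(ushort, Endian.littleEndian)();
    p.left = buffer.length;         // bytes remaining after both reads
    return p;
}
```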

- Jonathan M Davis


Re: Reading a structured binary file?

2013-08-02 Thread captaindet

On 2013-08-02 17:13, Jonathan M Davis wrote:

On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:

What library commands do i use to read from a structured binary
file? I want to read the byte stream 1, 2 maybe 4 bytes at a time
and cast these to bytes, shorts and ints respectively. I can't
seem to find anything like readByte().


I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to
efficiently operate on the file as a ubyte[] in memory thanks to mmap, and
std.bitmanip's peek and read functions make it easy to convert multiple bytes
into integral values.

- Jonathan M Davis


FWIW
i have to deal with big data files that can be a few GB. for some data analysis 
software i wrote in C a while back i did some testing with caching and such. turns 
out that for Win7-64 the automatic caching done by the OS is really good and any 
attempt to speed things up actually slowed it down. no kidding, i have seen more 
than 2GB of data being automatically cached. of course the system RAM must be 
larger than the file size (if i remember my tests correctly by a factor of ~2, but 
this is maybe not a linear relationship, i did not actually change the RAM just 
the size of the data file) and it will hold it in the cache only as long as there 
are no concurrent applications requiring RAM or caching. i guess my point is, if 
your target is Win7 and your files are 5x smaller than the installed RAM i 
would not bother at all trying to optimize file access. i suppose -nix machine 
will do a similar good job these days.

/det


Re: Reading a structured binary file?

2013-08-02 Thread H. S. Teoh
On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
[...]
 FWIW
 i have to deal with big data files that can be a few GB. for some data
 analysis software i wrote in C a while back i did some testing with
 caching and such. turns out that for Win7-64 the automatic caching
 done by the OS is really good and any attempt to speed things up
 actually slowed it down. no kidding, i have seen more than 2GB of data
 being automatically cached. of course the system RAM must be larger
 than the file size (if i remember my tests correctly by a factor of
 ~2, but this is maybe not a linear relationship, i did not actually
 change the RAM just the size of the data file) and it will hold it in
 the cache only as long as there are no concurrent applications
 requiring RAM or caching. i guess my point is, if your target is Win7
 and your files are 5x smaller than the installed RAM i would not
 bother at all trying to optimize file access. i suppose -nix machine
 will do a similar good job these days.
[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory
since the days of Win95. Of course, memory in those days was much
scarcer, but file sizes were smaller too. :) There's still a cost to
copy the kernel buffers into userspace, though, which should not be
disregarded. But if you use mmap, then you're essentially accessing that
memory cache directly, which is as good as it gets.
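
To put the two access paths side by side, a minimal sketch with made-up function names: the first copies file data into a buffer we allocate, the second returns a view of the mapping.

```d
import std.mmfile : MmFile;
import std.stdio : File;

/// Conventional read: the data is copied into a userspace buffer.
ubyte[] viaRawRead(string path, size_t n)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[n];
    return f.rawRead(buf); // returns the filled part of buf
}

/// Memory-mapped access: no buffer of our own, just a slice of the mapping.
const(ubyte)[] viaMmap(MmFile mm, size_t n)
{
    return cast(const(ubyte)[]) mm[0 .. n];
}
```

Both return the same bytes; the difference is only in who owns the memory and whether a copy was made.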

I don't know how well mmap works on Windows, though. IIRC it doesn't
have the same semantics as POSIX, so you could accidentally run into
performance issues by using it the wrong way on Windows.


T
