Re: [dev] Best way to serialize data
On 2011-06-06 21:06, Džen wrote:
> Pretty much answers my question. In my use case it'd be easier to use
> delimiters like \0 or \n, due to the data not being binary. However now
> I wonder, which method would need more cpu time? I suppose that when
> using delimiters there isn't an easier way than using fgetc(), reading
> through the whole data stream. Hard-coded field lengths would be faster
> if the fields contain a lot of characters I guess.

It would probably be easier and faster to read the whole file into a buffer ahead of time, then parse it afterwards. That way you can read in larger chunks and don't need a separate fgetc() call for every byte. Plenty of functions have already been written to do this, so you can probably just use one of those.

Regardless of what you do, the time it takes to parse the file will probably be insignificant compared to the time it takes to read the file from disk.

If you use \0 as the delimiter for all your cells, then you can just use pointers into the buffer for your table, because all the strings are already null-terminated. For example:

    /* readfile() is a placeholder for any read-everything helper */
    char *table[rows][cols];
    char *data = readfile(stdin);

    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++) {
            table[r][c] = data;
            data += strlen(data) + 1;
        }
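A minimal sketch of the "read everything, then index it" idea from the message above. The names read_all() and index_cells() are made up for illustration (the thread only mentions a generic readfile()), and the flat cells[] array is an assumption: cell (r, c) would live at cells[r * cols + c].

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read everything from fp into one malloc'd buffer.
 * Stores the byte count in *len and appends a final '\0' as a safety net.
 * Returns NULL on allocation or read failure. */
char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0, n;
    char *buf = malloc(cap + 1);

    if (!buf)
        return NULL;
    while ((n = fread(buf + used, 1, cap - used, fp)) > 0) {
        used += n;
        if (used == cap) {              /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2 + 1);
            if (!tmp) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[used] = '\0';
    *len = used;
    return buf;
}

/* Point each slot of cells[] at the start of a '\0'-terminated cell in buf.
 * buf must hold len bytes followed by one extra '\0' (as read_all provides).
 * Returns the number of cells found, at most max_cells. */
size_t index_cells(char *buf, size_t len, char **cells, size_t max_cells)
{
    size_t n = 0;
    char *p = buf, *end = buf + len;

    while (p < end && n < max_cells) {
        cells[n++] = p;
        p += strlen(p) + 1;             /* skip the cell and its terminator */
    }
    return n;
}
```

A caller would typically do `buf = read_all(stdin, &len)` followed by `index_cells(buf, len, cells, rows * cols)`; no cell is copied, so the table stays valid only as long as buf does.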
Re: [dev] Best way to serialize data
On 07/06/11 01:14pm, hiro wrote:
> You'll have to admit it sounds complicated at least :)

Right, true on that -- just wasn't sure how ironic your comment was meant to be ;)

-- 
Džen
Re: [dev] Best way to serialize data
You'll have to admit it sounds complicated at least :)

On Tue, Jun 7, 2011 at 10:18, Džen wrote:
> On 07/06/11 03:24am, hiro wrote:
>> Interesting, you seem to be on the right track.
>
> ... right track to what? Utter havoc?
Re: [dev] Best way to serialize data
On 07/06/11 03:24am, hiro wrote:
> Interesting, you seem to be on the right track.

... right track to what? Utter havoc?
Re: [dev] Best way to serialize data
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.

Interesting, you seem to be on the right track.
Re: [dev] Best way to serialize data
On 6 June 2011 20:07, Džen wrote:
> I wonder, which method would need more cpu time? I suppose that when
> using delimiters there isn't an easier way than using fgetc(), reading
> through the whole data stream. Hard-coded field lengths would be faster
> if the fields contain a lot of characters I guess.

Again, it's about the use case. For small fields you probably don't have to worry about reading inefficiencies, so use delimiters. For large fields use fixed-length or length-prefixed, for the reasons I mentioned. Both approaches are easy to write; fixed-length is faster for seeking to the nth item, while length-prefixed is more compact.

cls
Re: [dev] Best way to serialize data
Pretty much answers my question. In my use case it'd be easier to use delimiters like \0 or \n, due to the data not being binary. However now I wonder, which method would need more cpu time? I suppose that when using delimiters there isn't an easier way than using fgetc(), reading through the whole data stream. Hard-coded field lengths would be faster if the fields contain a lot of characters I guess.

On 06/06/2011 20:22, Connor Lane Smith wrote:
> It ultimately depends on the use case. If you don't need \0 or \n in
> cells, your format is fine. If you do need them, there are two
> approaches:
>
> As Dieter suggested, you can use fixed length fields. This is great if
> you have a maximum cell width, especially if this length is small or
> most fields use most of the space. This approach is used in, for
> example, tarballs' filename fields.
>
> However, if the cells dramatically vary in length, and the maximum is
> rather large, a better alternative is to use length-prefixing, using a
> number of bytes according to how large you expect your rows and cells
> to be:
>
> 0x000d 0x0006 "hello"\0 0x0007 "world!"\0
>
> That is, 2-byte row length followed by two cells each with a 2-byte
> cell length (and I've null-terminated the strings in the example). You
> may need 4 or 8 bytes if your data is very long.
>
> The benefit of this is that you can check the row length and jump
> straight to the next row, or carry on into the row and iterate its
> cells. It is also completely independent of content: you can store
> anything. The problem with using ASCII delimiters is you can't store
> binary data, and you have to check each cell's content and everything.
> It's a hassle; using length-prefixing is way easier. (This approach is
> very often used in binary protocols, such as 9P and Sam.)

-- 
Džen
Re: [dev] Best way to serialize data
On 6 June 2011 19:22, Connor Lane Smith wrote:
> 0x000d 0x0006 "hello"\0 0x0007 "world!"\0

I made a mistake here, I meant to say...

0x0011 0x0006 "hello"\0 0x0007 "world!"\0

... including two 2-byte cell widths in the row width.
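The corrected row length can be checked by hand: 2 + 6 + 2 + 7 = 17 = 0x0011, counting every byte after the row-length field. The sketch below encodes a row in that layout. The function names are invented for illustration, and big-endian byte order is an assumption; the thread does not specify one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: lengths are stored big-endian; the thread does not say. */
void put_u16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xff);
}

uint16_t get_u16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Encode one row of ncells strings into out (capacity cap bytes).
 * Layout: u16 row length, then per cell: u16 cell length, bytes, '\0'.
 * Cell lengths include the terminator; the row length counts everything
 * after the row-length field itself.
 * Returns the total bytes written, or 0 if the row does not fit. */
size_t encode_row(unsigned char *out, size_t cap,
                  const char *const *cells, size_t ncells)
{
    size_t pos = 2, i;                  /* leave room for the row length */

    if (cap < 2)
        return 0;
    for (i = 0; i < ncells; i++) {
        size_t n = strlen(cells[i]) + 1;

        if (n > 0xffff || pos + 2 + n > cap)
            return 0;
        put_u16(out + pos, (uint16_t)n);
        memcpy(out + pos + 2, cells[i], n);
        pos += 2 + n;
    }
    if (pos - 2 > 0xffff)
        return 0;
    put_u16(out, (uint16_t)(pos - 2));
    return pos;
}
```

With this layout a reader can jump to the next row with `row + 2 + get_u16(row)` without looking at any cell, which is the benefit described in the thread.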
Re: [dev] Best way to serialize data
Hey,

On 6 June 2011 18:19, Džen wrote:
> I was wondering about which way would be the easiest/simplest to
> serialize data, e.g. being read via a file or stdin (data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 nor \n
> inside the "cells", but I couldn't think of a simpler solution.

It ultimately depends on the use case. If you don't need \0 or \n in cells, your format is fine. If you do need them, there are two approaches:

As Dieter suggested, you can use fixed length fields. This is great if you have a maximum cell width, especially if this length is small or most fields use most of the space. This approach is used in, for example, tarballs' filename fields.

However, if the cells dramatically vary in length, and the maximum is rather large, a better alternative is to use length-prefixing, using a number of bytes according to how large you expect your rows and cells to be:

0x000d 0x0006 "hello"\0 0x0007 "world!"\0

That is, 2-byte row length followed by two cells each with a 2-byte cell length (and I've null-terminated the strings in the example). You may need 4 or 8 bytes if your data is very long.

The benefit of this is that you can check the row length and jump straight to the next row, or carry on into the row and iterate its cells. It is also completely independent of content: you can store anything. The problem with using ASCII delimiters is you can't store binary data, and you have to check each cell's content and everything. It's a hassle; using length-prefixing is way easier. (This approach is very often used in binary protocols, such as 9P and Sam.)

Thanks,
cls
Re: [dev] Best way to serialize data
On 06/06/2011 19:36, Douglas S. Bregolin wrote:
> In the ASCII table there's a "record separator" character (0x1E). At
> least I think it is better than using '\0'.

On 06/06/2011 19:42, Christoph Lohmann wrote:
> then why wasn't \x1C-\x1F used before for a data exchange format?

To be honest, I've never really cared to look up the use of these characters, although knowing they were there. It seems to be what I've been looking for.

> I don't think we are the first to discover them in the ASCII table. It
> would be quite neat for a simpler XML replacement.

I guess that control characters in general are disliked, because people seem to overlook these bytes when using crippled text editors.

-- 
Džen
Re: [dev] Best way to serialize data
Hello,

Douglas S. Bregolin wrote:
> In the ASCII table there's a "record separator" character (0x1E). At
> least I think it is better than using '\0'.

then why wasn't \x1C-\x1F used before for a data exchange format? I don't think we are the first to discover them in the ASCII table. It would be quite neat for a simpler XML replacement.

Sincerely,
Christoph Lohmann
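As a rough illustration of the separator idea discussed here: the sketch below splits a buffer in place on the ASCII record separator (0x1E) for rows, then on the unit separator (0x1F) for cells. Using 0x1F for cells is an assumption based on the ASCII names, and split() is a made-up helper, not something proposed in the thread.

```c
#include <stddef.h>

#define US 0x1f   /* ASCII unit separator: assumed to sit between cells */
#define RS 0x1e   /* ASCII record separator: assumed to sit between rows */

/* Split the '\0'-terminated string s in place on sep.
 * Each separator is overwritten with '\0' and out[] receives a pointer
 * to the start of every piece. At most max pieces are stored; anything
 * after the max-th piece is cut off. A trailing separator yields an
 * empty final piece. Returns the number of pieces stored. */
size_t split(char *s, char sep, char **out, size_t max)
{
    size_t n = 0;

    if (max == 0)
        return 0;
    out[n++] = s;
    for (; *s != '\0'; s++) {
        if (*s == sep) {
            *s = '\0';
            if (n == max)
                break;
            out[n++] = s + 1;
        }
    }
    return n;
}
```

Calling `split(data, RS, rows, max_rows)` and then `split(rows[i], US, cells, max_cells)` on each row gives the table. Cells may then contain \n freely, but still not \0, 0x1E or 0x1F.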
Re: [dev] Best way to serialize data
On Mon, 06 Jun 2011 19:19:56 +0200 Džen wrote:
> I was wondering about which way would be the easiest/simplest to
> serialize data, e.g. being read via a file or stdin (data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 nor \n
> inside the "cells", but I couldn't think of a simpler solution.
>
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
>
> What would you recommend? How'd you do it?

http://catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882

Robert Ransom
Re: [dev] Best way to serialize data
In the ASCII table there's a "record separator" character (0x1E). At least I think it is better than using '\0'.

On Mon, Jun 6, 2011 at 2:19 PM, Džen wrote:
> I was wondering about which way would be the easiest/simplest to
> serialize data, e.g. being read via a file or stdin (data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 nor \n
> inside the "cells", but I couldn't think of a simpler solution.
>
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
>
> What would you recommend? How'd you do it?
>
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.
>
> --
> Džen
Re: [dev] Best way to serialize data
On Mon, 06 Jun 2011 19:19:56 +0200 Džen wrote:
> I was wondering about which way would be the easiest/simplest to
> serialize data, e.g. being read via a file or stdin (data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 nor \n
> inside the "cells", but I couldn't think of a simpler solution.
>
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
>
> What would you recommend? How'd you do it?
>
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.
>

the alternative is using implicit boundaries (i.e. hardcoding field lengths). you gain simplicity at the expense of space consumption.

you could look at how databases like mysql, berkeleydb or sqlite store their tables on disk. these things are quite well thought-through.

Dieter
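A small sketch of what hardcoded field lengths could look like, assuming every cell occupies the same number of bytes. CELL_WIDTH and the function names are invented for illustration; the point is only that the position of any cell becomes simple arithmetic, paid for with padding in every short cell.

```c
#include <stddef.h>
#include <string.h>

#define CELL_WIDTH 16   /* hypothetical fixed width of every cell, in bytes */

/* Byte offset of cell (row, col) in a table with ncols columns. */
size_t cell_offset(size_t row, size_t col, size_t ncols)
{
    return (row * ncols + col) * CELL_WIDTH;
}

/* Store s in its fixed-width slot, padding the rest with '\0'.
 * One byte is always kept for the terminator.
 * Returns 0 on success, -1 if s does not fit. */
int put_cell(char *table, size_t row, size_t col, size_t ncols, const char *s)
{
    size_t n = strlen(s);
    char *slot = table + cell_offset(row, col, ncols);

    if (n >= CELL_WIDTH)
        return -1;
    memset(slot, 0, CELL_WIDTH);
    memcpy(slot, s, n);
    return 0;
}

/* Return a pointer to the '\0'-terminated contents of cell (row, col). */
const char *get_cell(const char *table, size_t row, size_t col, size_t ncols)
{
    return table + cell_offset(row, col, ncols);
}
```

On a file, the same offset could be passed to fseek() to read a single cell without scanning anything before it.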
[dev] Best way to serialize data
I was wondering about which way would be the easiest/simplest to serialize data, e.g. being read via a file or stdin (data being a table of x rows and y columns, each cell a string). I thought of using NUL bytes as cell delimiters and newline characters as row delimiters. This way it wouldn't be possible to use \0 nor \n inside the "cells", but I couldn't think of a simpler solution.

Something like:
a \0 b \0 c \n
d \0 e \0 f \n
...

What would you recommend? How'd you do it?

Reason why I'm asking is because I was wondering how a dmenu-alike utility would read data, where each item has multiple values, not just one. Kinda like a search utility for table-structured data.

-- 
Džen
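For reference, a minimal sketch of a writer for the format proposed above: cells separated by \0, each row ended by \n. The name serialize_row() is hypothetical. Since C strings cannot contain \0 anyway, the only check needed is for \n inside a cell.

```c
#include <stddef.h>
#include <string.h>

/* Write one row into out (capacity cap bytes): cells are separated by
 * '\0' and the row is terminated by '\n'.
 * Returns the number of bytes written, or 0 if the row does not fit,
 * has no cells, or a cell contains '\n'. */
size_t serialize_row(char *out, size_t cap,
                     const char *const *cells, size_t ncells)
{
    size_t pos = 0, i;

    for (i = 0; i < ncells; i++) {
        size_t n = strlen(cells[i]);

        if (strchr(cells[i], '\n') != NULL)
            return 0;               /* '\n' is reserved as the row delimiter */
        if (pos + n + 1 > cap)
            return 0;               /* cell plus its delimiter must fit */
        memcpy(out + pos, cells[i], n);
        pos += n;
        out[pos++] = (i + 1 < ncells) ? '\0' : '\n';
    }
    return pos;
}
```

The result can be passed straight to fwrite(); fputs() would not work here because it stops at the first \0.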