Re: [dev] Best way to serialize data

2011-06-08 Thread Andy Spencer
On 2011-06-06 21:06, Džen wrote:
> Pretty much answers my question. In my use case it'd be easier to use
> delimiters like \0 or \n, due to the data not being binary. However now
> I wonder, which method would need more cpu time? I suppose that when
> using delimiters there isn't an easier way than using fgetc(), reading
> through the whole data stream. Hard-coded field lengths would be faster
> if the fields contain a lot of characters I guess.

It would probably be easier/faster to read the whole file into a buffer ahead
of time, then parse it afterwards. That way you can read in larger chunks and
don't have to do a whole bunch of calls to fgetc. There have been a billion
functions written to do this already so you can probably just use one of
those.
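
As a rough illustration of such a read-everything function, here is a minimal sketch. The name slurp and its signature are made up for this example, and it assumes the input fits in memory. It returns the byte count separately because the data may contain \0 bytes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from fp into one heap buffer.
 * Stores the byte count in *len and returns the buffer (caller frees),
 * or NULL on error. One extra '\0' is appended so that the last cell
 * is terminated even if the input lacks a final delimiter. */
char *slurp(FILE *fp, size_t *len)
{
	size_t cap = 4096, n = 0, got;
	char *buf = malloc(cap + 1), *tmp;

	if (!buf)
		return NULL;
	while ((got = fread(buf + n, 1, cap - n, fp)) > 0) {
		n += got;
		if (n == cap) {            /* buffer full: double it */
			cap *= 2;
			if (!(tmp = realloc(buf, cap + 1))) {
				free(buf);
				return NULL;
			}
			buf = tmp;
		}
	}
	if (ferror(fp)) {
		free(buf);
		return NULL;
	}
	buf[n] = '\0';
	*len = n;
	return buf;
}
```

Calling slurp(stdin, &len) once replaces the per-character fgetc loop; parsing then works on the buffer alone.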

Regardless of what you do, the time it takes to parse the file will probably be
insignificant compared to the amount of time it takes to read the file from
disk.

If you use \0 as the delimiter for all your cells, then you can just use
pointers into the buffer for your table because all the strings are already
null-terminated.

For example:
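  /* needs <string.h>; rows and cols are assumed to be known in advance */
  char *table[rows][cols];
  char *data = readfile(stdin);  /* readfile: any read-whole-file helper */
  for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
          table[r][c] = data;          /* cell points into the buffer */
          data += strlen(data) + 1;    /* skip past the '\0' delimiter */
      }
  }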




Re: [dev] Best way to serialize data

2011-06-07 Thread Džen
On 07/06/11 01:14pm, hiro wrote:
> You'll have to admit it sounds complicated at least :)

Right, true on that -- just wasn't sure how ironic your comment was
meant to be ;)

-- 
Džen



Re: [dev] Best way to serialize data

2011-06-07 Thread hiro
You'll have to admit it sounds complicated at least :)

On Tue, Jun 7, 2011 at 10:18, Džen  wrote:
> On 07/06/11 03:24am, hiro wrote:
>> Interesting, you seem to be on the right track.
>
> ... right track to what? Utter havoc?
>
>
>



Re: [dev] Best way to serialize data

2011-06-07 Thread Džen
On 07/06/11 03:24am, hiro wrote:
> Interesting, you seem to be on the right track.

... right track to what? Utter havoc?




Re: [dev] Best way to serialize data

2011-06-06 Thread hiro
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.

Interesting, you seem to be on the right track.



Re: [dev] Best way to serialize data

2011-06-06 Thread Connor Lane Smith
On 6 June 2011 20:07, Džen  wrote:
> I wonder, which method would need more cpu time? I suppose that when
> using delimiters there isn't an easier way than using fgetc(), reading
> through the whole data stream. Hard-coded field lengths would be faster
> if the fields contain a lot of characters I guess.

Again, it's about the use case. For small fields you probably don't
have to worry about reading inefficiencies, so use delimiters. For
large fields use fixed-length or length-prefixed, for the reasons I
mentioned. Both approaches are easy to write: fixed-length is faster
for seeking to the nth item, and length-prefixed is more compact.
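
On the fgetc() worry: once the data sits in a buffer, delimiters can be located with memchr instead of one call per byte. A minimal sketch, with count_cells being a made-up name for this example:

```c
#include <stddef.h>
#include <string.h>

/* Count the cells in buf[0..len): each cell ends at the delimiter delim,
 * and a non-empty tail with no closing delimiter counts as one more cell.
 * memchr scans the buffer in bulk instead of one fgetc() call per byte. */
size_t count_cells(const char *buf, size_t len, int delim)
{
	const char *p = buf, *end = buf + len, *hit;
	size_t n = 0;

	while (p < end && (hit = memchr(p, delim, (size_t)(end - p)))) {
		n++;
		p = hit + 1;
	}
	if (p < end)    /* unterminated trailing cell */
		n++;
	return n;
}
```

The same loop works for '\0' and '\n' alike, since memchr takes an explicit length and does not stop at '\0'.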

cls



Re: [dev] Best way to serialize data

2011-06-06 Thread Džen

Pretty much answers my question. In my use case it'd be easier to use
delimiters like \0 or \n, due to the data not being binary. However now
I wonder, which method would need more cpu time? I suppose that when
using delimiters there isn't an easier way than using fgetc(), reading
through the whole data stream. Hard-coded field lengths would be faster
if the fields contain a lot of characters I guess.

On 06/06/2011 20:22, Connor Lane Smith wrote:
> It ultimately depends on the use case. If you don't need \0 or \n in
> cells, your format is fine. If not, there are two approaches:
>
> As Dieter suggested, you can use fixed length fields. This is great if
> you have a maximum cell width, especially if this length is small or
> most fields use most of the space. This approach is used in, for
> example, tarballs' filename fields.
>
> However, if the cells dramatically vary in length, and the maximum is
> rather large, a better alternative is to use length-prefixing, using a
> number of bytes according to how large you expect your rows and cells
> to be:
>
> 0x000d 0x0006 "hello"\0 0x0007 "world!"\0
>
> That is, 2-byte row length followed by two cells each with a 2-byte
> cell length (and I've null-terminated the strings in the example). You
> may need 4 or 8 bytes if your data is very long. The benefit of this
> is that you can check the row length and jump straight to the next
> row, or carry on into the row and iterate its cells. It is also
> completely independent of content: you can store anything.
>
> The problem with using ASCII values is you can't store binary data,
> and you have to check each cell's content and everything. It's a
> hassle; using length-prefixing is way easier.
>
> (This approach is very often used in binary protocols, such as 9P and Sam.)


--
Džen



Re: [dev] Best way to serialize data

2011-06-06 Thread Connor Lane Smith
On 6 June 2011 19:22, Connor Lane Smith  wrote:
> 0x000d 0x0006 "hello"\0 0x0007 "world!"\0

I made a mistake here, I meant to say...

0x0011 0x0006 "hello"\0 0x0007 "world!"\0

... including two 2-byte cell widths in the row width.
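
To check the arithmetic: the two cells take 6 + 7 = 13 bytes (0x000d), and the two 2-byte cell lengths add 4 more, giving 17 (0x0011). A minimal encoder sketch for this layout follows. The function names are made up, and big-endian byte order is an assumption, since the byte order is not specified above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 16-bit value big-endian (byte order is an assumption). */
static void put16(unsigned char *p, uint16_t v)
{
	p[0] = (unsigned char)(v >> 8);
	p[1] = (unsigned char)(v & 0xff);
}

/* Encode ncells NUL-terminated strings as one row into out:
 *   [row length][cell length][cell bytes + '\0'] ...
 * The row length counts everything after itself, including the 2-byte
 * cell-length prefixes. Returns the bytes written, or 0 if the row does
 * not fit in outsz or its length overflows 16 bits. */
size_t encode_row(unsigned char *out, size_t outsz,
                  const char *const *cells, size_t ncells)
{
	size_t i, body = 0, pos = 2;

	for (i = 0; i < ncells; i++)
		body += 2 + strlen(cells[i]) + 1;
	if (body > 0xffff || body + 2 > outsz)
		return 0;
	put16(out, (uint16_t)body);
	for (i = 0; i < ncells; i++) {
		size_t n = strlen(cells[i]) + 1;   /* include the '\0' */
		put16(out + pos, (uint16_t)n);
		memcpy(out + pos + 2, cells[i], n);
		pos += 2 + n;
	}
	return pos;
}
```

Encoding "hello" and "world!" this way writes 19 bytes in total: the 2-byte row length 0x0011 followed by the 17 bytes it describes.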



Re: [dev] Best way to serialize data

2011-06-06 Thread Connor Lane Smith
Hey,

On 6 June 2011 18:19, Džen  wrote:
> I was wondering which way would be the easiest/simplest to
> serialize data, e.g. read from a file or stdin (the data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 or \n
> inside the "cells", but I couldn't think of a simpler solution.

It ultimately depends on the use case. If you don't need \0 or \n in
cells, your format is fine. If you do, there are two approaches:

As Dieter suggested, you can use fixed length fields. This is great if
you have a maximum cell width, especially if this length is small or
most fields use most of the space. This approach is used in, for
example, tarballs' filename fields.

However, if the cells dramatically vary in length, and the maximum is
rather large, a better alternative is to use length-prefixing, using a
number of bytes according to how large you expect your rows and cells
to be:

0x000d 0x0006 "hello"\0 0x0007 "world!"\0

That is, 2-byte row length followed by two cells each with a 2-byte
cell length (and I've null-terminated the strings in the example). You
may need 4 or 8 bytes if your data is very long. The benefit of this
is that you can check the row length and jump straight to the next
row, or carry on into the row and iterate its cells. It is also
completely independent of content: you can store anything.

The problem with using ASCII delimiter values is that you can't store
binary data, and you have to scan each cell's content for the
delimiter. It's a hassle; using length-prefixing is way easier.

(This approach is very often used in binary protocols, such as 9P and Sam.)
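
A minimal sketch of the "jump straight to the next row, or iterate its cells" idea. It assumes big-endian 2-byte lengths and a row length that also counts the cell-length prefixes (0x0011 for the example row, as corrected in the follow-up in this thread). The function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian value (byte order is an assumption). */
static uint16_t get16(const unsigned char *p)
{
	return (uint16_t)(p[0] << 8 | p[1]);
}

/* Return a pointer to the row after the one at p, or NULL if the row at p
 * is truncated. No cell is inspected: only the row length is read. */
const unsigned char *next_row(const unsigned char *p, const unsigned char *end)
{
	if (end - p < 2 || (size_t)(end - p - 2) < get16(p))
		return NULL;
	return p + 2 + get16(p);
}

/* Count the cells of the row at p by hopping from one cell-length prefix
 * to the next. Returns -1 if the row is malformed. */
int count_row_cells(const unsigned char *p, const unsigned char *end)
{
	const unsigned char *stop = next_row(p, end);
	int n = 0;

	if (!stop)
		return -1;
	for (p += 2; p < stop; p += 2 + get16(p), n++)
		if (stop - p < 2 || (size_t)(stop - p - 2) < get16(p))
			return -1;
	return n;
}
```

Neither function looks at cell contents, which is what makes the format independent of the data it stores.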

Thanks,
cls



Re: [dev] Best way to serialize data

2011-06-06 Thread Džen

On 06/06/2011 19:36, Douglas S. Bregolin wrote:
> In the ASCII table there's a "record separator" character (0x1E). At
> least I think it is better than using '\0'.

On 06/06/2011 19:42, Christoph Lohmann wrote:
> then why wasn't \x1C-\x1F used before for a data exchange format?

To be honest, I've never really cared to look up the use of these
characters, although I knew they were there. They seem to be what
I've been looking for.

> I don't think we are the first to discover them in the ASCII
> table. It would be quite neat for a simpler XML replacement.

I guess that control characters in general are disliked, because
people seem to overlook these bytes when using crippled text editors.

--
Džen



Re: [dev] Best way to serialize data

2011-06-06 Thread Christoph Lohmann

Hello,

Douglas S. Bregolin wrote:
> In the ASCII table there's a "record separator" character (0x1E). At
> least I think it is better than using '\0'.

Then why wasn't \x1C-\x1F used before for a data exchange format?
I don't think we are the first to discover them in the ASCII
table. It would be quite neat for a simpler XML replacement.


Sincerely,

Christoph Lohmann



Re: [dev] Best way to serialize data

2011-06-06 Thread Robert Ransom
On Mon, 06 Jun 2011 19:19:56 +0200
Džen  wrote:

> I was wondering which way would be the easiest/simplest to
> serialize data, e.g. read from a file or stdin (the data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 or \n
> inside the "cells", but I couldn't think of a simpler solution.
> 
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
> 
> What would you recommend? How'd you do it?

http://catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882


Robert Ransom




Re: [dev] Best way to serialize data

2011-06-06 Thread Douglas S. Bregolin
In the ASCII table there's a "record separator" character (0x1E). At
least I think it is better than using '\0'.
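
A minimal sketch of how that could look, pairing the record separator (0x1E) between rows with its ASCII neighbour, the unit separator (0x1F), between cells. The split function is a made-up helper, and the sketch assumes cells never contain either separator or '\0'.

```c
#include <stddef.h>
#include <string.h>

#define US '\x1f'  /* ASCII unit separator: between cells */
#define RS '\x1e'  /* ASCII record separator: between rows */

/* Split the NUL-terminated string s in place at each delim byte, storing
 * up to max field pointers in out. Returns the number of fields stored.
 * Assumes the cells themselves never contain delim or '\0'. */
size_t split(char *s, char delim, char **out, size_t max)
{
	size_t n = 0;

	while (n < max) {
		out[n++] = s;
		if (!(s = strchr(s, delim)))
			break;
		*s++ = '\0';
	}
	return n;
}
```

Calling split once with RS yields the rows, and calling it on each row with US yields that row's cells. Newlines stay available as ordinary cell content.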

On Mon, Jun 6, 2011 at 2:19 PM, Džen  wrote:
> I was wondering which way would be the easiest/simplest to
> serialize data, e.g. read from a file or stdin (the data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 or \n
> inside the "cells", but I couldn't think of a simpler solution.
>
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
>
> What would you recommend? How'd you do it?
>
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.
>
> --
> Džen
>
>



Re: [dev] Best way to serialize data

2011-06-06 Thread Dieter Plaetinck
On Mon, 06 Jun 2011 19:19:56 +0200
Džen  wrote:

> I was wondering which way would be the easiest/simplest to
> serialize data, e.g. read from a file or stdin (the data being a
> table of x rows and y columns, each cell a string). I thought of
> using NUL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 or \n
> inside the "cells", but I couldn't think of a simpler solution.
> 
> Something like:
> a \0 b \0 c \n
> d \0 e \0 f \n
> ...
> 
> What would you recommend? How'd you do it?
> 
> Reason why I'm asking is because I was wondering how a dmenu-alike
> utility would read data, where each item has multiple values, not
> just one. Kinda like a search utility for table-structured data.
> 

the alternative is using implicit boundaries (i.e. hardcoding field
lengths). you gain simplicity at the expense of space consumption.
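
a minimal sketch of hardcoded field lengths, assuming every cell is padded to the same width. FIELD_LEN and both function names are made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_LEN 16   /* assumed fixed width of every cell, in bytes */

/* Return a pointer to cell (r, c) in a table of fixed-width cells stored
 * row after row. No scanning is needed: the offset is pure arithmetic. */
char *cell_at(char *table, size_t ncols, size_t r, size_t c)
{
	return table + (r * ncols + c) * FIELD_LEN;
}

/* Store s into cell (r, c), padding with '\0'. Returns -1 if s does not
 * fit; the last byte is kept as a terminator, so at most FIELD_LEN - 1
 * characters fit. */
int cell_set(char *table, size_t ncols, size_t r, size_t c, const char *s)
{
	size_t n = strlen(s);

	if (n >= FIELD_LEN)
		return -1;
	memset(cell_at(table, ncols, r, c), 0, FIELD_LEN);
	memcpy(cell_at(table, ncols, r, c), s, n);
	return 0;
}
```

the space cost is visible here: a one-character cell still occupies FIELD_LEN bytes.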

you could look at how databases like MySQL, Berkeley DB or SQLite store
their tables on disk.  these things are quite well thought through.

Dieter 



[dev] Best way to serialize data

2011-06-06 Thread Džen

I was wondering which way would be the easiest/simplest to
serialize data, e.g. read from a file or stdin (the data being a
table of x rows and y columns, each cell a string). I thought of
using NUL bytes as cell delimiters and newline characters as row
delimiters. This way it wouldn't be possible to use \0 or \n
inside the "cells", but I couldn't think of a simpler solution.

Something like:
a \0 b \0 c \n
d \0 e \0 f \n
...

What would you recommend? How'd you do it?

Reason why I'm asking is because I was wondering how a dmenu-alike
utility would read data, where each item has multiple values, not
just one. Kinda like a search utility for table-structured data.

--
Džen