> On Jan. 25, 2013, 9:58 a.m., Nilay Vaish wrote:
> > A double is 8 bytes, and each character in a text-based output is
> > probably >= 1 byte, depending on the encoding. If the double value
> > actually holds less than 8 characters, I am surprised that a
> > float value does not suffice. What other info does the database
> > include that is increasing its size?

We switched from float to double because of inaccuracies when formulas
were recalculated. I wrote a script which takes the stats from the SQL
database and injects them back into the gem5 Python stats system. This
allowed me to generate a text-based stats file and an SQLite database for
a gem5 run, then inject the data back into the stats system and
re-generate the text-based output to check that the stats were being
stored and retrieved correctly, i.e. that the original stats.txt matched
the one generated from the SQLite database.

When floats were used to store the data in the database, some of the
formulas evaluated to significantly different results, because some of
the accuracy was lost when storing. Changing the storage to double
resolved this, as Python's "float" is actually 64 bits (on most
architectures/Python implementations).

In order to minimise the number of database accesses, vector stats
(vector, vector2d and formulas) are stored as binary blobs in the
database, so that all elements of a vector are held in one field. The
side effect is that if you have, for example, a vector of length 10 with
one actual value and nine NaNs, you still have to store the NaNs.
Naturally, if you then double the space needed to store each value
(including the NaNs), the database becomes very large.

In my view there are two alternatives to the approach in the patch:

1. Store each element of a vector in a separate table, and "reconstruct"
   the vector when we want the values. This has two side effects. First,
   each access requires multiple database accesses, or complex joining
   of tables, which will increase the access time. Secondly, if each
   element is stored by itself, it also needs to be stored with the ID
   of the stat it belongs to, the index of the dump it belongs to and
   its position within the vector. This potentially requires more space
   than the approach in this patch.
   That said, it would allow only specific elements of the vector to be
   pulled from the database.

2. Manually pack the data into the blob field. We could store only the
   non-NaN data by packing it as <index within vector><value as double>.
   This has the advantage of only storing the data we care about
   (although we have the additional overhead of storing the index within
   the vector), and we could pull this data out with one database
   access. However, we would then have the overhead of packing and
   unpacking the data, which is potentially very slow.

Personally, I don't think that either of these solutions is ideal, but I
think that the solution in the patch presents a fairly foolproof way of
storing the data. Of course, I am more than open to suggestions, but I
think it will always be a trade-off between elegance, size, speed and
accuracy.

- Sascha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1646/#review3914
-----------------------------------------------------------


On Jan. 15, 2013, 10:36 a.m., Andreas Hansson wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/1646/
> -----------------------------------------------------------
>
> (Updated Jan. 15, 2013, 10:36 a.m.)
>
>
> Review request for Default.
>
>
> Description
> -------
>
> Changeset 9499:bc23f2c316fc
> ---------------------------
> stats: Store vector stats using doubles and compress with zlib
>
> This patch changes any arrays of values to be stored as an array of doubles,
> rather than floats in the SQL database. This is required as floats lose too
> much accuracy. For example, if the stats are read from the database, and
> injected back into gem5's stats system, then formulas can be recalculated.
> If floats are used, these formulas evaluate to be different from those
> originally calculated when creating the SQL database.
>
> As doubles take up twice the space of a float (8 Bytes vs 4 Bytes) the SQL
> database becomes larger. The end result is that the database is larger than
> the text based output without compression. Therefore, as the vector storage is
> already not human readable we compress this field using zlib. zlib has been in
> the python standard library since version 1.5.1. so it is already covered in
> the gem5 build prerequisites.
>
>
> Diffs
> -----
>
>   src/python/m5/stats/sql.py PRE-CREATION
>
> Diff: http://reviews.gem5.org/r/1646/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Andreas Hansson
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
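
For reference, the storage layouts discussed in this thread can be
sketched in a few lines of Python. This is only an illustration and not
the code in the patch: the helper names (pack_dense, unpack_dense,
pack_sparse, unpack_sparse) and the little-endian
<uint32 index><double value> layout are assumptions made for the
example. It uses the vector from the reply (length 10, one actual value,
nine NaNs) and shows the float round-trip error next to the size of each
layout.

```python
import math
import struct
import zlib


def pack_dense(values, fmt="d"):
    """Pack every element, NaNs included, as floats ('f') or doubles ('d')."""
    return struct.pack("<%d%s" % (len(values), fmt), *values)


def unpack_dense(blob, fmt="d"):
    """Inverse of pack_dense for the same format character."""
    count = len(blob) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), blob))


def pack_sparse(values):
    """Pack only non-NaN elements as <uint32 index><double value> pairs."""
    out = bytearray()
    for index, value in enumerate(values):
        if not math.isnan(value):
            out += struct.pack("<Id", index, value)
    return bytes(out)


def unpack_sparse(blob, length):
    """Rebuild the full vector, filling the missing positions with NaN."""
    values = [float("nan")] * length
    for index, value in struct.iter_unpack("<Id", blob):
        values[index] = value
    return values


# The example from the reply: length 10, one real value, nine NaNs.
vec = [float("nan")] * 10
vec[3] = 0.1

dense32 = pack_dense(vec, "f")       # 10 * 4 = 40 bytes
dense64 = pack_dense(vec, "d")       # 10 * 8 = 80 bytes
compressed = zlib.compress(dense64)  # size depends on the data
sparse = pack_sparse(vec)            # 1 * (4 + 8) = 12 bytes

# A double survives the round trip exactly; a float does not.
print("double round trip exact:", unpack_dense(dense64, "d")[3] == 0.1)
print("float round trip exact: ", unpack_dense(dense32, "f")[3] == 0.1)
print("sizes:", len(dense32), len(dense64), len(compressed), len(sparse))
```

The sparse layout is the smallest here only because the vector is almost
empty. For a fully populated vector it costs 12 bytes per element
instead of 8, which is the index overhead mentioned under alternative 2.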
