> On Jan. 25, 2013, 9:58 a.m., Nilay Vaish wrote:
> > A double is 8 bytes, and each character in a text-based output is
> > probably >= 1 byte, depending on the encoding. If the double value
> > actually holds fewer than 8 characters, I am surprised that a
> > float value does not suffice. What other info does the database
> > include that is increasing its size?
>
> Sascha Bischoff wrote:
>     The reason for switching from float to double was inaccuracies
>     when formulas were recalculated.
>
>     I wrote a script which takes the stats from the SQL database and
>     injects them back into the gem5 python stats system. This allowed me
>     to generate a text-based stats file and an SQLite database for a gem5
>     run, then inject the data back into the stats system and re-generate
>     the text-based output to ensure that the stats were being stored and
>     retrieved correctly, i.e. that the original stats.txt matched the one
>     generated from the SQLite database. When floats were used to store the
>     data in the database, some of the formulas evaluated to significantly
>     different results as some of the accuracy was lost when storing. This
>     issue was resolved when changing the storage to double, as python's
>     "float" is actually 64 bits (on most architectures/python
>     implementations).
>
>     However, in order to minimise the number of database accesses, vector
>     stats (vector, vector2d and formulas) are stored as binary blobs in
>     the database, thereby storing all elements of the vector in one field
>     in the database. This has the side effect that if you have, for
>     example, a vector of length 10 with one actual value and nine NaNs,
>     you still have to store the NaNs. Naturally, if you then double the
>     space needed to store each value (including the NaNs), the database
>     becomes very large.
>
>     In my view there are two alternatives to the approach in the patch:
>
>     1. Store each element of a vector in a separate table, and
>     "reconstruct" the vector when we want the values. This has two side
>     effects. First of all, each access requires multiple database
>     accesses, or complex joining of tables, which will increase the access
>     time. Secondly, if each element is stored by itself it also needs to
>     be stored with the ID of the stat it belongs to, the index of the dump
>     it belongs to and its position within the vector. This potentially
>     requires more space than the approach in this patch. That said, it
>     would allow only specific elements of the vector to be pulled from the
>     database.
>
>     2. Manually pack the data into the blob field. We could store only the
>     data which is non-NaN by manually packing the data so that we store
>     <index within vector><value as double>. This has the advantage of only
>     storing the data we care about (although we have the additional
>     overhead of storing the index within the vector) and we could pull
>     this data out with one database access. However, we do then have the
>     overhead of packing and unpacking the data, which is potentially very
>     slow and time consuming.
>
>     Personally I don't think that any of these solutions is ideal, but I
>     think that the solution in the patch presents a fairly foolproof way
>     of storing the data. Of course, I am more than open to suggestions,
>     but I think it will always be a trade-off between elegance, size,
>     speed and accuracy.
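
For reference, alternative 2 above could look roughly like the sketch below. This is not code from the patch: the record layout (an unsigned 32-bit index followed by a 64-bit double, little-endian) and the helper names pack_sparse and unpack_sparse are assumptions made purely for illustration.

```python
import math
import struct

# Hypothetical record layout: unsigned 32-bit index followed by a 64-bit
# double, little-endian with no padding (12 bytes per stored element).
RECORD = struct.Struct("<Id")


def pack_sparse(values):
    """Pack only the non-NaN elements of a vector as (index, value) records."""
    return b"".join(
        RECORD.pack(i, v) for i, v in enumerate(values) if not math.isnan(v)
    )


def unpack_sparse(blob, length):
    """Rebuild the full vector, filling positions that were not stored with NaN."""
    values = [float("nan")] * length
    for i, v in RECORD.iter_unpack(blob):
        values[i] = v
    return values


# A vector of length 10 with one actual value and nine NaNs.
vec = [float("nan")] * 10
vec[3] = 0.1
blob = pack_sparse(vec)
restored = unpack_sparse(blob, len(vec))
```

In this sketch the single stored element costs 12 bytes instead of 80 for ten dense doubles, but the vector length has to be kept somewhere else, and a mostly non-NaN vector would end up larger than the dense form because of the per-element index.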
The second approach is what I would personally prefer. It is pretty common to
store sparse matrices / vectors that way. Note that even compression is 'slow
and time consuming'. But I'll let you decide the approach you want to take.


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1646/#review3914
-----------------------------------------------------------


On Jan. 15, 2013, 10:36 a.m., Andreas Hansson wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/1646/
> -----------------------------------------------------------
>
> (Updated Jan. 15, 2013, 10:36 a.m.)
>
>
> Review request for Default.
>
>
> Description
> -------
>
> Changeset 9499:bc23f2c316fc
> ---------------------------
> stats: Store vector stats using doubles and compress with zlib
>
> This patch changes any arrays of values to be stored as an array of doubles,
> rather than floats, in the SQL database. This is required as floats lose too
> much accuracy. For example, if the stats are read from the database and
> injected back into gem5's stats system, then formulas can be recalculated. If
> floats are used, these formulas evaluate to be different from those
> originally calculated when creating the SQL database.
>
> As doubles take up twice the space of a float (8 Bytes vs 4 Bytes), the SQL
> database becomes larger. The end result is that, without compression, the
> database is larger than the text-based output. Therefore, as the vector
> storage is already not human readable, we compress this field using zlib.
> zlib has been in the Python standard library since version 1.5.1, so it is
> already covered by the gem5 build prerequisites.
>
>
> Diffs
> -----
>
>   src/python/m5/stats/sql.py PRE-CREATION
>
> Diff: http://reviews.gem5.org/r/1646/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Andreas Hansson
>
>
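
The float-versus-double point and the zlib step from the patch description can be reproduced with the standard library alone. The sketch below is only an illustration under assumptions: the actual blob layout used by sql.py is not shown in this thread, so the little-endian struct formats here are made up for the example.

```python
import struct
import zlib

nan = float("nan")
values = [0.1] + [nan] * 9  # one actual value and nine NaNs, as in the thread

# Assumed dense layouts: ten 32-bit floats versus ten 64-bit doubles.
as_floats = struct.pack("<10f", *values)   # 4 bytes per element
as_doubles = struct.pack("<10d", *values)  # 8 bytes per element

# Python's float is a C double on typical builds, so the 64-bit round trip
# returns the original value while the 32-bit one is rounded.
from_float = struct.unpack("<10f", as_floats)[0]
from_double = struct.unpack("<10d", as_doubles)[0]

# Compress the double blob before it would be written to the database,
# then decompress and unpack it again as a reader would.
compressed = zlib.compress(as_doubles)
restored = struct.unpack("<10d", zlib.decompress(compressed))
```

Here from_double compares equal to 0.1 while from_float does not, which is the kind of drift that changes recalculated formulas. How much zlib saves depends on the data, so the size of compressed is best measured on real stats rather than assumed.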
