On Friday 07 Sep 2012 00:43:44 postwall-free...@yahoo.de wrote:
> >Freetalk and WoT may be better designed on that level. However they 
> >produce more disk I/O. And the reason for this is, mainstream database 
> >practice and theory are designed for two cases:
> >1. Lots of data (absolute amount and transactions/sec) on fast, reliable 
> >hardware with professional sysadmins.
> >2. Tiny amounts of data on commodity hardware.
> 
> >If you have lots of data on commodity disks, it falls down. And note that 
> >mainstream use of databases in the second case frequently fails. For 
> >example, I have a huge set of bookmarks in a tree on one of my browsers. 
> >Every time I create a new category, another one gets renamed.
> 
> That's not a database error, that's an application error for sure. The reason 
> databases fail on standard consumer hardware is that these systems often 
> buffer writes at various levels, undermining the whole ACID architecture. This 
> leads to half-applied transactions or simply database corruption on power 
> failure or even just OS crashes. The results are random, though the standard 
> one is a corrupt DB which just doesn't come back up; any replicable 
> behaviour isn't likely to be related to this. 
> 
> Apart from that, databases of any size will run reliably on any hardware as 
> long as you ensure fsync does what it is supposed to do and don't 
> intentionally or unintentionally disable any of the transaction-based safety 
> features. Performance will suffer on consumer hardware, greatly so if the 
> indices are too big to be cached, but identical workloads run slower on slower 
> hardware, no surprise there. It doesn't mean that there is no point in running 
> large DBs on consumer hardware, or that it is somehow inherently unreliable 
> and you end up with randomly modified data sets.
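> 
> To make the fsync point concrete: whether a write actually reaches the 
> platter depends on explicitly forcing it. A minimal Java sketch (file name 
> and payload are made up; the drive's own cache still has to honour the 
> flush):
> 
>     import java.nio.ByteBuffer;
>     import java.nio.channels.FileChannel;
>     import java.nio.file.Paths;
>     import java.nio.file.StandardOpenOption;
> 
>     public class FsyncDemo {
>         public static void main(String[] args) throws Exception {
>             try (FileChannel ch = FileChannel.open(Paths.get("journal.dat"),
>                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
>                 ch.write(ByteBuffer.wrap("one committed transaction".getBytes("UTF-8")));
>                 // Without force() the bytes may sit in the OS (or drive) write
>                 // cache; a power failure can then lose or tear the write.
>                 ch.force(true); // flush data and metadata down to the device
>             }
>         }
>     }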
> 
> 
> 
> >> The approach of fred to not use it is just wrong from a computer-science 
> >> perspective.
> >> 
> >> The fact that it does not perform so well is an implementation issue of 
> >> the 
> >> database, not of the client code which uses the database.
> >
> >No it is not. The load you put on it requires many seeks.
> 
> Are you sure of that? The primary database will require seeking on pretty 
> much any access due to its nature, alright, but it doesn't actually need to be 
> accessed all that often. How many requests must a busy node handle 
> per second? Maybe 10, if that, and those will be lucky to cause 10 I/Os. Your 
> standard SATA disk can deal with more than that; the standard figures are 
> usually 50-100 IOPS. Write requests to the DBs, caused by insert requests 
> or fetched unknown data, which will require multiple seeks, will drive the 
> load up a bit, but if the "writes per second to DB" figures of my node are any 
> indication then this won't cause severe disk load either; it just doesn't 
> occur often enough.
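> 
> A back-of-the-envelope check of those figures (the numbers are the rough 
> estimates above, not measurements):
> 
>     public class IopsEstimate {
>         public static void main(String[] args) {
>             int requestsPerSecond = 10; // busy node store accesses (guess above)
>             int seeksPerRequest   = 1;  // roughly one seek per store lookup
>             int diskIops          = 50; // low end of the usual 50-100 IOPS range
>             int demand = requestsPerSecond * seeksPerRequest;
>             System.out.println("demand " + demand + " IOPS vs budget " + diskIops
>                     + " IOPS, headroom " + (diskIops - demand));
>         }
>     }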
> 
> 
> This is also my general experience with Freenet: the main DBs aren't the 
> issue. Just keep the node running on its own and there won't be any issues, 
> even with 500+ GB stores on old 5400 rpm SATA disks. Load the node with local 
> requests, be it downloads or something else, and the whole thing gets 
> severely disk-limited very, very fast. With stuff like WoT, Freetalk or the 
> Spider it is even worse; these can easily push things to unbearable levels in 
> my experience.
> 
> 
> So IMHO the offender is rather the db4o database, which is basically what the 
> client code uses. The load here is different though; I don't really see how it 
> must strictly be seek-heavy for long durations of time. Splitfile handling for 
> example is more of a bulk-data thing. Ideally it should get the data out of 
> the DB as needed during a decode, keeping temporary results in memory or in 
> temp files if needed, and only store the final result back to the DB, 
> minimising writes. This is still the equivalent of reading a terribly 
> fragmented file from disk, but that's not an insurmountable task for a 
> standard disk. But considering how I/O-limited splitfile decoding often is on 
> standard disks, and the time it takes, I would really suspect that it loads 
> the DB with unneeded stuff, trying to minimise memory footprint, temp-file 
> usage or something like that. Especially since just the data retrieval, before 
> the request actually completes, is already often I/O-heavy in my experience.
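> 
> Roughly what I have in mind, as a sketch; every name here is an invented 
> placeholder, not a Freenet API:
> 
>     interface BlockStore {                       // placeholder abstraction
>         byte[] readBlock(long segment, int index);
>         void writeResult(long segment, byte[] data);
>     }
> 
>     interface FecCodec {                         // placeholder abstraction
>         byte[] decode(byte[][] blocks);
>     }
> 
>     class SegmentDecoder {
>         static byte[] decodeSegment(BlockStore store, FecCodec codec,
>                                     long segment, int blockCount) {
>             byte[][] blocks = new byte[blockCount][];
>             for (int i = 0; i < blockCount; i++)
>                 blocks[i] = store.readBlock(segment, i); // bulk reads, no DB writes
>             byte[] decoded = codec.decode(blocks);        // purely in memory
>             store.writeResult(segment, decoded);          // one write at the very end
>             return decoded;
>             // A crash before writeResult persists nothing, so the decode just
>             // restarts from the stored blocks; no temporary state lands in the DB.
>         }
>     }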
> 
> The same goes for Freetalk & Co: they can't really cause many transactions 
> per second, since Freenet just can't fetch data fast enough to cause enough 
> changes, so either they run really complex queries on really large or complex 
> data sets, or the DB is inefficiently constructed or inefficiently accessed. 
> No idea, but it certainly seems weird.
> 
> 
> I suspect in general that the db4o database is loaded with way too many 
> writes per second, which is what will kill DB performance fast. Even your 
> average professionally run DB, which you talk about above, quite often runs 
> on a RAID 5 or even RAID 6 array of 10k rpm disks, simply because 
> that's a cost-effective way to store bulky stuff, and cost is saved everywhere 
> if at all possible. Those arrays will happily deal with very high read IOPS 
> compared to your standard consumer disk, but won't be happy at all about lots 
> of random writes either, although big battery-backed write buffers help to 
> some degree. So I am not too sure that this is just an issue of dealing with 
> consumer hardware or the inherent type of load Freenet needs to deal with, and 
> not actually just inefficient use of the DB paired with a DBM that is 
> inefficient for this use-case to begin with.
> 
> 
> 
> Apart from the whole performance thing: 
> 
> I doubt that fred will get away from rollbacks completely just by never 
> manually triggering one; at least a node crash and any query errors should 
> cause an automatic rollback to the state of the last commit. Does fred take 
> this into account, meaning that it only commits when one logical 
> "transaction" is done? If fred just commits every X DB actions, or if the 
> trigger is time-based, then this could cause logical data corruption in the 
> form of orphaned entries, partially modified ones and so on whenever the node 
> is killed hard in some form or other, which could then explain some of 
> the db4o corruption going on. This would basically be true transaction abuse.
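> 
> By "commit on logical boundaries" I mean something like the following, in 
> db4o 7.x-style API; the Download class and its fields are invented for the 
> example:
> 
>     import com.db4o.Db4oEmbedded;
>     import com.db4o.ObjectContainer;
> 
>     public class CommitPerLogicalChange {
>         static class Download {                  // invented, not fred's class
>             String uri;
>             boolean[] fetchedBlocks;
>             Download(String uri, int blocks) {
>                 this.uri = uri;
>                 fetchedBlocks = new boolean[blocks];
>             }
>         }
> 
>         public static void main(String[] args) {
>             ObjectContainer db = Db4oEmbedded.openFile(
>                     Db4oEmbedded.newConfiguration(), "example.db4o");
>             try {
>                 Download d = new Download("some-uri", 128);
>                 d.fetchedBlocks[0] = true;  // one logical change...
>                 db.store(d);
>                 db.commit();                // ...one commit: never half-applied
>             } catch (RuntimeException e) {
>                 db.rollback();              // drop back to the last consistent state
>                 throw e;
>             } finally {
>                 db.close();
>             }
>         }
>     }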
> 
> 
> Considering rollbacks generally "good" or "the right way", though, isn't 
> really right in my opinion. Avoiding transactions is evil; avoiding rollbacks 
> might be, but isn't necessarily so. Abusing rollbacks to implement standard 
> application logic (just always start inserting or updating stuff, and 
> if you notice half-way through that you don't actually need to do so, 
> roll back) is certainly evil too, using up DB resources and locking tables 
> for no good reason. Best paired with long-running transactions involving as 
> many tables as possible, to truly lock down the whole DB for everyone else for 
> as long as possible.
> 
> 
> 
> >Of course, it might require fewer seeks if it was using a well-designed 
> >SQL schema rather than trying to store objects. And I'm only talking 
> >about writes here; obviously if everything doesn't fit in memory you 
> >need to do seeks on read as well, and they can be very involved given 
> >db4o's lack of two-column indexes. However, because of the constant 
> >fsync's, we still need loads of seeks *even if the whole thing fits in 
> >the OS disk cache*!
> 
> 
> I don't know db4o, but single-column indices shouldn't be a big problem as 
> long as the result set per index doesn't get too big. Now if you have some 
> kind of boolean column somewhere in a big table, then this will basically lead 
> to a partial index scan, which is still doable if the index is in memory, but 
> will end up horribly if it is not. Has anyone checked how multiple-column 
> queries are executed by db4o? Some DBMSs have quite naive query optimisers 
> which fall back to table scans for anything less than an ideal index 
> situation, so this might be worth a shot ...
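> 
> For anyone who wants to check: a two-field query in db4o's SODA API looks 
> roughly like this (the Message class and its fields are made up for the 
> example); the interesting question is whether db4o intersects the two 
> single-column indexes or ends up filtering object by object:
> 
>     import java.util.Date;
> 
>     import com.db4o.ObjectContainer;
>     import com.db4o.ObjectSet;
>     import com.db4o.query.Query;
> 
>     public class TwoFieldQuery {
>         static class Message { String board; Date date; }   // invented class
> 
>         static ObjectSet<?> recentMessages(ObjectContainer db, String board,
>                                            Date cutoff) {
>             Query q = db.query();
>             q.constrain(Message.class);
>             q.descend("board").constrain(board);           // first indexed field
>             q.descend("date").constrain(cutoff).greater(); // second constraint
>             return q.execute();
>         }
>     }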
> 
> 
> 
> >However, the bottom line is if you have to commit every few seconds, you 
> >have to fsync every few seconds. IMHO avoiding that problem, for 
> >instance by turning off fsync and making periodic backups, would 
> >dramatically improve performance.
> 
> Fsync is what keeps your data safe; it isn't some unnecessary habit the DBM 
> indulges in because it likes to throw a tantrum or something. Disabling it 
> and trying to resort to backups is a good classic way to make one of the core 
> parts of any DBM worth its name completely and utterly useless, and to end up 
> with messy corruption-detection routines (no, the DB isn't all fine just 
> because it still loads) and, in the end, with completely inconsistent data 
> sets, broken backups and what not. Have fun with that.
> 
> 
> 
> >As regards Freenet I am leaning towards a hand-coded on-disk structure. 
> >We don't use queries anyway, and mostly we don't need them; most of the 
> >data could be handled as a series of flat-files, and it would be far 
> >more robust than db4o, and likely faster too.
> 
> Usually one wants to store some kind of object and reference it via a key. 
> To store it in a flat file one needs to serialise it to a string, or rather 
> to binary data to be exact. Your standard RDBMS works with tables which can 
> store any sort of binary data and provide access to it via a key, multiple 
> ones even, and, using indices, do it resource-efficiently too. This must be 
> fate ;)  Seriously, a flat-file, hand-coded and optimised data storage may 
> outperform a general-purpose DBM in some situations, as is the case with all 
> hand-crafted stuff compared to general-purpose stuff, IF the hand-crafted 
> stuff is done well. But I very much doubt that you will ever get 
> close to the reliability your typical DBM provides, surviving software 
> crashes, hardware malfunctions and what not, as long as the circumstances 
> allow it to in any way. 
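> 
> That is, the hand-rolled version of "store an object under a key" boils down 
> to something like this (a minimal sketch using plain Java serialisation; the 
> one-file-per-key layout is made up):
> 
>     import java.io.*;
> 
>     public class FlatFileStore {
>         static void put(File dir, String hexKey, Serializable value) throws IOException {
>             try (ObjectOutputStream out = new ObjectOutputStream(
>                     new FileOutputStream(new File(dir, hexKey + ".obj")))) {
>                 out.writeObject(value);   // the "serialise to binary data" step
>             }
>         }
> 
>         static Object get(File dir, String hexKey)
>                 throws IOException, ClassNotFoundException {
>             try (ObjectInputStream in = new ObjectInputStream(
>                     new FileInputStream(new File(dir, hexKey + ".obj")))) {
>                 return in.readObject();
>             }
>             // No transactions, no secondary indices, no crash recovery: exactly
>             // the parts a DBM adds on top of "binary data accessible via a key".
>         }
>     }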
> 
> Now trading reliability for more speed may be a valid thing sometimes, 
> agreed, the same as trading architectural "cleanness", like for example 
> normalisation, for more performance is a valid thing to do sometimes. But 
> while nobody will really care if some key in their node's store links to the 
> wrong data, and most won't even care if the entire store corrupts (as long as 
> it automatically "repairs" by starting from scratch, of course), people will 
> scream murder if downloads vanish, uploads corrupt and so on, which is likely 
> to happen with any hand-crafted storage solution, especially on flaky 
> hardware and in general at first. If a DBM worth its name doesn't manage to 
> hold on to its data, anything hand-crafted won't either. 
> 
> Please don't go down that road. db4o may be slow, but at least it works 
> somewhat reliably, although less so than one might want; and slow storage is 
> still much better than unreliable storage for anything but data one doesn't 
> really care about anyway. Switch to another, non-object-oriented DBM if that's 
> needed, but please don't go back to some hand-crafted stuff which will work in 
> release x, randomly corrupt in release y and fill your disk in release z. If 
> this release cycle also holds true for db4o, by the way, then that's a pretty 
> horrible result for a DBM.
> 
> 
> In terms of speed it boils down to this anyway: if the DB needs to cause 
> seek-heavy I/O to fulfil a query, then any hand-crafted storage will have to 
> do the same, assuming the DB and query aren't built inefficiently. If there 
> is much to gain by using hand-crafted storage, then there should be room 
> for improvement in the use of the DB too. For example by not storing 
> temporary results in the DB, like temporary results during splitfile 
> decoding, but just restarting the calculation from scratch if the node really 
> crashes.
> 
> 
> 
> Hm, this email got pretty long in the end. Sorry for that - end of rant.

Okay, I'm gonna gather this together into useful pieces. Please feel free to 
have a look at the code!

GOALS AND PERFORMANCE:
Commodity disks are not expected or designed to be reliable if you use them 
heavily all the time. They're the one component in a typical PC that doesn't obey 
the "if you can't run it at 100% all the time it's broken" rule. :( The 
consequence of this is that constant writes are likely to break a commodity 
hard disk - that is, to significantly reduce its working lifespan.

If Freenet's disk I/O is so heavy that it disrupts the user's ability to use 
everything else on their computer, they are likely to uninstall, or at least 
not recommend it to their friends who have fewer spare computers!

THE DATASTORE:
The datastore is fast and reliable. It could be a bit faster. Currently a write 
requires one seek to write the data and another to update the slot filter; next 
includes code to write the slot filter periodically. Grouping writes in RAM and 
flushing them periodically would further reduce the performance impact, since 
the writes are all close together on the disk, which is closer to what the disk 
wants. Note that the datastore does not use a database; 
it's an on-disk hashtable (with salting, using 5 possible slots for each key, 
and so on).
(SaltedHashFreenetStore.java and other classes in src/freenet/store/saltedhash/)
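
This is roughly the idea, not the actual SaltedHashFreenetStore code: the key 
is hashed together with a local salt and probed at a handful of candidate 
slots. The constants and layout below are illustrative only.

    import java.security.MessageDigest;

    public class SaltedSlots {
        static final int SLOTS_PER_KEY = 5;

        static long[] candidateSlots(byte[] routingKey, byte[] salt, long storeSlots)
                throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt);                  // per-store secret salt
            md.update(routingKey);
            byte[] h = md.digest();
            long[] slots = new long[SLOTS_PER_KEY];
            for (int i = 0; i < SLOTS_PER_KEY; i++) {
                long v = 0;
                for (int j = 0; j < 8; j++)   // take 8 bytes per candidate
                    v = (v << 8) | (h[(i * 4 + j) % h.length] & 0xff);
                slots[i] = Long.remainderUnsigned(v, storeSlots);
            }
            return slots;                     // probe these offsets in the store file
        }
    }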

FREETALK AND WEB OF TRUST:
Freetalk and WoT need a real database as a backend. This should probably be 
SQL-based, so that the structures are a bit better designed for storage, and so 
that the engines can be swapped out. And yes, I am quite convinced that they 
use it inefficiently.

RELIABILITY:
Right now db4o is corrupting itself regularly for a significant minority of 
users, most of them doing uploads. We have had regular self-corruption problems 
with the database we used for the old datastore too, which was Berkeley DB Java 
Edition. In my experience, databases reduce reliability, rather than increasing 
it. However the next step is to upgrade db4o.

CLIENT LAYER AND MEMORY:
Freenet's client layer was originally ported to db4o in an attempt to reduce 
memory usage and allow an unlimited number of downloads regardless of available 
memory; however, optimisations since then to reduce disk I/O have used more 
memory. Downloads in particular need no database writes at all except when they 
download a block, FEC decode a segment, start a new layer in the splitfile 
pyramid / unpack a container / follow a redirect, or finally reassemble, 
decompress, filter, copy etc a file. Also, downloads use an in-database 
flatfile table for each segment to store the keys, reducing the number of 
unnecessary object-related seeks. Uploads are much less streamlined, but 
inherently need to do more work, since they obtain a key for each block, and 
don't usually fail to insert a block. FEC decoding involves one seek per block 
for the actual data but also updates lots of data structures, which involve 
more seeks at the db4o level.
(freenet/client/async/ mostly)
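
To illustrate the per-segment key table (a simplification, not the actual 
classes in freenet/client/async/): all of a segment's keys live in one array 
inside one persisted object, so reading them costs one object activation 
rather than one per key.

    public class SegmentKeys {
        static final int KEY_LENGTH = 32;   // illustrative fixed key size
        private final byte[] packedKeys;    // blockCount * KEY_LENGTH bytes

        public SegmentKeys(int blockCount) {
            packedKeys = new byte[blockCount * KEY_LENGTH];
        }

        public void setKey(int blockNo, byte[] key) {
            System.arraycopy(key, 0, packedKeys, blockNo * KEY_LENGTH, KEY_LENGTH);
        }

        public byte[] getKey(int blockNo) {
            byte[] out = new byte[KEY_LENGTH];
            System.arraycopy(packedKeys, blockNo * KEY_LENGTH, out, 0, KEY_LENGTH);
            return out;
        }
        // Persisting this single object touches one record in the database,
        // instead of one small object plus index entry per key.
    }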

CONSISTENCY, BACKUPS AND FSYNC:
Freenet commits periodically and when something important happens, rather than 
on every logical transaction. However, we do have fsync enabled. If there is a 
crash, db4o rolls back and the database should then be in a consistent, albeit 
outdated, state again. This setup minimises the number of seeks, while avoiding 
data corruption. Downloaded blocks are stored in the persistent-temp.blob file, 
which the database indexes; other temporary files are stored as files. This is 
synchronized with the temporary file handling by simply not freeing any 
temporary files/blocks until after the next commit has completed.
(NodeClientCore, PersistentTempFileBucketFactory etc)
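
The "don't free until after the commit" rule is the key trick; roughly like 
this (a sketch, not the actual PersistentTempFileBucketFactory code):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class DeferredFreeQueue {
        private final List<File> pendingFree = new ArrayList<File>();

        public synchronized void freeAfterCommit(File tempFile) {
            // Still referenced by the last committed database state, so keep it.
            pendingFree.add(tempFile);
        }

        /** Call only after the database commit has actually completed. */
        public synchronized void onCommitCompleted() {
            for (File f : pendingFree)
                f.delete();                 // no committed state can refer to it now
            pendingFree.clear();
        }
        // If the node crashes before the commit, the rolled-back state still
        // finds its temp files on disk; at worst some space is leaked, never data.
    }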

Since data loss inevitably happens due to e.g. hardware problems, we do need 
backups.

However, if we have backups, we can ensure that they are consistent, and we can 
synchronize the tempfile handling similarly to the above. The result is 
that we effectively use much longer transactions: we can always return to a 
consistent state, and will do so if we have an unclean shutdown, but we can 
turn off fsync on the "working" file. The logical next step is to buffer more 
aggressively and aggregate writes. IMHO this is a viable and useful 
optimisation, particularly for Freetalk/WoT.

FLAT FILE IMPLEMENTATION:
My proposal is that we use a hand-coded flat file table, in a separate file, to 
store the keys for each download, and a few other minor details. For downloads 
this should never be written except for the flags indicating which blocks we 
have, but it should eliminate most of the database I/O and memory usage. For 
uploads we would need to write the block keys etc. This would be *significantly 
MORE reliable* than db4o. We would also need to use a discrete file for the 
downloaded data, instead of our current persistent-temp.blob implementation; 
this would be slightly faster; disk space usage would be larger, with a higher 
peak on completing a download, but more predictable; and it would be slightly 
worse if the node is seized. The core classes, dealing with things such as 
redirects, could be handled separately.  Obviously if we keep 
persistent-temp.blob we need a database.
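
As a rough sketch of the sort of on-disk layout I have in mind (record sizes, 
field names and the magic number are illustrative, nothing is decided):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Per-download key table: a small header, then one 32-byte key per block,
    // then one "do we have it yet" flag byte per block. For downloads the keys
    // are written once; only the flag bytes change afterwards.
    public class DownloadKeyTable implements AutoCloseable {
        static final int KEY_LENGTH = 32;
        static final int HEADER = 8;           // 4-byte magic + 4-byte block count
        private final RandomAccessFile raf;
        private final int blockCount;

        public DownloadKeyTable(String path, int blockCount) throws IOException {
            this.raf = new RandomAccessFile(path, "rw");
            this.blockCount = blockCount;
            raf.setLength(HEADER + blockCount * (KEY_LENGTH + 1L));
            raf.writeInt(0x46424b54);           // arbitrary magic for this sketch
            raf.writeInt(blockCount);
        }

        public void writeKey(int blockNo, byte[] key) throws IOException {
            raf.seek(HEADER + (long) blockNo * KEY_LENGTH);
            raf.write(key, 0, KEY_LENGTH);
        }

        public void markFetched(int blockNo) throws IOException {
            raf.seek(HEADER + (long) blockCount * KEY_LENGTH + blockNo);
            raf.writeByte(1);                   // the only write path for downloads
        }

        public void close() throws IOException {
            raf.close();
        }
    }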

But there is no point starting on this until we have upgraded db4o. I'm not 
sure how much work it would be.
