http://www.ioremap.net/node/197

Elliptics network: approaching the milestone

What happened during the last 4 days? I did not blog for a while, slacking and thinking about the next step I want to take with the elliptics network.
Let's look at the things in order of appearance.

First, I implemented a set of tools for performance and integrity testing. While they are rather dumb, they make it possible to run a store/load sequence with files of different sizes and manually verify that the stored and loaded copies are the same (by comparing their md5 checksums). I am thinking about a set of fully automatic tests.
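
The integrity part boils down to comparing checksums of the original file and the retrieved copy. A minimal sketch of that check in C, assuming OpenSSL for MD5 (the store/load step itself is elided and can be done with any client tool):

/* Compare the MD5 digests of two files; exits 0 when they match. */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

static int file_md5(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
{
	unsigned char buf[4096];
	size_t n;
	MD5_CTX ctx;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;

	MD5_Init(&ctx);
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
		MD5_Update(&ctx, buf, n);
	fclose(f);

	MD5_Final(digest, &ctx);
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned char d1[MD5_DIGEST_LENGTH], d2[MD5_DIGEST_LENGTH];

	if (argc != 3 || file_md5(argv[1], d1) || file_md5(argv[2], d2))
		return 1;

	return memcmp(d1, d2, MD5_DIGEST_LENGTH) ? 1 : 0;
}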

Second, I implemented configurable server backends for storing data. The backend used in the previous version, which stores transactions as files, was moved into the example server. I plan to add a BerkeleyDB backend and an SMTP-store/IMAP-load one.
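
To give an idea of what a configurable backend means here, a sketch of how such a pluggable backend could be declared in C; the structure and names are illustrative, not the actual elliptics API:

#include <stddef.h>
#include <sys/types.h>

struct storage_backend {
	void *priv;	/* backend-private state: a directory fd, a DB handle, ... */

	/* store a transaction blob under the given 128-bit ID */
	int (*store)(void *priv, const unsigned char id[16],
		     const void *data, size_t size);

	/* load it back; returns the number of bytes read or a negative error */
	ssize_t (*load)(void *priv, const unsigned char id[16],
			void *data, size_t size);
};

A file backend, a BerkeleyDB backend or an SMTP/IMAP one would then differ only in what they plug into this structure.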

During testing I found that the file-based IO storage backend, when run on top of an XFS partition, shows rather low performance with syncs turned on (the file is synced after each object is written). Turning syncs off resulted in 10 times higher numbers: with 10kb writes throughput jumped from 300 KB/s up to 2-3 MB/s. XFS is known to be slow under workloads where a huge number of rather small files is created in a directory (I use 256 directories indexed by the first byte of the transaction ID); for comparison, /dev/shm used as a backend easily soaks up the whole 1GigE bandwidth (more than 110 MB/s of pure data, not counting headers).
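
The directory fan-out mentioned above could look like the following sketch: the first byte of the transaction ID selects one of the 256 directories and the full hex ID names the file (the root prefix is hypothetical):

#include <stdio.h>

static void object_path(char *path, size_t len, const unsigned char id[16])
{
	char hex[33];
	int i;

	for (i = 0; i < 16; i++)
		sprintf(&hex[i * 2], "%02x", id[i]);

	/* e.g. "root/ab/ab03..." for an ID starting with byte 0xab */
	snprintf(path, len, "root/%02x/%s", id[0], hex);
}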

And still I expected better results. Playing with the system, I found (actually proved again) a simple way to livelock two threads which send messages to each other in a ping-pong manner on two different machines. Depending on the socket buffer size, the message length and the number of messages, the threads may block quite easily.

After some work I effectively ended up with a solution where a dedicated thread reads whole transactions from the network and queues them into a per-node list, which is processed by a configurable number of worker threads. Since the receiving thread never sends any data back to the remote nodes, the described livelock is not possible.
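
A minimal sketch of that receive/worker split using pthreads; the structures are illustrative, not the actual implementation. The receiving thread only appends to the list, the workers drain it and do the processing (and any replies), so the reader never blocks on a send:

#include <pthread.h>

struct work_item {
	struct work_item *next;
	void *trans;		/* complete transaction read from the network */
};

static struct work_item *queue;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* called by the dedicated receiving thread */
static void queue_push(struct work_item *w)
{
	pthread_mutex_lock(&lock);
	w->next = queue;
	queue = w;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

/* called by each of the configurable number of worker threads */
static struct work_item *queue_pop(void)
{
	struct work_item *w;

	pthread_mutex_lock(&lock);
	while (!queue)
		pthread_cond_wait(&cond, &lock);
	w = queue;
	queue = w->next;
	pthread_mutex_unlock(&lock);

	return w;
}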

This technique noticeably bumped the performance, but it also introduced some bugs I am working on now.
Another issue is related to the modulo arithmetic over the ring of IDs: right now we have problems during the joining handshake when fetching IDs which are, for example, less than zero (which in the ring wrap around to values near 2^128 - 1).
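
In miniature (64 bits instead of 128 for brevity) the wraparound looks like this: unsigned subtraction wraps modulo 2^64, so stepping below zero lands at the top of the ring:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t id = 0;

	/* prints 0xffffffffffffffff, i.e. 2^64 - 1 */
	printf("%#llx\n", (unsigned long long)(id - 1));
	return 0;
}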

In the meantime I also implemented server-side transformation functions, which may be used to force multiple data copies on behalf of the server. The server could also return its transformation functions to the client, so that the client could use them to fetch data from different nodes, either in parallel or as a failover when some nodes are not accessible. Only the first part of this idea (having server-side transformation functions) is implemented; they are not used by the server right now.
It would also not be a bad idea to have an object deletion command.
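
One way such a transformation function could be declared, as a hypothetical signature for illustration: it maps object data to an ID, and a server configured with several of them would store a copy under each resulting ID:

#include <stddef.h>

struct transform {
	const char *name;	/* e.g. "sha1" -- what the server reports to clients */

	/* fill 'id' with the transformed (hashed) representation of 'data' */
	int (*transform)(const void *data, size_t size, unsigned char id[16]);
};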

Those are the only tasks I have in mind for the next release.

I also wrote a simple distributing 'expect' script, but the cluster I had access to moved away and will be accessible again only in about a week, so there are no fancy numbers with hundreds of nodes in the cloud for now; I will play that game later.

That's the news, expect more soon!

