I already have a working de-duplicating ROBUST asset service using hashing. It was not at all hard to do (largely because SRAS had already demonstrated how it could be done), so complexity on this end is not an issue.
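The basic idea is easy to sketch. The following is only an illustration in Python (the real service is not written like this; the schema, table names and function names are invented for the example), but it shows how hashing gives you de-duplication at write time:

    import hashlib
    import sqlite3

    # Hypothetical schema: one row per unique blob, keyed by the SHA-256 of its
    # data, plus a mapping from asset IDs to the hash of the blob they refer to.
    conn = sqlite3.connect("assets.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS blobs  (hash TEXT PRIMARY KEY, data BLOB);
        CREATE TABLE IF NOT EXISTS assets (asset_id TEXT PRIMARY KEY, hash TEXT);
    """)

    def store_asset(asset_id, data):
        # De-duplication happens at write time: identical data hashes to the
        # same key, so the blob itself is only ever stored once.
        h = hashlib.sha256(data).hexdigest()
        conn.execute("INSERT OR IGNORE INTO blobs (hash, data) VALUES (?, ?)",
                     (h, data))
        conn.execute("INSERT OR REPLACE INTO assets (asset_id, hash) VALUES (?, ?)",
                     (asset_id, h))
        conn.commit()

    def fetch_asset(asset_id):
        row = conn.execute(
            "SELECT b.data FROM assets a JOIN blobs b ON a.hash = b.hash"
            " WHERE a.asset_id = ?", (asset_id,)).fetchone()
        return row[0] if row else None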

I have read many articles on filesystem vs blob storage. There are pros and cons either way. From what I've read, the performance difference is actually quite small.

As this service is for light to medium use, in my opinion the simplicity of managing just a database wins out over the advantages of a filesystem approach. Anybody wanting filesystem storage right now can use SRAS [1], a third-party project developed externally from opensim-core that provides this and other extra features, or roll their own, which would not be difficult for anybody moderately competent in PHP.

If you're running a large grid this is always going to entail extra work and co-ordination of components, just like running a large website.

[1] https://github.com/coyled/sras

On 09/03/12 04:06, Wade Schuette wrote:
Justin,

I have to respectfully agree with Cory.

Wouldn't something like the following address your valid concerns about complexity, while also reducing total load and perceived system response time for both filing and retrieving assets?

First, if you use event-driven processes, there's no reason to rescan the entire database, and by separating the processes into distinct streams they are decoupled, which is actually a good thing and simplifies both sides. There's no reason I can see they need to be coupled, and separating them allows them to be optimized and tested separately, which is a good thing.

In fact, the entire deduplication process could run overnight at a low-load time, which is even better, or have multiple "worker" processes assigned to it if it's taking too long. Seems very flexible.

I'm assuming that a hash-code isn't unique, but just specifies the bucket into which this item can be categorized.

When a new asset arrives, if the hash-code already exists, put the unique ID in a pipe, finish filing it, and move on. If the hash-code doesn't already exist, just file it and move on.

At the other end of the pipe, this wakes up a process that can, as time allows, check in the background to see whether not only the hash-code but the entire item is the same, and if so, change the handle to point to the existing copy. (For all I know, this can be done in one step if CRC codes are sufficiently unique, but computing such a code is CPU-intensive unless you can do it in hardware.)
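A rough Python sketch of both halves (the dicts and queue below are purely illustrative stand-ins for whatever tables and messaging a real service would actually use):

    import hashlib
    import queue

    store = {}           # asset_id -> raw data (stands in for the asset table)
    handles = {}          # asset_id -> asset_id whose stored copy is actually used
    seen_hashes = {}      # hash -> first asset_id filed with that hash
    suspect_queue = queue.Queue()   # the "pipe" to the background checker

    def file_new_asset(asset_id, data):
        # Filing always completes immediately; the hash only decides whether
        # the background worker needs to take a second look later.
        h = hashlib.sha256(data).hexdigest()
        store[asset_id] = data
        if h in seen_hashes:
            suspect_queue.put((asset_id, seen_hashes[h]))   # possible duplicate
        else:
            seen_hashes[h] = asset_id

    def dedup_worker():
        # Runs in the background (overnight, or as several worker processes),
        # draining the pipe as time allows.
        while True:
            new_id, original_id = suspect_queue.get()
            # The hash only says "same bucket"; confirm the bytes really match
            # before repointing the handle and dropping the redundant copy.
            if store.get(new_id) == store.get(original_id):
                handles[new_id] = original_id
                del store[new_id]
            suspect_queue.task_done()

    def fetch_asset(asset_id):
        # Readers follow the handle, so they never notice the de-duplication.
        return store[handles.get(asset_id, asset_id)]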

Of course, now the question arises of what happens when the original person DELETES the shared item. If you have solid database integrity, you only need to know how many pointers to it exist, and if someone deletes "their copy", you decrease the count by one; when the count gets to one, the next delete can actually delete the entry.
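The delete side could then be plain reference counting, sketched here in Python (again illustrative only; in a real database this would presumably be a ref_count column updated inside a transaction):

    ref_counts = {}   # hash -> number of asset IDs still pointing at that blob

    def add_reference(h):
        ref_counts[h] = ref_counts.get(h, 0) + 1

    def delete_asset(asset_id, asset_hashes, blobs):
        # Deleting "your copy" only removes your pointer; the shared blob is
        # removed when the last pointer to it goes away.
        h = asset_hashes.pop(asset_id)
        ref_counts[h] -= 1
        if ref_counts[h] == 0:
            del blobs[h]
            del ref_counts[h]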



Wade




On 3/8/12 7:41 PM, Justin Clark-Casey wrote:
On 08/03/12 22:00, Rory Slegtenhorst wrote:
@Justin
Can't we do the data de-duplication on a database level? E.g. find the duplicates and just get rid of them on a regular interval (cron)?

This would be enormously intricate. Not only would you have to keep rescanning the entire asset db, but it would add another moving part to an already complex system.





--
Justin Clark-Casey (justincc)
http://justincc.org/blog
http://twitter.com/justincc
_______________________________________________
Opensim-dev mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/opensim-dev
