Hi Donovan,
It sounds to me like the hash should be used as the key. I would
probably use a structure something like this:
class Document(db.Model):
    data = db.BlobProperty()  # Original 'raw' document data.
    # List of hash values for this document.
    hashes = db.StringListProperty(indexed=False)

class AddToIndex(db.Model):
    """Pending index work: set the parent to the Document and the
    key_name to the hash, so there is one entity per (document, hash)
    pair."""
    index = db.StringProperty(required=True)  # hash key

class Index(db.Model):
    """Set key_name to your hash value."""
    index = db.StringListProperty(required=True)
I am assuming the documents are small enough to fit into a
BlobProperty; if not, you can use the Blobstore to achieve a similar
result. You also don't say how many hashes there will typically be per
document, and the number of hashes per document matters for the
design. Are you likely to get multiple documents at one time that have
overlapping hashes? That might impact the design too. You also don't
tell us how you receive these documents, or exactly what you will be
doing with them (as in, how the data is presented).
So I'll outline one possible general process -- I'm sure it could be
significantly improved for your use case.
1) A document *somehow* appears for the system to process. Compute
the hash values of the content, and store the doc entity. The
AddToIndex entities are created as children of the Document so the
whole batch fits in one entity group (and hence one transaction), and
so two different hashes of the same document don't collide on
key_name:

    # If computing the hashes is a slow process, split it into more steps.
    hashes = compute_hashes(doc_data)

    def txn():
        doc = Document(data=doc_data, hashes=hashes)
        doc.put()
        db.put([AddToIndex(parent=doc, key_name=h, index=h)
                for h in hashes])
    db.run_in_transaction(txn)
2) In a processing task, find documents that need to be added to
indexes.

    items_to_add = AddToIndex.all().fetch(50)
    seen_indexes = set()
    for item in items_to_add:
        if item.index not in seen_indexes:
            insert_index_builder_tasks(item.index)
            seen_indexes.add(item.index)
Your index_builder_task should be inserted as a named task to prevent
fork bombs. You might use some modulus of the current time in
conjunction with a (memcache-based) count for the hash; that way, if
you do hit the same hash five minutes apart it can still be processed,
but if the same initial processing task runs three times it won't
fork.
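As a rough sketch of the naming scheme (the helper name and the
five-minute window are my own choices, not anything App Engine
prescribes), the task name can combine the hash, a time bucket, and
the memcache-based count:

    import time

    def index_builder_task_name(index_hash, attempt=0, window=300, now=None):
        """Build a task name that is stable within a `window`-second bucket.

        Duplicate inserts for the same (hash, attempt) inside one bucket
        produce the same name, so the task queue rejects the repeats; the
        same hash seen in a later bucket gets a fresh name and runs again.
        `now` is only a hook for testing; it defaults to the current time.
        """
        if now is None:
            now = time.time()
        bucket = int(now) // window
        return 'index-builder-%s-%d-%d' % (index_hash, bucket, attempt)

You would then pass the result as the name argument to taskqueue.add()
and catch TaskAlreadyExistsError (and TombstonedTaskError) to silently
drop the duplicates.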
3) The index builder only needs the hash to do its thing. Since the
AddToIndex entities are children of their documents, the document key
comes from the parent of each work item:

    index_hash = self.request.get('index_hash')
    work = AddToIndex.all(keys_only=True).filter('index', index_hash).fetch(200)
    doc_ids = [str(key.parent()) for key in work]

    def txn():
        index = Index.get_by_key_name(index_hash)
        if index is None:
            index = Index(key_name=index_hash, index=[])
        index.index = list(set(index.index + doc_ids))  # Add new docs, drop dupes
        index.put()
    db.run_in_transaction(txn)
    db.delete(work)
4) Given a document, finding other documents with similar content is
pretty easy, and only needs one _fetch_!

    document = Document.get(document_key)
    indexes = Index.get_by_key_name(document.hashes)
Now you can do whatever you need with the lists of matching
documents in the index entities!
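One detail worth noting: get_by_key_name() with a list of names
returns None for any hash that has no Index entity yet, so merging the
lists takes a small filtering step. A sketch (the helper name is
mine):

    def similar_document_ids(document_id, indexes):
        """Merge the `index` lists of the fetched Index entities.

        `indexes` is the list returned by Index.get_by_key_name(hashes);
        entries are None for hashes with no Index entity yet. The document
        is excluded from its own match set.
        """
        matches = set()
        for index in indexes:
            if index is not None:
                matches.update(index.index)
        matches.discard(document_id)
        return matches

If you want the matches ranked, a collections.Counter over the lists
instead of a set would give you how many hashes each candidate shares.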
I implemented something very similar to this process a couple of weeks
ago; so far it seems to be working quite well for me. It uses a lot
of tasks, but they are small and very fast.
Robert
On Sat, Dec 18, 2010 at 08:20, Donovan Hide <[email protected]> wrote:
> Hi,
> I have a custom index of a large amount of content that works by creating a
> 32 bit hash for sections of text. Each document id is stored against this
> hash and lookups involve hashing the input and retrieving the matching ids.
> Currently I use node.js to serve the index and hadoop to generate it.
> However this is an expensive operation in terms of processing and requires
> an SSD drive for decent serving performance. The scale of the index is as
> follows:
> Up to 4.5 billion keys
> An average of 8 document ids per key, delta-encoded and then variable
> integer encoded.
> Lookups on average involve retrieving values for 3500 keys
> Having read the datastore docs it seems like this could be a possible
> schema:
> from google.appengine.ext import db
>
> class Index(db.Model):
>     hash = db.IntegerProperty(required=True)
>     values = db.BlobProperty(required=True)
> I would be grateful if anyone could give me some advice or tips on how this
> might perform on AppEngine in terms of query performance, cost and
> minimizing metadata/index overhead. It sounds like 4.5 billion*metadata
> storage could be the killer.
> Cheers,
> Donovan
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>