On 01/24/2011 03:19 AM, Ben Chobot wrote:
On Jan 23, 2011, at 3:29 AM, Attila Nagy wrote:

Hello,

I'm looking for a database backend for a dictionary coder project. It would 
have three major tasks:
- take a text corpus, extract its words and substitute each word with a 64 bit
integer (the word:integer mapping is always constant) and store the result (encoding)
- take the previous result and substitute the integers with words (decoding)
- the words should be reference counted, so if a word can no longer be found in
any of the encoded messages, delete it (and optionally free its integer ID;
64 bits are believed to be enough for a long time, although smaller IDs
result in smaller encoded files). This could be achieved by informing the
database of the words of a deleted message, so it could decrement those
refcounts and delete the records if needed.
Just curious, is this your end goal, or would some existing text search engine
(tsearch2, Lucene, etc) fit the bill?
Yes, this isn't a text search project. Maybe data deduplication is closer. Of course, if a text search tool fits this, it's welcome. :)

I can easily do this with any RDBMS, with a table of three columns: an
auto-incremented ID, word and refcount, with a unique index on word.
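
Roughly the table I have in mind, as a sketch (the names and the refcount default are just illustrative):

  CREATE TABLE words (
      id       bigserial PRIMARY KEY,   -- the 64 bit auto-incremented ID
      word     text NOT NULL UNIQUE,    -- the dictionary entry
      refcount bigint NOT NULL DEFAULT 1
  );

  -- encoding: look up the ID for a word
  SELECT id FROM words WHERE word = 'boo';
  -- decoding: map IDs back to words
  SELECT word FROM words WHERE id = ANY (ARRAY[1,2,3]::bigint[]);
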
The challenge could be:
- that it should scale to several TBs in size and several (hundred) billion
records. One scenario would be to store about 40 TBs of words with an average
word length of about 50-60 bytes (that's about 800*10^9 records). It should
work well both for inserting and searching (encoding and decoding) words.
Yes, Postgres can do this, but you're probably going to want to look into 
partitioning your tables. For instance, instead of a single word table, you can 
make word tables that store 100M words each - or if 100M IDs per table doesn't
sound right, pick some other number. You can still access the table as if it's 
a single table, and under the covers PG will use constraint exclusion to know 
which table partition to use.
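
A minimal sketch with inheritance-based partitioning (boundaries and names are arbitrary; note that PG can't enforce a single unique constraint across the children, so each partition needs its own unique index on word):

  CREATE TABLE words_p0 (
      CHECK (id >= 0 AND id < 100000000)
  ) INHERITS (words);
  CREATE TABLE words_p1 (
      CHECK (id >= 100000000 AND id < 200000000)
  ) INHERITS (words);

  CREATE UNIQUE INDEX words_p0_word_idx ON words_p0 (word);
  CREATE UNIQUE INDEX words_p1_word_idx ON words_p1 (word);

  -- let the planner skip partitions whose CHECK constraint rules them out
  SET constraint_exclusion = partition;

Keep in mind constraint exclusion only prunes for queries that filter on id; a lookup by word still has to probe every partition's index unless the partition key is derived from the word itself (a hash range, say).
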
The ID space should be continuous and unique across the tables, so traditional table partitioning on a key should work, but partitioning in the sense of sharding seems to be harder...
- I need atomicity and durability, but having these at the word (record) level
takes too many IOPS and has no use, so it would be good to have an interface
for inserting about 1000-500000 words in one call, assigning a unique ID to each
unique word and storing them (if the word already has an ID, increment its
refcount), and giving back the IDs for each word. This could be committed as one
transaction, so the transactions could be big, sparing IOPS.
I must be missing something, because it's unclear why standard transactions
don't give you this? Except rolling back IDs in the case of an abort, of
course. Why do you need that?
Think of two documents, each arriving at the encoder at the same time. If the two documents both have the word "boo", it must be given the same ID. Getting a unique constraint error for some rows of a batch insert severely hits performance.
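
Within a single batch the duplicates can be folded in with set operations, roughly like this (a sketch only; the temp table and the per-occurrence refcounting are assumptions, and it does not solve two concurrent batches both inserting the same new word):

  BEGIN;
  CREATE TEMP TABLE batch (word text) ON COMMIT DROP;
  COPY batch FROM STDIN;   -- or a multi-row INSERT of the document's words

  -- bump the refcounts of the words we already know
  UPDATE words w
     SET refcount = w.refcount + b.cnt
    FROM (SELECT word, count(*) AS cnt FROM batch GROUP BY word) b
   WHERE w.word = b.word;

  -- add the words we have not seen before
  INSERT INTO words (word, refcount)
  SELECT b.word, b.cnt
    FROM (SELECT word, count(*) AS cnt FROM batch GROUP BY word) b
   WHERE NOT EXISTS (SELECT 1 FROM words w WHERE w.word = b.word);

  -- hand back an ID for every word occurrence, for the encoder
  SELECT w.id, b.word FROM batch b JOIN words w ON w.word = b.word;
  COMMIT;

If a second, concurrent batch inserts "boo" at the same time, one of the two INSERTs still dies with a unique constraint error, which is the concurrency problem below.
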

- I need concurrency, so when the above happens from two sources at the same 
time, the same word in the two transactions must get the same ID
You can't have this and the previous constraint, at least not with PG. I don't 
know of any system that will give you durable words but still detect when the 
same word is being inserted in parallel transactions, AND save you the IOPS of
word-by-word writes.

Unless, of course, you're ok with one transaction failing due to trying to add 
a word that wasn't there at the start? If you did something with a unique 
constraint on the word, you might be able to achieve this.
I agree, this seems hard to do concurrently in a database (maybe it could be done with a clever caching stored procedure)...
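
For the record, the usual shape of such a stored procedure is an update-then-insert loop that retries on unique_violation, something like this sketch (the function name and the words table layout are assumptions):

  CREATE OR REPLACE FUNCTION get_word_id(p_word text) RETURNS bigint AS $$
  DECLARE
      v_id bigint;
  BEGIN
      LOOP
          -- common case: the word already exists, bump its refcount
          UPDATE words SET refcount = refcount + 1
           WHERE word = p_word
          RETURNING id INTO v_id;
          IF FOUND THEN
              RETURN v_id;
          END IF;
          -- not there yet: try to insert it
          BEGIN
              INSERT INTO words (word, refcount)
              VALUES (p_word, 1)
              RETURNING id INTO v_id;
              RETURN v_id;
          EXCEPTION WHEN unique_violation THEN
              -- someone else inserted the same word first; loop and update it
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;

Under read committed this gives two concurrent encoders the same ID for the same word, but it works word by word, so there is per-word overhead even if the whole batch still commits (and fsyncs) as one transaction.
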
