I do realize it's kind of a weird use case, but it is legitimate. I have a 
collection of documents that I need to index, and I want to perform entity 
extraction on them and give the extracted entities special treatment in my 
full-text index. Because entity extraction costs money, and each document will 
end up being indexed multiple times, I want to cache the entities in Cassandra. The 
document text is the obvious key to retrieve entities from the cache. If I use 
the document ID, then I have to track timestamps. I know that sounds like a 
simple workaround, but I’m presenting a much-simplified view of my actual data 
model.

The reason for needing the text in the table, and not just a digest, is that 
sometimes entity extraction has to be deferred due to license limitations. In 
those cases, the entity extraction occurs on a background process, and the 
entities will be included in the index the next time the document is indexed.

I will use a digest as the key. I suspected that would be the answer, but it's 
good to get confirmation.
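
A minimal sketch of the digest-as-key approach, for illustration only. The 
table name entity_cache, its column names, and the choice of SHA-256 are 
assumptions, not details from the actual data model. The full text stays in a 
regular column so the background process can still read it, and a row whose 
entities are not yet available stands for a deferred extraction.

```python
import hashlib

# Hypothetical schema: the digest is the partition key; the full text is a
# regular column so a full table scan can still retrieve it.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS entity_cache (
    text_digest blob PRIMARY KEY,
    doc_text    text,
    entities    text
)
"""


def text_digest(doc_text: str) -> bytes:
    """Return the 32-byte SHA-256 digest of the UTF-8 encoded document text."""
    return hashlib.sha256(doc_text.encode("utf-8")).digest()


def cache_row(doc_text: str, entities=None) -> dict:
    """Build the column values for one cache row.

    entities is None when extraction has been deferred; the background
    process would fill it in later.
    """
    return {
        "text_digest": text_digest(doc_text),
        "doc_text": doc_text,
        "entities": entities,
    }
```

With this layout the key has a fixed size of 32 bytes regardless of how long 
the document text is.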

Robert

On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:

> Hi Robert,
> 
> why do you need the actual text as a key? It sounds a bit unnatural, at least 
> to me. Keep in mind that you cannot do "like" queries on keys in Cassandra. 
> For performance and readability, I would prefer hashing your text and using 
> the hash as the key.
> 
> You should also consider storing the keys (hashes) in a separate 
> table per day / hour or something like that, so you can quickly get all keys 
> for a time range. A query without the partition key may be very slow.
> 
> Jan
> 
> On 11.04.2016 at 23:43, Robert Wille wrote:
>> I have a need to be able to use the text of a document as the primary key in 
>> a table. These texts are usually less than 1 KB, but can sometimes be tens of 
>> KB in size. Would it be better to use a digest of the text as the key? I 
>> have a background process that will occasionally need to do a full table 
>> scan and retrieve all of the texts, so using the digest doesn’t eliminate 
>> the need to store the text. Anyway, is it better to keep primary keys small, 
>> or is C* okay with large primary keys?
>> 
>> Robert
>> 
> 
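
Jan's suggestion of keeping the keys in a separate table bucketed by day could 
be sketched roughly as follows. The table name entity_cache_keys_by_day, the 
column names, and the UTC day granularity are all assumptions made for 
illustration; an hourly bucket would work the same way with a finer format 
string.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: one partition per day, holding the digests
# written on that day.
CREATE_KEYS_BY_DAY_CQL = """
CREATE TABLE IF NOT EXISTS entity_cache_keys_by_day (
    day         text,
    text_digest blob,
    PRIMARY KEY (day, text_digest)
)
"""


def day_bucket(ts: datetime) -> str:
    """Return the UTC day bucket (YYYY-MM-DD) for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_between(start: datetime, end: datetime) -> list:
    """List every day bucket from start to end, inclusive.

    Each bucket is one partition key, so a time-range read becomes one
    query per bucket instead of a scan without a partition key.
    """
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets
```

A reader that needs all keys for a time range would then query each bucket 
returned by buckets_between in turn.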
