Re: Large primary keys

Robert Wille Thu, 14 Apr 2016 07:52:00 -0700

That would be a nice solution, but 3.4 is way too bleeding edge. I’ll just go 
with the digest for now. Thanks for pointing it out. I’ll have to consider a 
migration in the future when production is on 3.x.

On Apr 11, 2016, at 10:19 PM, Jack Krupansky 
<jack.krupan...@gmail.com> wrote:

Check out the text indexing support in the new SASI feature in Cassandra 3.4. 
You could write a custom tokenizer to extract entities and then query for 
documents that contain those entities.

That said, using a SHA digest key for the primary key has merit for direct 
access to the document given the document text.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman 
<ja...@carmanconsulting.com> wrote:
S3 maybe?

On Mon, Apr 11, 2016 at 7:05 PM Robert Wille 
<rwi...@fold3.com> wrote:
I do realize it's kind of a weird use case, but it is legitimate. I have a 
collection of documents that I need to index, and I want to perform entity 
extraction on them and give the extracted entities special treatment in my 
full-text index. Because entity extraction costs money, and each document will 
end up being indexed multiple times, I want to cache them in Cassandra. The 
document text is the obvious key to retrieve entities from the cache. If I use 
the document ID, then I have to track timestamps. I know that sounds like a 
simple workaround, but I’m presenting a much-simplified view of my actual data 
model.

The reason for needing the text in the table, and not just a digest, is that 
sometimes entity extraction has to be deferred due to license limitations. In 
those cases, the entity extraction occurs on a background process, and the 
entities will be included in the index the next time the document is indexed.

I will use a digest as the key. I suspected that would be the answer, but it's 
good to get confirmation.

Robert
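
The digest-as-key approach settled on above can be sketched as follows. This 
is only a minimal illustration, not the poster's actual code: SHA-256 is an 
assumed choice (the thread says only "SHA digest"), and the names digest_key, 
cache_row, text_digest and entities are hypothetical. The full text is kept 
next to the digest because the background process still needs to read it.

```python
import hashlib


def digest_key(text: str) -> str:
    """Return a fixed-size key (64 hex characters) for a document text.

    SHA-256 is an assumption; the thread only mentions a "SHA digest".
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_row(text: str, entities=None) -> dict:
    """Build a hypothetical cache row keyed by the digest.

    The text is stored as a regular column so a full scan can still
    retrieve it. entities=None marks a deferred extraction.
    """
    return {
        "text_digest": digest_key(text),  # small, fixed-size primary key
        "text": text,                     # original text, any size
        "entities": entities,             # filled in later if deferred
    }


if __name__ == "__main__":
    row = cache_row("x" * 50_000)  # a document tens of kilobytes long
    print(len(row["text_digest"]), row["entities"])
```

The key stays 64 characters long regardless of document size, so a 1 KB text 
and a 50 KB text cost the same to look up by key.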

On Apr 11, 2016, at 4:36 PM, Jan Kesten 
<j.kes...@enercast.de> wrote:

> Hi Robert,
>
> why do you need the actual text as a key? It sounds a bit unnatural, at least 
> to me. Keep in mind that you cannot do "like" queries on keys in Cassandra. 
> For performance and readability, I would prefer hashing your text and using 
> the hash as the key.
>
> You should also consider storing the keys (hashes) in a separate table, 
> partitioned per day / hour or something like that, so you can quickly get all 
> keys for a time range. A query without the partition key may be very slow.
>
> Jan
>
> Am 11.04.2016 um 23:43 schrieb Robert Wille:
>> I have a need to be able to use the text of a document as the primary key in 
>> a table. These texts are usually less than 1 KB, but can sometimes be tens 
>> of kilobytes in size. Would it be better to use a digest of the text as the key? I 
>> have a background process that will occasionally need to do a full table 
>> scan and retrieve all of the texts, so using the digest doesn’t eliminate 
>> the need to store the text. Anyway, is it better to keep primary keys small, 
>> or is C* okay with large primary keys?
>>
>> Robert
>>
>
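
Jan's suggestion of a separate key table bucketed by day can be sketched as 
follows. This is a rough illustration under stated assumptions: the table 
name keys_by_day, its columns, and the helper names are all hypothetical, and 
the bucket granularity (one UTC day) is just one of the options he mentions.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical CQL for the side table (names are assumptions, not from the thread):
#   CREATE TABLE keys_by_day (
#       day text, text_digest text,
#       PRIMARY KEY (day, text_digest)
#   );


def day_bucket(ts: datetime) -> str:
    """Map a timezone-aware timestamp to its UTC day, used as partition key."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_between(start: date, end: date) -> list:
    """List every day bucket from start to end, inclusive.

    Each bucket is one partition, so a time-range read becomes a handful
    of single-partition queries instead of a scan without a partition key.
    """
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


if __name__ == "__main__":
    print(day_bucket(datetime(2016, 4, 11, 23, 43, tzinfo=timezone.utc)))
    print(buckets_between(date(2016, 4, 11), date(2016, 4, 14)))
```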
