That would be a nice solution, but 3.4 is way too bleeding edge. I’ll just go with the digest for now. Thanks for pointing it out. I’ll have to consider a migration in the future when production is on 3.x.
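
For reference, a minimal sketch of the digest-as-key approach. It assumes SHA-256 (the thread only says "a SHA digest") and UTF-8 encoded text. The table and column names in the CQL string are hypothetical, not taken from my actual data model. The text is stored alongside the digest because deferred entity extraction still needs it.

```python
import hashlib


def document_key(text: str) -> bytes:
    """Return a fixed-size (32-byte) SHA-256 digest of the document text.

    The digest serves as the primary key, so the key stays small whether
    the document is under 1K or tens of K in size.
    """
    return hashlib.sha256(text.encode("utf-8")).digest()


# Hypothetical schema: digest as the key, full text kept in a regular column.
# `entities` stays null until (possibly deferred) extraction has run.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS entity_cache (
    text_digest blob PRIMARY KEY,
    body text,
    entities text
)
"""

if __name__ == "__main__":
    key = document_key("some document text")
    print(len(key), key.hex())
```

Identical text always maps to the same key, so a cache lookup needs only the document text and no timestamp tracking.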

On Apr 11, 2016, at 10:19 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

Check out the text indexing support in the new SASI feature in Cassandra 3.4. You could write a custom tokenizer to extract entities and then be able to query for documents that contain those entities.

That said, using a SHA digest as the primary key has merit for direct access to the document given the document text.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman <ja...@carmanconsulting.com> wrote:

S3 maybe?

On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <rwi...@fold3.com> wrote:

I do realize it's kind of a weird use case, but it is legitimate. I have a collection of documents that I need to index, and I want to perform entity extraction on them and give the extracted entities special treatment in my full-text index. Because entity extraction costs money, and each document will end up being indexed multiple times, I want to cache the entities in Cassandra. The document text is the obvious key to retrieve entities from the cache. If I use the document ID, then I have to track timestamps. I know that sounds like a simple workaround, but I'm presenting a much-simplified view of my actual data model.

The reason for needing the text in the table, and not just a digest, is that sometimes entity extraction has to be deferred due to license limitations. In those cases, the entity extraction occurs in a background process, and the entities will be included in the index the next time the document is indexed.

I will use a digest as the key. I suspected that would be the answer, but it's good to get confirmation.

Robert

On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:

> Hi Robert,
>
> why do you need the actual text as a key? It sounds a bit unnatural, at least
> to me. Keep in mind that you cannot do "like" queries on keys in Cassandra.
> For performance and to keep things more readable, I would prefer hashing your
> text and using the hash as the key.
>
> You should also consider storing the keys (hashes) in a separate
> table per day / hour or something like that, so you can quickly get all keys
> for a time range. A query without the partition key may be very slow.
>
> Jan
>
> On 11.04.2016 at 23:43, Robert Wille wrote:
>> I have a need to be able to use the text of a document as the primary key in
>> a table. These texts are usually less than 1K, but can sometimes be tens of
>> K in size. Would it be better to use a digest of the text as the key? I
>> have a background process that will occasionally need to do a full table
>> scan and retrieve all of the texts, so using the digest doesn't eliminate
>> the need to store the text. Anyway, is it better to keep primary keys small,
>> or is C* okay with large primary keys?
>>
>> Robert
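
Jan's suggestion of keeping the hashes in a per-day table could be sketched roughly as below. This is only an illustration under assumptions: day-sized buckets in UTC, timezone-aware timestamps, and a hypothetical lookup table named `digests_by_day`. None of these details come from the thread.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical lookup table: one partition per day, holding that day's digests.
# CREATE TABLE digests_by_day (
#     day text,
#     text_digest blob,
#     PRIMARY KEY (day, text_digest)
# )


def day_bucket(ts: datetime) -> str:
    """Return the partition key (UTC date, YYYY-MM-DD) for an aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_in_range(start: datetime, end: datetime) -> List[str]:
    """List every day bucket touched by the inclusive range [start, end].

    Each returned value can be used as the partition key in a separate
    query, so no query has to run without a partition key.
    """
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    days = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(days + 1)]
```

A background job would write `(day_bucket(now), digest)` whenever a document is cached, then read one partition per entry of `buckets_in_range(...)` to collect the keys for a time window.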