Got it. I'm working on making term vectors optional and just storing frequency in that case. Just FYI.
On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen <tobias.jun...@gmail.com> wrote:

> Without going into too much depth: our retrieval model is a bit more
> structured than standard Lucene retrieval, and I'm trying to leverage that
> structure. Some of the terms we're going to retrieve against have high
> occurrence, and because of that I'm worried about getting killed by
> processing large term vectors. Instead I'm trying to index on term
> relationships, if that makes sense.
>
>
> On Sat, May 8, 2010 at 12:09 AM, Jake Luciani <jak...@gmail.com> wrote:
>
>> Any reason why you aren't using Lucandra directly?
>>
>>
>> On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jun...@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> Started getting my feet wet with Cassandra in earnest this week. I'm
>>> building a custom inverted index of sorts on top of Cassandra, in part
>>> inspired by Jake Luciani's work on Lucandra. I've successfully loaded
>>> nearly a million documents over a 3-node cluster, and initial query tests
>>> look promising.
>>>
>>> The problem is that our target use case has hundreds of millions of
>>> documents (each document is very small, however), so loading time will be
>>> an important factor. I've investigated using the BinaryMemtable interface
>>> (as found in contrib/bmt_example) to speed up bulk insertion. I have a
>>> prototype up that successfully inserts data using BMT, but there is a
>>> problem.
>>>
>>> If I perform multiple writes for the same row key & column family, the
>>> row ends up containing only one of the writes. I'm guessing this is
>>> because with BMT I need to group all writes for a given row key & column
>>> family into one operation, rather than performing them incrementally as
>>> is possible with the Thrift interface. Hadoop is the obvious tool for
>>> such a grouping. Unfortunately, we can't run such a job over our entire
>>> dataset at once; we will need to do it in increments.
>>>
>>> So my question is: if I properly flush every node after performing a
>>> large bulk insert, can Cassandra merge multiple writes to a single row &
>>> column family when using the BMT interface? Or is BMT only feasible for
>>> loading rows that don't exist yet?
>>>
>>> Thanks in advance,
>>> Toby Jungen
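
For what it's worth, here is a minimal sketch of the grouping Toby describes: accumulate every column write destined for the same row key, then emit exactly one binary mutation per row. The submitBinaryMutation helper below is hypothetical, standing in for the real BMT plumbing in contrib/bmt_example (which builds a RowMutation, serializes it, and sends it to each natural endpoint for the key); only the buffering logic is shown.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: since BMT overwrites rather than merges, all columns for
    // a given row key must arrive in a single mutation. Buffer them first.
    public class BmtRowGrouper {
        // rowKey -> list of (columnName, value) pairs, pending until flush()
        private final Map<String, List<String[]>> pending =
                new HashMap<String, List<String[]>>();

        public void add(String rowKey, String columnName, String value) {
            List<String[]> cols = pending.get(rowKey);
            if (cols == null) {
                cols = new ArrayList<String[]>();
                pending.put(rowKey, cols);
            }
            cols.add(new String[] { columnName, value });
        }

        // Emit exactly one binary mutation per row key, then clear the buffer.
        public void flush() {
            for (Map.Entry<String, List<String[]>> e : pending.entrySet()) {
                submitBinaryMutation(e.getKey(), e.getValue());
            }
            pending.clear();
        }

        // Hypothetical placeholder, not a real Cassandra API: a real job
        // would serialize a RowMutation here as in contrib/bmt_example.
        private void submitBinaryMutation(String rowKey, List<String[]> columns) {
            // send one mutation carrying all of this row's columns
        }
    }

Note this only guards against splitting a row within one loading pass; it does not answer whether Cassandra will merge BMT-written rows across separate passes after a flush, which is the open question above.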