Thanks for the tips, things seem happier now. Yeah, the size of each document (number of tokens) is actually quite small in my case - I think this was just a case of me messing up the flush/optimize/close tactics.
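For anyone else who hits this, the pattern I should have been using looks roughly like the sketch below. This is only a minimal sketch against Ferret's Ferret::Index::Index API as I understand it; the path, the field names, and the docs collection are placeholders, not anything from a real setup.

    require 'rubygems'
    require 'ferret'

    # Hypothetical path -- substitute your own.
    index = Ferret::Index::Index.new(:path => '/tmp/my_index')

    # Stand-in for your real records.
    docs = [{:id => 1, :content => 'some text'}]

    docs.each_with_index do |doc, i|
      index << {:id => doc[:id], :content => doc[:content]}
      # Optimize periodically (every 100,000 docs, per the advice below)
      # to keep segment count and size down; optimize also flushes
      # pending changes to disk.
      index.optimize if (i + 1) % 100_000 == 0
    end

    index.flush  # make the final partial batch visible to other readers
    index.close  # releases the ferret-write.lck write lock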
On 10/10/06, peter <[EMAIL PROTECTED]> wrote:
> We've had somewhat of a similar situation ourselves, where we are indexing
> about a million records to an index, and each record can be somewhat large.
>
> Now, what happened on our side was that the index files (very similar in
> structure to what you have below) came up to a 2 gig limit and stopped
> there, and the indexer started crashing each time it hit this limit.
>
> On your side, I don't see your index file sizes being really that large. I
> think compiling with large file support only really kicks in once you hit
> that 2 gig size limit.
>
> A couple of thoughts that might help:
>
> 1. On our side, to keep size down, I would optimize the index every
> 100,000 documents. The optimize call also flushes the index.
>
> 2. Make sure you close the index once you finish indexing your data.
> Small thing, but just making sure.
>
> 3. With the index being this large, we actually keep two copies: one for
> searching against an already-optimized index, and the other for doing the
> indexing. This way, no items are being searched on while the indexing is
> taking place.
>
> 4. One neat thing I learned while indexing large items is that I don't
> actually have to store everything. I can set a field to tokenize but not
> store, so that it can be searched but won't be displayed in the search
> results per se. Since I don't actually store it, I was able to keep my
> index size down.
>
> > From: "Ben Lee" <[EMAIL PROTECTED]>
> > Reply-To: [email protected]
> > Date: Tue, 10 Oct 2006 18:35:35 -0700
> > To: [email protected]
> > Subject: [Ferret-talk] Indexing problem 10.9/10.10
> >
> > Sorry if this is a repost - I wasn't sure if the www.ruby-forum.com
> > list works for postings.
> > I've been having trouble with indexing a large number of documents (2.4M).
> >
> > Essentially, I have one process that is following the tutorial,
> > dumping documents to an index stored on the file system. If I open the
> > index with another process and run the size() method, it is stuck at a
> > number of documents much smaller than the number I've added to the index.
> >
> > E.g. 290k, when the indexer process has already gone through 1M.
> >
> > Additionally, if I search, I don't get results past an even smaller
> > number of docs (22k). I've tried the two latest Ferret releases.
> >
> > Does this listing of the index directory look right?
> >
> > -rw------- 1 blee blee 3.8M Oct 10 17:06 _v.fdt
> > -rw------- 1 blee blee  51K Oct 10 17:06 _v.fdx
> > -rw------- 1 blee blee  12M Oct 10 16:49 _u.cfs
> > -rw------- 1 blee blee   97 Oct 10 16:49 fields
> > -rw------- 1 blee blee   78 Oct 10 16:49 segments
> > -rw------- 1 blee blee  11M Oct 10 16:23 _t.cfs
> > -rw------- 1 blee blee  11M Oct 10 15:56 _s.cfs
> > -rw------- 1 blee blee  15M Oct 10 15:11 _r.cfs
> > -rw------- 1 blee blee  13M Oct 10 14:48 _q.cfs
> > -rw------- 1 blee blee  14M Oct 10 14:37 _p.cfs
> > -rw------- 1 blee blee  13M Oct 10 14:28 _o.cfs
> > -rw------- 1 blee blee  12M Oct 10 14:19 _n.cfs
> > -rw------- 1 blee blee  12M Oct 10 14:16 _m.cfs
> > -rw------- 1 blee blee 118M Oct 10 14:10 _l.cfs
> > -rw------- 1 blee blee 129M Oct 10 13:24 _a.cfs
> > -rw------- 1 blee blee    0 Oct 10 13:00 ferret-write.lck
> >
> > Thanks,
> > Ben
> > _______________________________________________
> > Ferret-talk mailing list
> > [email protected]
> > http://rubyforge.org/mailman/listinfo/ferret-talk
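For the archives, peter's tip 4 (tokenize but don't store) would look something like the sketch below. Again, just a rough sketch assuming Ferret's FieldInfos API; the field names and query are made up, and passing :field_infos to Index.new only takes effect when creating a fresh index.

    require 'rubygems'
    require 'ferret'

    # Declare :content as indexed (tokenized, searchable) but not stored,
    # so it doesn't bloat the .fdt/.fdx stored-field files in the listing
    # above.
    field_infos = Ferret::Index::FieldInfos.new
    field_infos.add_field(:title,   :store => :yes, :index => :yes)
    field_infos.add_field(:content, :store => :no,  :index => :yes)

    index = Ferret::Index::Index.new(:path => '/tmp/my_index',
                                     :field_infos => field_infos)
    index << {:title => 'hello', :content => 'a large body of text'}

    index.search_each('content:text') do |doc_id, score|
      puts index[doc_id][:title]    # stored, comes back in results
      p    index[doc_id][:content]  # nil -- searchable but not stored
    end
    index.close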

