- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name: Torvas, A
Subject: Advised Settings for 26 Million+ Documents
Hi,

I am trying to index information about a city (about 120M documents by first
estimate) using dpsearch and stumbled upon this article. Do you have a running
site with the 26M documents? Can you tell me the average speed for extracting
results for a keyword set that yields about a million results?

Regards,
A. Torvas

> At 02:34:28 07/02/07, Jon wrote:
> I have a system setup that has been able to index the following, but has now
> hit a wall:
>
>           Database statistics
>
>   Status    Expired      Total
>   -----------------------------
>      0      1480579    1480579  Not indexed yet
>    152            3          3  Unknown status
>    200       948284    1276984  OK
>    206          172        185  Partial OK
>    301         1447       1454  Moved Permanently
>    302       210421     701600  Moved Temporarily
>    304            0       1226  Not Modified
>    400            1          1  Bad Request
>    403        19987      20703  Forbidden
>    404            0      16839  Not found
>    415        17010      19633  Unsupported Media Type
>    500            1          1  Internal Server Error
>    503            3          3  Service Unavailable
>    504          212        212  Gateway Timeout
>   2200       243564     273431  Clones, OK
>   2206          871        938  Clones, Partial OK
>   -----------------------------
>   Total     2922555    3793792
>
> The system starts indexing (gets about 100 documents indexed) and then cached
> sits at 100% CPU doing nothing. lsof shows that it has a data file open but
> is doing little, or rather nothing, with the file and gets stuck on one file.
>
> Cpu(s): 29.3% us, 70.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
>
> cached 27965 cache 11uW REG 253,0  1232336 24086460 /usr/local/dpsearchPG/var/url/info0009.i
> cached 27965 cache 12u  REG 253,0 18097180 24086374 /usr/local/dpsearchPG/var/url/info0009.s
>
> These files are not exactly very large:
>
> 1.2M /usr/local/dpsearchPG/var/url/info0009.i
> 18M  /usr/local/dpsearchPG/var/url/info0009.s
>
> I've let cached run like this for days. Still nothing.
>
> I am using cached mode with searchd and search.cgi.
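[A quick aside on the symptom above: 100% CPU with most of it in system time
usually means the process is still making system calls, while 100% user time
with no file activity points to a userspace loop. A minimal sketch of how to
tell the two apart on Linux; the PID 27965 comes from the lsof output above,
and the availability of strace is an assumption.]

```shell
# PID of the apparently stuck cached process (27965 is taken from the
# lsof output above; substitute your own).
PID=${PID:-27965}

# Scheduler state from /proc: R = runnable (looping on the CPU),
# D = uninterruptible disk wait, S = sleeping.
if [ -r "/proc/$PID/stat" ]; then
    awk '{print "state:", $3}' "/proc/$PID/stat"
else
    echo "no such process: $PID"
fi

# If the state is R, attach strace for a few seconds; a process stuck in
# a tight userspace loop makes few or no system calls:
#   strace -c -p "$PID"    # stop with Ctrl-C to see the syscall summary
```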
> My cached has the following settings:
>
> WrdFiles 256
> CacheLogWords 1024
> CacheLogDels 1024
> URLDataFiles 256
> OptimizeAtUpdate no
>
> I wonder what these *should* be to support 26 million documents. What should
> my strategy be to index this many documents? What switches should be used?
> I'd like to just be able to run indexer all the time on multiple nodes and
> keep updating the data searchd uses in stages. I am running cached in
> log-only mode and will at times issue a write to the data files.

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Read the full topic here:
http://www.dataparksearch.org/cgi-bin/simpleforum.cgi?fid=02&topic_id=1170804868&page=1
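[Editor's note: the directives quoted in the thread are real cached
configuration directives, and scaling them is largely a matter of spreading
the word and URL data over more files (so each stays small) and enlarging the
in-memory logs (so flushes are fewer and larger). The fragment below is a
hedged sketch only: the directive names come from the thread, but the values
are illustrative guesses, not tested recommendations — verify against the
DataparkSearch documentation for your version.]

```
# Hypothetical cached settings for an index in the tens of millions of
# documents. Values are assumptions for illustration only.

# More data files keep each word/URL file smaller, at the cost of more
# open file handles (check your fd limits).
WrdFiles        0x300
URLDataFiles    0x300

# Larger in-memory logs mean fewer, larger flushes to the data files.
CacheLogWords   8192
CacheLogDels    8192

# Keep optimization out of the update path; trigger it explicitly
# during quiet periods instead.
OptimizeAtUpdate no
```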
