- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name: Torvas, A
Subject: Advised Settings for 26 Million+ Documents
Hi,

I am trying to index information about a city (about 120M documents by first
estimate) using dpsearch and stumbled upon this article. Do you have a running
site with the 26M documents? Can you tell me the average speed for extracting
results for a keyword set that yields about a million results?

Regards,
A. Torvas

> At 02:34:28 07/02/07, Jon wrote:
> I have a system setup that has been able to index the following, but has now
> hit a wall:
>
>           Database statistics
>
>   Status    Expired      Total
>   -----------------------------
>      0      1480579    1480579  Not indexed yet
>    152            3          3  Unknown status
>    200       948284    1276984  OK
>    206          172        185  Partial OK
>    301         1447       1454  Moved Permanently
>    302       210421     701600  Moved Temporarily
>    304            0       1226  Not Modified
>    400            1          1  Bad Request
>    403        19987      20703  Forbidden
>    404            0      16839  Not found
>    415        17010      19633  Unsupported Media Type
>    500            1          1  Internal Server Error
>    503            3          3  Service Unavailable
>    504          212        212  Gateway Timeout
>   2200       243564     273431  Clones, OK
>   2206          871        938  Clones, Partial OK
>   -----------------------------
>   Total     2922555    3793792
>
> The system starts indexing (gets about 100 documents indexed) and then cached
> sits at 100% CPU doing nothing. lsof shows that it has a data file open but
> is doing little, or rather nothing, with the file and gets stuck on one file.
>
> Cpu(s): 29.3% us, 70.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
>
> cached 27965 cache 11uW REG 253,0  1232336 24086460 /usr/local/dpsearchPG/var/url/info0009.i
> cached 27965 cache 12u  REG 253,0 18097180 24086374 /usr/local/dpsearchPG/var/url/info0009.s
>
> These files are not exactly very large:
>
> 1.2M /usr/local/dpsearchPG/var/url/info0009.i
> 18M  /usr/local/dpsearchPG/var/url/info0009.s
>
> I've let cached run like this for days. Still nothing.
>
> I am using cached mode with searchd and search.cgi.
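[A quick aside on the symptom above: 100% CPU with most of it in system time
usually means the process is still making system calls, while 100% user time
with no file activity points to a userspace loop. A minimal sketch of how to
tell the two apart on Linux; the PID 27965 comes from the lsof output above,
and the availability of strace is an assumption.]

```shell
# PID of the apparently stuck cached process (27965 is taken from the
# lsof output above; substitute your own).
PID=${PID:-27965}

# Scheduler state from /proc: R = runnable (looping on the CPU),
# D = uninterruptible disk wait, S = sleeping.
if [ -r "/proc/$PID/stat" ]; then
    awk '{print "state:", $3}' "/proc/$PID/stat"
else
    echo "no such process: $PID"
fi

# If the state is R, attach strace for a few seconds; a process stuck in
# a tight userspace loop makes few or no system calls:
#   strace -c -p "$PID"    # stop with Ctrl-C to see the syscall summary
```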
> My cached has the following settings:
>
> WrdFiles 256
> CacheLogWords 1024
> CacheLogDels 1024
> URLDataFiles 256
> OptimizeAtUpdate no
>
> I wonder what these *should* be to support 26 million documents. What should
> my strategy be to index this many documents? What switches should be used?
> I'd like to just be able to run indexer all the time on multiple nodes and
> keep updating the data searchd uses in stages. I am running cached in
> log-only mode and will at times issue a write to the data files.

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Read the full topic here:
http://www.dataparksearch.org/cgi-bin/simpleforum.cgi?fid=02&topic_id=1170804868&page=1
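[Editor's note: the directives quoted in the thread are real cached
configuration directives, and scaling them is largely a matter of spreading
the word and URL data over more files (so each stays small) and enlarging the
in-memory logs (so flushes are fewer and larger). The fragment below is a
hedged sketch only: the directive names come from the thread, but the values
are illustrative guesses, not tested recommendations — verify against the
DataparkSearch documentation for your version.]

```
# Hypothetical cached settings for an index in the tens of millions of
# documents. Values are assumptions for illustration only.

# More data files keep each word/URL file smaller, at the cost of more
# open file handles (check your fd limits).
WrdFiles        0x300
URLDataFiles    0x300

# Larger in-memory logs mean fewer, larger flushes to the data files.
CacheLogWords   8192
CacheLogDels    8192

# Keep optimization out of the update path; trigger it explicitly
# during quiet periods instead.
OptimizeAtUpdate no
```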
