aseek-devel  

Re: [aseek-devel] How to index external list of URLs?

Kir Kolyshkin
Thu, 19 Sep 2002 01:46:49 -0700

ASPseek's theoretical limit is about 50 million pages (URLs).
How many URLs do you have?

Kord Campbell wrote:
> How fast does it return a search result?  We had managed to index
> about a million sites about a year and a half ago, and the search
> times were horrible.
> 
> Oh, BTW, we do a fair bit of crawling the Internet ourselves. I've
> always envisioned that aspseek could have a plugin to take data
> from us, but we figured that it couldn't handle the millions of
> URLs that we were crawling every day.
> 
> Kord
> 
> On Wed, 18 Sep 2002, Yuriy Soroka wrote:
> 
> 
>>Yes,
>>
>>I have indexed 255,179 URLs.
>>I was indexing in batches of 20,000 to 40,000 URLs.
>>
>>var dir size: 1.5 GB
>>I can't say for certain what the size of the MySQL database is.
>>
>>Hardware: 2 CPUs at 1.1 GHz each, about 1.5 GB of RAM
>>OS: FreeBSD 4.5 release p6
>>
>>No special kernel/MySQL tuning was done.
>>
>>
>>
>>
>>----- Original Message -----
>>From: "Gregory Kozlovsky" <[EMAIL PROTECTED]>
>>To: <[EMAIL PROTECTED]>
>>Sent: Wednesday, September 18, 2002 7:05 PM
>>Subject: RE: [aseek-devel] How to index external list of URLs?
>>
>>
>>
>>>This is interesting. Can you share with us the size of your database (in
>>>docs and in GB),
>>>details of your hardware, and tuning of the Linux kernel and the mysql
>>>server?
>>>
>>>     Gregory Kozlovsky
>>>
>>>-----Original Message-----
>>>From: Yuriy Soroka [mailto:[EMAIL PROTECTED]]
>>>Sent: Wednesday, 18 September 2002 02:43
>>>To: [EMAIL PROTECTED]
>>>Subject: Re: [aseek-devel] How to index external list of URLs?
>>>
>>>
>>>Why don't you just include them in aspseek.conf?
>>>
>>>I indexed 250,000 URLs.
>>>
>>>Include myfile.txt
>>>
>>>
>>>----- Original Message -----
>>>From: "J and T" <[EMAIL PROTECTED]>
>>>To: <[EMAIL PROTECTED]>
>>>Sent: Wednesday, September 18, 2002 3:10 AM
>>>Subject: [aseek-devel] How to index external list of URLs?
>>>
>>>
>>>
>>>>How in the world do you index a list of URLs NOT in the aspseek.conf? I have
>>>>tried everything I can think of:
>>>>
>>>>./index -i -f myfile.txt
>>>>./index -N 100
>>>>
>>>>Doesn't work. The myfile.txt lists 5,000 URLs like this:
>>>>
>>>>Server http://someserver.com/
>>>>
>>>>But when I run the above (ie, ./index -i -f myfile.txt)
>>>>
>>>>I get the following error:
>>>>
>>>>Bad URL: Server http://someserver.com/
>>>>
>>>>So I removed the "Server " so now it reads:
>>>>
>>>>http://someserver.com/
>>>>
>>>>Did the same thing:
>>>>
>>>>./index -i -f myfile.txt
>>>>
>>>>Now it shows them in the database:
>>>>
>>>>./index -S
>>>>
>>>>ASPseek database statistics
>>>>
>>>>    Status    Expired      Total
>>>>   -----------------------------
>>>>         0       5000       5000 Not indexed yet
>>>>   -----------------------------
>>>>     Total       5000       5000
>>>>
>>>>So now I try to run the indexer:
>>>>
>>>>./index -N 100
>>>>
>>>>And now the indexer gives the same damn error:
>>>>
>>>>No "Server" command for URL http://www.someserver.com/ - deleted.
>>>>( 0  1  1  0  0  0  0 21) Adding URL: http://www.someserver.com/
>>>>
>>>>So all it did was delete all these URLs. I have tried every other
>>>>combination I can think of after reviewing the ./index -h, but nothing seems
>>>>to work. How in the world do you get these indexed using an external file?
>>>>
>>>>Also, before, when I hard-coded all URLs in aspseek.conf, there were about 200
>>>>URLs which were always shown as "Not indexed yet". How in the heck do you get
>>>>them indexed, or delete the damn things?
>>>>
>>>>It doesn't make sense to have to add thousands of URLs in the aspseek.conf
>>>>file every time you want to add new URLs to the list. You certainly don't
>>>>want to set the system to reindex everything, especially if you just added
>>>>5,000 URLs the day before. That would use unnecessary bandwidth, to say the
>>>>least.
>>>>
>>>>Anyone have any suggestions?
>>>>
>>>>end.
>>>>
>>>>
>>>>
>>>
> 
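
Judging only from the messages quoted above, "index -i -f" wants bare URLs,
while aspseek.conf wants "Server <URL>" lines, and inserted URLs with no
matching Server command are deleted at index time. A minimal sketch of a
helper that turns a plain URL list into a file of Server lines, suitable for
the "Include myfile.txt" approach Yuriy describes, might look like this. The
function names, the skipping of "#" lines in the input, and the
de-duplication are assumptions made for illustration; none of this is part
of ASPseek.

```python
def to_server_lines(raw_lines):
    """Return 'Server <url>' lines for each distinct URL in raw_lines.

    Input lines may be bare URLs or may already start with 'Server '.
    Blank lines and lines starting with '#' are skipped (an assumption
    about the input list, not an ASPseek rule).
    """
    seen = set()
    out = []
    for line in raw_lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        # Accept lines that already carry the "Server" keyword.
        if text.lower().startswith("server "):
            text = text[len("server "):].strip()
        if text in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(text)
        out.append("Server " + text)
    return out


def write_include_file(src_path, dst_path):
    """Read URLs from src_path and write Server lines to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        lines = to_server_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    sample = [
        "http://someserver.com/",
        "",
        "Server http://someserver.com/",
        "http://www.someserver.com/",
    ]
    for server_line in to_server_lines(sample):
        print(server_line)
```

The generated file could then be pulled in with an Include line in
aspseek.conf, as suggested earlier in the thread, so that new URLs are added
by regenerating one file rather than by editing the main config by hand.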


-- 
-- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
    Guinness a Day Keeps a Doctor Away (people's wisdom)