aseek-devel  

Re: [aseek-devel] How to index external list of URLs?

Kir Kolyshkin
Thu, 19 Sep 2002 01:50:08 -0700

 > no special kernel/mysql tuning was done.

Well, you haven't even tried increasing MySQL's key_buffer size
(which is described in ASPseek's FAQ), yet you are already looking at
rewriting the code. That seems a weird approach to me.

Note that ASPseek does not store everything in the SQL DB. Data that is
crucial to search speed is stored in its own binary files.

Yuriy Soroka wrote:
> It depends on the number of search words in the query.
> Normally a 2-3 word query returns within a fraction of a second.
> A complicated query takes about 1 second, maybe a little more.
> 
> Anyway, I am not satisfied with the performance either, and I am
> interested in replacing the RDBMS with fast native filesystem storage.
> 
> Where is the bottleneck: the MySQL database or the indices?
> To me it seems to be the DBMS. MySQL gets too slow when you have
> a couple of million records in a table.
> 
> I was thinking of using the Berkeley DB library instead of MySQL. For now
> these are just thoughts.
> If anyone can share their experience in this area, please do.
> I will be glad to hear your suggestions.
> 
> Yuriy
> 
> ----- Original Message -----
> From: "Kord Campbell" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 18, 2002 9:10 PM
> Subject: Re: [aseek-devel] How to index external list of URLs?
> 
> 
> 
>>How fast does it return a search result?  We had managed to index
>>about a million sites about a year and a half ago, and the search
>>times were horrible.
>>
>>Oh, BTW, we do a fair bit of crawling the Internet ourselves. I've
>>always envisioned that aspseek could have a plugin to take data
>>from us, but we figured that it couldn't handle the millions of
>>URLs that we were crawling every day.
>>
>>Kord
>>
>>On Wed, 18 Sep 2002, Yuriy Soroka wrote:
>>
>>
>>>Yes,
>>>
>>>I have indexed 255,179 URLs.
>>>I was indexing 20,000 - 40,000 URLs at a time.
>>>
>>>var dir size: 1.5 GB
>>>I can't say for certain what the size of the MySQL database is.
>>>
>>>Hardware: 2 CPUs at 1.1 GHz each, about 1.5 GB of RAM
>>>OS: FreeBSD 4.5 release p6
>>>
>>>no special kernel/mysql tuning was done.
>>>
>>>
>>>
>>>
>>>----- Original Message -----
>>>From: "Gregory Kozlovsky" <[EMAIL PROTECTED]>
>>>To: <[EMAIL PROTECTED]>
>>>Sent: Wednesday, September 18, 2002 7:05 PM
>>>Subject: RE: [aseek-devel] How to index external list of URLs?
>>>
>>>
>>>
>>>>This is interesting. Can you share with us the size of your database
>>>>(in docs and in GB), details of your hardware, and tuning of the Linux
>>>>kernel and the mysql server?
>>>>
>>>>     Gregory Kozlovsky
>>>>
>>>>-----Original Message-----
>>>>From: Yuriy Soroka [mailto:[EMAIL PROTECTED]]
>>>>Sent: Wednesday, 18 September 2002 02:43
>>>>To: [EMAIL PROTECTED]
>>>>Subject: Re: [aseek-devel] How to index external list of URLs?
>>>>
>>>>
>>>>Why don't you just include them in aspseek.conf?
>>>>
>>>>I indexed 250,000 URLs.
>>>>
>>>>Include myfile.txt
>>>>
>>>>
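
A minimal sketch of the Include approach suggested above, assuming the list starts out as bare URLs, one per line. The error messages further down the thread show that the indexer wants a "Server" command for each URL, so this helper rewrites each URL as a "Server" line for a file that aspseek.conf pulls in with Include. The function names to_server_lines and write_include_file are made up for illustration and are not part of ASPseek.

```python
def to_server_lines(urls):
    """Turn bare URLs into 'Server <url>' lines, skipping blanks and duplicates.

    Lines that already start with 'Server ' are normalised rather than doubled.
    """
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # ignore empty lines
        if url.lower().startswith("server "):
            url = url[len("server "):].strip()  # already a Server line
        if url in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(url)
        lines.append(f"Server {url}")
    return lines


def write_include_file(src_path, dst_path):
    """Read bare URLs from src_path and write Server lines to dst_path.

    Returns the number of Server lines written.
    """
    with open(src_path, encoding="utf-8") as src:
        lines = to_server_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in lines:
            dst.write(line + "\n")
    return len(lines)
```

With myfile.txt written this way, the remaining steps are the ones already quoted in this thread: put "Include myfile.txt" in aspseek.conf and run ./index as usual. The file names here are only the ones used in the thread, and the sketch has not been run against an ASPseek installation.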
>>>>----- Original Message -----
>>>>From: "J and T" <[EMAIL PROTECTED]>
>>>>To: <[EMAIL PROTECTED]>
>>>>Sent: Wednesday, September 18, 2002 3:10 AM
>>>>Subject: [aseek-devel] How to index external list of URLs?
>>>>
>>>>
>>>>
>>>>>How in the world do you index a list of URLs NOT in the aspseek.conf? I
>>>>>have tried everything I can think of:
>>>>>
>>>>>./index -i -f myfile.txt
>>>>>./index -N 100
>>>>>
>>>>>Doesn't work. The myfile.txt lists 5,000 URLs like this:
>>>>>
>>>>>Server http://someserver.com/
>>>>>
>>>>>But when I run the above (ie, ./index -i -f myfile.txt)
>>>>>
>>>>>I get the following error:
>>>>>
>>>>>Bad URL: Server http://someserver.com/
>>>>>
>>>>>So I removed the "Server " so now it reads:
>>>>>
>>>>>http://someserver.com/
>>>>>
>>>>>Did the same thing:
>>>>>
>>>>>./index -i -f myfile.txt
>>>>>
>>>>>Now it shows them in the database:
>>>>>
>>>>>./index -S
>>>>>
>>>>>ASPseek database statistics
>>>>>
>>>>>    Status    Expired      Total
>>>>>   -----------------------------
>>>>>         0       5000       5000 Not indexed yet
>>>>>   -----------------------------
>>>>>     Total       5000       5000
>>>>>
>>>>>So now I try to run the indexer:
>>>>>
>>>>>./index -N 100
>>>>>
>>>>>And now the indexer gives the same damn error:
>>>>>
>>>>>No "Server" command for URL http://www.someserver.com/ - deleted.
>>>>>( 0  1  1  0  0  0  0 21) Adding URL: http://www.someserver.com/
>>>>>
>>>>>So all it did was delete all these URLs. I have tried every other
>>>>>combination I can think of after reviewing the ./index -h, but nothing
>>>>>seems to work. How in the world do you get these indexed using an
>>>>>external file?
>>>>>
>>>>>Also, before, when I hard coded all URLs in aspseek.conf, there were
>>>>>about 200 URLs which were always shown as "Not indexed yet". How in the
>>>>>heck do you get them indexed or delete the damn things?
>>>>>
>>>>>It doesn't make sense to have to add thousands of URLs in the
>>>>>aspseek.conf file every time you want to add new URLs to the list. You
>>>>>certainly don't want to set the system to reindex everything, especially
>>>>>if you just added 5,000 URLs the day before. That would use unnecessary
>>>>>bandwidth, to say the least.
>>>>>
>>>>>Anyone have any suggestions?
>>>>>
>>>>>end.
>>>>>
>>>>>
>>--
>>--------------------------------------------------------------
>>Kord Campbell                                    Grub.Org Inc.
>>President                               6051 N. Brookline #118
>>                                       Oklahoma City, OK 73112
>>[EMAIL PROTECTED]                            Voice: (405) 843-6336
>>http://www.grub.org                        Fax: (405) 848-5477
>>--------------------------------------------------------------
>>
>>
> 
> 
> 
> 


-- 
-- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
    Guinness a Day Keeps a Doctor Away (people's wisdom)