aseek-devel  

Re: [aseek-devel] How to index external list of URLs?

Kord Campbell
Wed, 18 Sep 2002 10:50:12 -0700

How fast does it return a search result?  We had managed to index
about a million sites about a year and a half ago, and the search
times were horrible.

Oh, BTW, we do a fair bit of crawling the Internet ourselves. I've
always envisioned that aspseek could have a plugin to take data
from us, but we figured that it couldn't handle the millions of
URLs that we were crawling every day.

Kord

On Wed, 18 Sep 2002, Yuriy Soroka wrote:

> Yes,
>
> I have indexed 255,179 URLs.
> I was indexing in batches of 20,000 - 40,000 URLs.
>
> var dir size - 1.5 GB
> I can't say for certain what the size of the MySQL database is.
>
> Hardware: 2 CPUs at 1.1 GHz each, about 1.5 GB of RAM
> OS - FreeBSD 4.5 release p6
>
> no special kernel/mysql tuning was done.
>
>
>
>
> ----- Original Message -----
> From: "Gregory Kozlovsky" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 18, 2002 7:05 PM
> Subject: RE: [aseek-devel] How to index external list of URLs?
>
>
> > This is interesting. Can you share with us the size of your database (in
> > docs and in GB), details of your hardware, and tuning of the Linux kernel
> > and the mysql server?
> >
> >      Gregory Kozlovsky
> >
> > -----Original Message-----
> > From: Yuriy Soroka [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, 18 September 2002 02:43
> > To: [EMAIL PROTECTED]
> > Subject: Re: [aseek-devel] How to index external list of URLs?
> >
> >
> > Why don't you just include them in aspseek.conf?
> >
> > I indexed 250,000 URLs.
> >
> > Include myfile.txt
> >
> >
> > ----- Original Message -----
> > From: "J and T" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 18, 2002 3:10 AM
> > Subject: [aseek-devel] How to index external list of URLs?
> >
> >
> > > How in the world do you index a list of URLs NOT in the aspseek.conf? I have
> > > tried everything I can think of:
> > >
> > > ./index -i -f myfile.txt
> > > ./index -N 100
> > >
> > > Doesn't work. The myfile.txt lists 5,000 URLs like this:
> > >
> > > Server http://someserver.com/
> > >
> > > But when I run the above (i.e., ./index -i -f myfile.txt)
> > >
> > > I get the following error:
> > >
> > > Bad URL: Server http://someserver.com/
> > >
> > > So I removed the "Server " so now it reads:
> > >
> > > http://someserver.com/
> > >
> > > Did the same thing:
> > >
> > > ./index -i -f myfile.txt
> > >
> > > Now it shows them in the database:
> > >
> > > ./index -S
> > >
> > > ASPseek database statistics
> > >
> > >     Status    Expired      Total
> > >    -----------------------------
> > >          0       5000       5000 Not indexed yet
> > >    -----------------------------
> > >      Total       5000       5000
> > >
> > > So now I try to run the indexer:
> > >
> > > ./index -N 100
> > >
> > > And now the indexer gives the same damn error:
> > >
> > > No "Server" command for URL http://www.someserver.com/ - deleted.
> > > ( 0  1  1  0  0  0  0 21) Adding URL: http://www.someserver.com/
> > >
> > > So all it did was delete all these URLs. I have tried every other
> > > combination I can think of after reviewing the ./index -h, but nothing seems
> > > to work. How in the world do you get these indexed using an external file?
> > >
> > > Also, before, when I hard-coded all URLs in aspseek.conf, there were about 200
> > > URLs which were always shown as "Not indexed yet". How in the heck do you get
> > > them indexed or delete the damn things?
> > >
> > > It doesn't make sense to have to add thousands of URLs to the aspseek.conf
> > > file every time you want to add new URLs to the list. You certainly don't
> > > want to set the system to reindex everything, especially if you just added
> > > 5,000 URLs the day before. That would use unnecessary bandwidth, to say the
> > > least.
> > >
> > > Anyone have any suggestions?
> > >
> > > end.
>

-- 
--------------------------------------------------------------
Kord Campbell                                    Grub.Org Inc.
President                               6051 N. Brookline #118
                                       Oklahoma City, OK 73112
[EMAIL PROTECTED]                            Voice: (405) 843-6336
http://www.grub.org                        Fax: (405) 848-5477
--------------------------------------------------------------
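
For anyone following the thread: the suggestion quoted above (keep the URLs in a separate file and pull it in with "Include myfile.txt" from aspseek.conf, one "Server" line per URL) could be prepared with a small script along these lines. This is only a sketch, not part of ASPseek. The helper names to_server_lines and batches are made up for illustration, and the batch size simply mirrors the 20,000 - 40,000 figure Yuriy mentions. That https URLs are acceptable is also an assumption.

```python
from urllib.parse import urlsplit


def to_server_lines(urls):
    """Turn raw URLs into 'Server <url>' lines.

    Blank lines, duplicates, and entries that are not http/https URLs
    are skipped. Accepting https is an assumption, not something the
    thread confirms. Hypothetical helper, not an ASPseek tool.
    """
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        # Tolerate input that already carries the "Server " prefix.
        if url.startswith("Server "):
            url = url[len("Server "):].strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        lines.append("Server " + url)
    return lines


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    raw_urls = ["http://someserver.com/", "", "http://someserver.com/"]
    for line in to_server_lines(raw_urls):
        print(line)
```

The resulting lines would be written to myfile.txt (or to one file per batch) and referenced from aspseek.conf with the Include line quoted above. Whether the indexer copes with lists of that size is something only Yuriy's report in this thread speaks to.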