Hello, Andrej,
My answers are below.
----- Original Message -----
From: "Andrej Filipcic" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, February 09, 2001 3:34 AM
Subject: [aseek-users] Few questions about aspseek
>
> Hello,
>
> First of all, thanks for the great engine. I have been using udmsearch for
> indexing about 2M sites in Slovenia (.si) and aspseek performs much better
> at this scale.
Thanks.
> However, I have some questions about it.
>
> 1) Handling deleted documents: Are those never removed from database? There is
> also a problem with robots.txt with deleted=1 flag. It seems that indexer
> does not check this file any more, unless I put deleted=0 by hand in db.
> This is really a problem if some site admin puts robots.txt after some huge
> access to his site...
First of all, a document is marked deleted only if the DeleteBad parameter for
the server is on in aspseek.conf.
The document is not deleted at indexing time because its ID will still exist
in the reverse index until the delta files are updated.
Documents marked deleted are not actually removed from the DB, so a particular
document can be restored (to deleted=0) later if it appears again.
As for the problem with robots.txt, we will fix it in the next release. For now
you can try using "DeleteBad off" in aspseek.conf.
>
> 2) index -D runs quite slow. I have 1GB mem dual alpha for db and dual
> PIII 512MB for indexer. mysql has 256MB of key_buffer. Loading ranks takes
> 2 minutes per urlwordsXX table. Saving citation is fast. After indexing
> several 100k documents, index -D takes about 2-3 hours. Is this normal?
Yes, it is normal: 2-3 hours of saving after 1 day of indexing. index -D
at www.aspseek.com takes even longer because the database is bigger.
> Also, indexing few pages triggers index -D at the end, which takes some time.
> Is there a flag to only perform indexing without saving delta files, ranking
> etc?
A running index can be safely terminated with the command "index -E" from
another terminal; in this case the delta files will not be saved.
You can also try running index with the -n parameter.
Note that if you don't save the delta files, the searcher will not see any
changes made by "index".
> Btw, what are the files in <topdir>var/aspseek/NNw/ used for?
These are the files of the reverse index. They are the files that "searchd" uses
for the search itself. Each file contains information about one particular word.
A file is created only if its size is > 1000 bytes; otherwise the data is stored
in the BLOB "wordurl.urls". The name of each file is the word ID, which is
stored in the field "wordurl.word_id".
>
>
> 3) I am trying to port aspseek to linux alpha.
> I am having some problems with "LONG" issues. Since I would like the db to be
> compatible for 64bit and 32bit architecture, I have defined LONG and ULONG to
> be int on alpha, and added TLONG and TULONG as true long for mysql access and
> compressing. indexer works, but I still have some problems with searchd.
> I will send a patch when it is ready.
Thanks in advance.
>
> 4) Considering words with accents, there are plenty of pages where they
> are written in 7bit (ccaron like c, etc.). As for word forms with ispell,
> would it be difficult to extend searching for words with accents to
> include words without accents? For example, synonym for
> "filipčič" would be "filipcic"?
>
This is not too difficult; we will look into it later.
Or, if you want to do it yourself, look at the "ParseText1" function in
"parse.cpp".
>
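To show the idea of treating the unaccented spelling as a synonym, here is a
minimal sketch in Python. It is not how ASPseek does it and is not taken from
"ParseText1"; the function names are invented for the example, and it assumes
the words are available as Unicode strings.

```python
import unicodedata
from typing import List


def strip_accents(word: str) -> str:
    """Return the word with combining accent marks removed."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_variants(word: str) -> List[str]:
    """Return the word itself plus its unaccented form, if that differs."""
    folded = strip_accents(word)
    return [word] if folded == word else [word, folded]
```

With this, search_variants("filipčič") gives both "filipčič" and "filipcic".
Letters that have no Unicode decomposition (for example "đ") are left
unchanged and would need an explicit mapping table.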
> Best regards,
>
> Andrej
>
Thank you for your interest in ASPseek.
Alexander.
> --
> _____________________________________________________________
> dr. Andrej Filipcic, E-mail: [EMAIL PROTECTED]
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674 Fax: +386-1-425-7074
> -------------------------------------------------------------
>