I've been poking around a bit and have noticed a couple of odd behaviours. I'm using the most recent version from CVS, but I've seen very similar behaviour with 3.1.9.
 
First, if I do this:
# indexer foo
(after a few URLs, ^C to break)
# indexer foo
... then indexer continues with the rest of the non-indexed URLs, as I would expect.
 
But, if I do this:
# indexer foo -n 10
# indexer foo
Indexer[13686]: indexer from mnogosearch-3.1.12/MySQL started with 'foo'
Indexer[13686]: [1] Done (0 seconds)
# indexer foo -S
 
          Database statistics
 
    Status    Expired      Total
   -----------------------------
         0          0         48 Not indexed yet
       200          0         10 OK
   -----------------------------
     Total          0         58
 
... I would expect it to continue with the non-indexed URLs, but it assumes they're all up to date.  Is this expected behaviour or a bug?
 
Also, even though I have "Index no" in my config, I occasionally see lines like these in my MySQL query log:
 
...
                     33 Query      INSERT INTO url (url,referrer,hops,crc32,last_index_time,next_index_time,status,tag,category) VALUES ('http://barracuda.enhydra.org/media/header/enhydraFtr.gif',1,1,0,983780385,983780385,0,'','')
                     33 Query      DELETE FROM dict WHERE url_id=1
                     33 Query      UPDATE url SET status=200,last_mod_time=983780385,next_index_time=984385185,tag='',txt='...........................  ...........................  ...........................  ...........................  ...........................  ...........................                            Barracuda Project  About Barracuda  Project Mail Lists',title='The home of Barracuda at Enhydra.org',content_type='text/html',docsize=26706,keywords='j2ee enhydra lutris java application server xml open source JDDI XMLC XML Compiler wireless chtml xhtml wml J2EE',description='',crc32=1453362912,lang='',category='' WHERE rec_id=1
 
... looks like it's storing index info?  Is indexer somehow losing its configuration, or do I have something mis-configured?
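
A quick way to read the timestamps in the UPDATE above is to decode them. This is only a small Python sketch; it assumes the values are Unix epoch seconds, which the log excerpt itself does not confirm.

```python
from datetime import datetime, timezone

# Values copied from the UPDATE query above.
# Assumption: both columns hold Unix epoch seconds.
last_mod_time = 983780385
next_index_time = 984385185

# Gap between the modification time and the scheduled next index time.
gap_seconds = next_index_time - last_mod_time
gap_days = gap_seconds / 86400

print(datetime.fromtimestamp(last_mod_time, tz=timezone.utc).isoformat())
print(gap_seconds, "seconds =", gap_days, "days")
```

The gap works out to 604800 seconds, i.e. exactly 7 days, which would be consistent with the URL being scheduled for a normal re-index a week later rather than just being checked.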
 
cheers,
Damon
 
----- Original Message -----
Sent: Monday, March 05, 2001 12:02 PM
Subject: link validation config help!

Hi,
I'm trying to use mnogosearch as a link validator for a large number of sites, but I ran into a serious problem.
 
Here's my configuration, in its simplest form:
 
DBAddr ...
DeleteBad no
Index no
CheckOnly NoMatch Regex ^http://barracuda\.enhydra\.org/.*\.html$
Realm *
 
This works beautifully, checking the existence of links outside barracuda.enhydra.org but not following them.  Except when indexer gets to this link, it follows it and starts indexing the other site.
 
<A href="http://www.sys-con.com/java/readerschoice2001/">
 
So now indexer is following through that page, all of its links, etc, and suddenly indexer is trying to check the whole world, ignoring the CheckOnly parameter.
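
As a sanity check on the pattern itself (not on how indexer applies it), the regex from the CheckOnly line can be tried against the URLs mentioned here. This is a minimal Python sketch: the index.html URL is a made-up example, the helper name is hypothetical, and "NoMatch means CheckOnly applies to URLs that do not match" is an assumed reading of the config, not something taken from the mnogosearch source.

```python
import re

# Pattern copied from the CheckOnly line in the configuration above.
PATTERN = re.compile(r"^http://barracuda\.enhydra\.org/.*\.html$")

def is_check_only(url):
    """Assumed NoMatch semantics: CheckOnly applies when the URL does NOT match."""
    return PATTERN.search(url) is None

urls = [
    "http://barracuda.enhydra.org/index.html",  # hypothetical page on the site
    "http://barracuda.enhydra.org/media/header/enhydraFtr.gif",
    "http://www.sys-con.com/java/readerschoice2001/",
]

for url in urls:
    action = "check only" if is_check_only(url) else "fetch and follow"
    print(url, "->", action)
```

By the regex alone, the sys-con.com link lands on the check-only side (as does the .gif on barracuda.enhydra.org), so the pattern itself does not look like the reason that site gets followed.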
 
I've tried different versions of the CheckOnly, with or without regex, splitting it into multiple lines, etc... nothing seems to help.  And indexer doesn't ignore the CheckOnly for all sites, just a few.
 
Any ideas?
 
(I first tried a Server-based method,
 
DBAddr ..
DeleteBad no
Index no
Follow site
 
but this does not validate links from this site to another.)
 
-Damon
