Thanks, again, Kir, for your offer of help. I had already fixed the case in the link to http://www.jhuccp.org/pr/j52/J52.pdf from the document http://www.jhuccp.org/popreporter/2002/08-19.shtml while I was writing the note, but forgot to update my snippet. Sorry for the confusion.
Here's the output you asked for:

aspseek@www:~$ sbin/index -T http://www.jhuccp.org/pr/j52/j52.pdf
Loading configuration from /usr/local/aspseek/etc/db.conf
Loading configuration from /usr/local/aspseek/etc/ucharset.conf
Loading configuration from /usr/local/aspseek/etc/stopwords.conf
Loading configuration from /usr/local/aspseek/etc/aspseek.conf
Adding URL: http://www.jhuccp.org/pr/j52/j52.pdf
Status: OK
index process finished.
aspseek@www:~$

And yet:

mysql> select * from urlword where url like '%pdf' limit 1;
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url                                  | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
|   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf |      1043167913 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
1 row in set (0.07 sec)

mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)

mysql>

Just for good measure, I checked all the urlwordsNN tables for '5244' without luck.

Are there any extra diagnostics or logging I could turn on to help with this problem? Any other suggestions?

Thanks, again, for your help.

-Kevin

>>> [EMAIL PROTECTED] 01/14/03 11:50AM >>>
> I've got links to .pdf files in my .shtml files which seem to be indexed fine:
> aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
> //var/www/main/htdocs/popreporter/2002/08-19.shtml: | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>

The first thing I notice is that the document is named J52.pdf in the link, while it is available as j52.pdf from your server. Notice the case!

> <snip>
>
> There are 14 rows in the urlword table which end in '.pdf':
> mysql> select * from urlword where url like '%pdf';
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> | url_id | site_id | deleted | url                                                       | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> |   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf                      |      1043164839 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
> <snip>
> 14 rows in set (0.06 sec)
>
> The "200" in the status column indicates that it was found.
>
> For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 mod 16)

That is the right answer, although ASPseek uses 'urlid & 15', which gives the same result but is much more efficient ;)

> , but there's no entry in that table for this url_id:
> mysql> select * from urlwords12 where url_id="5244";
> Empty set (0.00 sec)
>
> This leads me to believe that .pdf documents are being checked, but not indexed.
>
> When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I get HTML output, so pdftohtml seems to be working okay.
>
> Can anyone suggest any other diagnostics that could help me solve this problem? Any thoughts or comments?
>
> Thank you all in advance for your help.

Hmm... Try
index -T http://www.jhuccp.org/pr/j52/j52.pdf
and see what happens.

--
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==
Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
-- Satchel Paige
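
For anyone following the thread, the table-name arithmetic discussed above ('5244 mod 16' versus 'urlid & 15') can be sketched in a few lines of Python. This is only an illustration, not ASPseek's own code: the function names are hypothetical, and the two-digit table-name format is an assumption (the thread shows only 'urlwords12', which does not reveal whether smaller numbers are zero-padded). The second helper builds one SELECT per table, which is one way to repeat the "check all the urlwordsNN tables" step by hand.

```python
from typing import List

# From the thread: the table is chosen by url_id mod 16.
NUM_TABLES = 16

# ASSUMPTION: two-digit suffix. The thread only shows 'urlwords12', so pass
# name_format="urlwords{}" if your tables are not zero-padded.
DEFAULT_FORMAT = "urlwords{:02d}"


def urlwords_table(url_id: int, name_format: str = DEFAULT_FORMAT) -> str:
    """Return the urlwordsNN table name for a url_id (hypothetical helper)."""
    # For non-negative ids, 'url_id & 15' equals 'url_id % 16'.
    bucket = url_id & (NUM_TABLES - 1)
    return name_format.format(bucket)


def lookup_queries(url_id: int, name_format: str = DEFAULT_FORMAT) -> List[str]:
    """Build one SELECT per urlwordsNN table for the given url_id."""
    return [
        "select * from {} where url_id={};".format(name_format.format(n), url_id)
        for n in range(NUM_TABLES)
    ]


if __name__ == "__main__":
    print(urlwords_table(5244))  # the table computed in the thread
    for query in lookup_queries(5244):
        print(query)
```

Since 5244 = 16 * 327 + 12, both the modulo and the bit mask select table 12, which matches the 'urlwords12' computed in the quoted message.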
