I'm trying to get my first installation of ASPseek working. It seems to index HTML documents fine, but now I'm trying to extend it to .pdf documents.

My aspseek.conf file looks like this:

aspseek@www:~$ grep -v '^[[:space:]]*$' etc/aspseek.conf |grep -v "^#"
Include db.conf
Include ucharset.conf
Include stopwords.conf
Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in > $out
DeleteNoServer no
Server http://www.jhuccp.org/
DeltaBufferSize 64
Disallow /cgi-bin/ \.cgi /nph
Disallow \.tif$ \.au$ \.mov$ \.jpe$ \.cur$ \.qt$
Disallow \.b$ \.sh$ \.md5$ \.rpm$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$ \.xpm$ \.xbm$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$ \.png$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.o$ \.a$ \.la$ \.so$ \.so\.[0-9]$
Disallow \.pat$ \.pm$ \.m4$ \.am$
Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow [^:]//
Disallow mmc/.*\.php
Disallow PHPTEST
aspseek@www:~$

I've got links to .pdf files in my .shtml files which seem to be indexed fine:

aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
//var/www/main/htdocs/popreporter/2002/08-19.shtml: | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>
<snip>

There are 14 rows in the urlword table which end in '.pdf':

mysql> select * from urlword where url like '%pdf';
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url                                  | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| 5244   | 1       | 0       | http://www.jhuccp.org/pr/j52/j52.pdf | 1043164839      | 200    | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" | 1042496187      | 2794     | 0   | 5    | 0     | 0      |
<snip>
14 rows in set (0.06 sec)

The "200" in the status column indicates that it was found. For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 mod 16), but there's no entry in that table for this url_id:

mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)

This leads me to believe that .pdf documents are being checked, but not indexed. When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I get HTML output, so pdftohtml seems to be working okay.

Can anyone suggest any other diagnostics that could help me solve this problem? Any thoughts or comments?

Thank you all in advance for your help.

-Kevin Zembower

-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD 21202
410-659-6139
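
P.S. In case anyone wants to re-check the table-name arithmetic, here is a minimal shell sketch of it. It only encodes the assumption I made in this post, namely that the table is 'urlwords' followed by url_id mod 16. The function name urlwords_table is a made-up helper for illustration and is not part of ASPseek.

```shell
#!/bin/sh
# Hypothetical helper (not part of ASPseek): derive the urlwordsNN table
# name from a url_id, assuming NN = url_id mod 16 as computed in this post.
urlwords_table() {
    echo "urlwords$(( $1 % 16 ))"
}

urlwords_table 5244   # prints urlwords12
```

If that assumption about the naming scheme is wrong, the empty result from urlwords12 would not mean much, so corrections are welcome.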
