[aseek-users] Why aren't I indexing .pdf files?

KEVIN ZEMBOWER Tue, 14 Jan 2003 08:11:31 -0800

I'm trying to get my first installation of aspseek working. It seems to index HTML 
documents fine, but now I'm trying to expand into .pdf documents.


My aspseek.conf file looks like this:
aspseek@www:~$ grep -v '^[[:space:]]*$' etc/aspseek.conf |grep -v "^#"
Include db.conf
Include ucharset.conf
Include stopwords.conf
Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in > $out
DeleteNoServer no
Server  http://www.jhuccp.org/
DeltaBufferSize 64
Disallow /cgi-bin/ \.cgi /nph
Disallow \.tif$  \.au$   \.mov$  \.jpe$  \.cur$  \.qt$
Disallow \.b$    \.sh$   \.md5$   \.rpm$
Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
Disallow \.lha$  \.lzh$  \.tar\.Z$  \.rar$  \.zoo$
Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$ \.xpm$ \.xbm$
Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$  \.png$
Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
Disallow \.rtf$  \.cdf$  \.ps$
Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
Disallow \.o$ \.a$ \.la$ \.so$ \.so\.[0-9]$
Disallow \.pat$ \.pm$ \.m4$ \.am$
Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow [^:]//
Disallow mmc/.*\.php
Disallow PHPTEST
aspseek@www:~$ 

I've got links to .pdf files in my .shtml files which seem to be indexed fine:
aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
//var/www/main/htdocs/popreporter/2002/08-19.shtml:                            | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>
<snip>

There are 14 rows in the urlword table which end in '.pdf':
mysql> select * from urlword where url like '%pdf'; 
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url                                  | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
|   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf |      1043164839 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
<snip>
14 rows in set (0.06 sec)

The "200" in the status column indicates that it was found.
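
One more detail in this row may be worth checking. The crc value shown, d41d8cd98f00b204e9800998ecf8427e, is the MD5 digest of zero bytes of input. Assuming the crc column holds an MD5 checksum of the document body as the indexer received it (an assumption, not something confirmed in this message), that value would be consistent with the converter handing back no content. A minimal Python sketch to confirm the digest, with a hypothetical helper name:

```python
import hashlib

# crc value copied from the urlword row for url_id 5244
CRC_FROM_ROW = "d41d8cd98f00b204e9800998ecf8427e"


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of data."""
    return hashlib.md5(data).hexdigest()


# Digest of an empty byte string, i.e. a document with no content at all
empty_digest = md5_hex(b"")

# Compare against the value stored for the .pdf URL
print("crc matches empty input:", empty_digest == CRC_FROM_ROW)
```

If the comparison holds, comparing the crc values of the other 13 .pdf rows would show whether they all share the same digest.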

For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 
mod 16), but there's no entry in that table for this url_id:
mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)
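
The table-name arithmetic above can be sketched in Python. The function name is hypothetical, and the split into 16 urlwords tables is taken only from the "mod 16" calculation in this message, not from the ASPseek documentation. How single-digit remainders are written in the table name is not addressed here.

```python
# Number of urlwords tables, assumed from the "mod 16" used in the message
NUM_URLWORDS_TABLES = 16


def urlwords_table(url_id: int, num_tables: int = NUM_URLWORDS_TABLES) -> str:
    """Return the urlwords table name for a url_id, using url_id mod num_tables."""
    return f"urlwords{url_id % num_tables}"


# url_id 5244 is the first .pdf row from the urlword query
print(urlwords_table(5244))
```

Running the same calculation for the other .pdf url_id values would show which tables to query for them.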

This leads me to believe that .pdf documents are being checked, but not indexed.

When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I 
get HTML output, so pdftohtml seems to be working okay.

Can anyone suggest any other diagnostics that could help me solve this problem? Any 
thoughts or comments?

Thank you all in advance for your help.

-Kevin Zembower

-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD  21202
410-659-6139
