Re: [aseek-users] Why aren't I indexing .pdf files?

Kir Kolyshkin Tue, 14 Jan 2003 09:16:25 -0800

Sorry, I forgot to add that you should add the -ma flags to force reindexing. Otherwise index will look at next_index_time (which was
set to last_index_time + Period) and will not index the document, because
next_index_time > now().

So, try 'index -ma -T http://www.jhuccp.org/pr/j52/j52.pdf'.
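
The scheduling check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name is_due, the force parameter standing in for the -ma flags, and the choice of "now" (the timestamp of this message) are hypothetical, and none of this is ASPseek's actual code. The next_index_time value is the one from the urlword row quoted further down.

```python
from datetime import datetime, timezone


def is_due(next_index_time: int, now: int, force: bool = False) -> bool:
    """Return True if a URL would be (re)indexed at time `now`.

    Hypothetical model of the check described above: a document is
    skipped while next_index_time > now, unless reindexing is forced.
    """
    return force or next_index_time <= now


# next_index_time taken from the urlword row for url_id 5244.
NEXT_INDEX_TIME = 1043167913

# Assumed "now": the timestamp of this message, 14 Jan 2003 09:16:25 -0800.
now = int(datetime(2003, 1, 14, 17, 16, 25, tzinfo=timezone.utc).timestamp())

print(is_due(NEXT_INDEX_TIME, now))              # not yet due, so skipped
print(is_due(NEXT_INDEX_TIME, now, force=True))  # forced, so reindexed
```

Under these assumptions the stored next_index_time is still about a week in the future, which would explain why a plain 'index -T' run reports OK without touching the document.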

By the way, check your db with myisamchk...just in case.

KEVIN ZEMBOWER wrote:

Thanks, again, Kir, for your offer of help.

I had already fixed the case in the link to http://www.jhuccp.org/pr/j52/J52.pdf from the document http://www.jhuccp.org/popreporter/2002/08-19.shtml while I was writing the note, but forgot to update my snippet. Sorry for the confusion.

Here's the output you asked for:
aspseek@www:~$ sbin/index -T http://www.jhuccp.org/pr/j52/j52.pdf
Loading configuration from /usr/local/aspseek/etc/db.conf
Loading configuration from /usr/local/aspseek/etc/ucharset.conf
Loading configuration from /usr/local/aspseek/etc/stopwords.conf
Loading configuration from /usr/local/aspseek/etc/aspseek.conf
Adding URL: http://www.jhuccp.org/pr/j52/j52.pdf
Status: OK
index process finished.
aspseek@www:~$
And yet:
mysql> select * from urlword where url like '%pdf' limit 1;
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url | next_index_time | status | crc | last_modified | etag | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| 5244 | 1 | 0 | http://www.jhuccp.org/pr/j52/j52.pdf | 1043167913 | 200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" | 1042496187 | 2794 | 0 | 5 | 0 | 0 |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
1 row in set (0.07 sec)

mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)

mysql>
Just for good measure, I checked all the urlwordsNN tables for '5244' without luck.
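
One detail in the row above may be worth checking: the crc value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input. Whether ASPseek's crc column holds an MD5 of the fetched or converted document body is an assumption here, not something stated in this thread. If it does, the value would be consistent with the indexer having seen an empty document. A minimal Python check of the digest itself:

```python
import hashlib

# crc value copied from the urlword row for url_id 5244.
CRC_FROM_ROW = "d41d8cd98f00b204e9800998ecf8427e"

# MD5 digest of zero bytes of input.
empty_md5 = hashlib.md5(b"").hexdigest()

# Compare the stored crc with the digest of empty input.
print(empty_md5 == CRC_FROM_ROW)
```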

Are there any extra diagnostics or logging I could turn on to help with this problem? Any other suggestions?

Thanks, again, for your help.

-Kevin
[EMAIL PROTECTED] 01/14/03 11:50AM >>>
I've got links to .pdf files in my .shtml files which seem to be indexed fine:
aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
/var/www/main/htdocs/popreporter/2002/08-19.shtml: | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>
The first thing I notice is that the document is named J52.pdf while it is available
as j52.pdf from your server. Notice the case!
<snip>

There are 14 rows in the urlword table which end in '.pdf':
mysql> select * from urlword where url like '%pdf';
+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url | next_index_time | status | crc | last_modified | etag | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| 5244 | 1 | 0 | http://www.jhuccp.org/pr/j52/j52.pdf | 1043164839 | 200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" | 1042496187 | 2794 | 0 | 5 | 0 | 0 |
<snip>
14 rows in set (0.06 sec)

The "200" in the status column indicates that it was found.

For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 mod 16)

That is the right answer, although ASPseek uses 'urlid & 15', which gives the same result but is much more efficient ;)

, but there's no entry in that table for this url_id:
mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)
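
The table-selection arithmetic discussed above can be sketched in Python. The helper name urlwords_table and the two-digit zero padding of the suffix are assumptions made for illustration (the thread only shows 'urlwords12' and refers to "urlwordsNN"). The point is simply that, for a power-of-two table count, masking with 15 and taking the remainder modulo 16 pick the same table.

```python
def urlwords_table(url_id: int, tables: int = 16) -> str:
    """Name of the urlwordsNN table for url_id (hypothetical helper)."""
    # For a power-of-two table count, masking with (tables - 1)
    # gives the same result as url_id % tables.
    index = url_id & (tables - 1)
    # Assumption: the suffix is zero-padded to two digits ("urlwordsNN").
    return f"urlwords{index:02d}"


# Table for the url_id discussed in this thread.
print(urlwords_table(5244))

# The mask and the modulo agree for non-negative ids.
print(all((i & 15) == (i % 16) for i in range(100_000)))
```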

This leads me to believe that .pdf documents are being checked, but not indexed.

When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I get HTML output, so pdftohtml seems to be working okay.

Can anyone suggest any other diagnostics that could help me solve this problem? Any thoughts or comments?

Thank you all in advance for your help.
Hmm...

Try index -T http://www.jhuccp.org/pr/j52/j52.pdf and see what happens.


--
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==

Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
       -- Satchel Paige
