Re: [aseek-users] Why aren't I indexing .pdf files?

KEVIN ZEMBOWER Tue, 14 Jan 2003 10:18:46 -0800

SUCCESS!! After "myisamchk -v -o -B *.MYI" in the mysql/aspeek12 directory, making 
sure the Converter line was:
aspseek@www:~$ grep Conv etc/aspseek.conf
Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in > $out


and clearing and reindexing, I think it's working. Here's a new document as an example:
mysql> select * from urlword where url like '%kweng.pdf';
+--------+---------+---------+--------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+-----------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url                                              | next_index_time | status | crc                              | last_modified                 | etag                  | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+-----------------------+-----------------+----------+-----+------+-------+--------+
|    806 |       1 |       0 | http://www.jhuccp.org/popline/download/kweng.pdf |      1043171812 |    200 | 5519ee47d503f9d5a1c38d08c581c0cc | Wed, 11 Dec 2002 15:37:03 GMT | "bf4f-a5eb0-3df75b9f" |      1042567014 |      143 |   0 |    3 |     0 |      1 |
+--------+---------+---------+--------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+-----------------------+-----------------+----------+-----+------+-------+--------+
1 row in set (0.06 sec)

mysql> select url_id,wordcount,totalcount,charset,title,txt,docsize from urlwords06 where url_id="806";
+--------+-----------+------------+------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
| url_id | wordcount | totalcount | charset    | title             | txt                                                                                                                                                                                                                                                            | docsize |
+--------+-----------+------------+------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|    806 |      6283 |      48927 | iso-8859-1 | A User`s Guide to | A User`s Guide to  POPLINE  Keywords  Population Information Program  Center for Communication Programs  Johns Hopkins Bloomberg School of Public Health  Sixth Edition  2002  Keyword Dictionary ____________________________________________________________ |  679600 |
+--------+-----------+------------+------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
1 row in set (0.01 sec)

I think from this example that 6283 unique words were added to the index from 
/kweng.pdf.
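
The choice of urlwords06 for url_id 806 follows the rule discussed later in this thread: the table suffix is the url_id modulo 16, which Kir notes ASPseek computes as 'urlid & 15'. A minimal Python sketch of that mapping, assuming the two-digit zero-padded suffix seen in the names urlwords06 and urlwords12 (the function name is made up for illustration and is not part of ASPseek):

```python
def urlwords_table(url_id: int) -> str:
    """Return the urlwordsNN table name for a url_id.

    Assumes 16 tables with a two-digit, zero-padded suffix, as the
    names urlwords06 and urlwords12 in this thread suggest.
    """
    # For non-negative ids, "& 15" gives the same result as "% 16".
    return "urlwords%02d" % (url_id & 15)


print(urlwords_table(806))
print(urlwords_table(5244))
```

This reproduces the two table names used in the queries in this thread, which is a quick sanity check before searching a urlwordsNN table by hand.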

I noted, as the output of sbin/index -a rolled by, that more than half of my .pdf 
documents returned errors. Yet, these documents seem to be good .pdf documents to me, 
i.e., I can open and view them. Is there any converter which has a higher success rate?
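
One way to narrow this down might be to run the converter by hand over the PDFs, outside of ASPseek, and record which ones fail or produce no output. The Python sketch below is only an illustration: it fills in $in and $out the way the "exec" line in the index output suggests ASPseek does, and the helper names are assumptions, not anything ASPseek provides.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple

# Converter command template copied from etc/aspseek.conf in this thread.
TEMPLATE = "/usr/local/bin/pdftohtml -i -noframes -stdout $in > $out"


def build_command(template: str, in_path: str, out_path: str) -> str:
    """Substitute shell-quoted paths for $in and $out in a Converter template."""
    return (template.replace("$in", shlex.quote(in_path))
                    .replace("$out", shlex.quote(out_path)))


def try_convert(pdf: Path, template: str = TEMPLATE) -> Tuple[bool, str]:
    """Run the converter on one file; return (succeeded, stderr text)."""
    with tempfile.NamedTemporaryFile(suffix=".html") as out:
        cmd = build_command(template, str(pdf), out.name)
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        size = Path(out.name).stat().st_size
    # Count empty output as a failure even if the exit status was zero.
    return proc.returncode == 0 and size > 0, proc.stderr.strip()


def tally(directory: str) -> None:
    """Print one line per PDF under directory, then a summary count."""
    pdfs = sorted(Path(directory).rglob("*.pdf"))
    failures = 0
    for pdf in pdfs:
        ok, err = try_convert(pdf)
        if not ok:
            failures += 1
        first_line = err.splitlines()[0] if err else ""
        print(("OK   " if ok else "FAIL ") + str(pdf)
              + ("  " + first_line if first_line else ""))
    print(f"{failures} of {len(pdfs)} files failed or produced no output")
```

Calling tally() on a directory of PDFs would list which files the converter rejects and the first error line for each, which makes it easier to compare a different converter against the same set of files.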

Thank you, Kir, for all your help and suggestions.

-Kevin


Ahh, now I feel that I'm making some progress. Here's the output:
aspseek@www:~$ sbin/index -ma -T http://www.jhuccp.org/pr/j52/j52.pdf                
Loading configuration from /usr/local/aspseek/etc/db.conf
Loading configuration from /usr/local/aspseek/etc/ucharset.conf
Loading configuration from /usr/local/aspseek/etc/stopwords.conf
Loading configuration from /usr/local/aspseek/etc/aspseek.conf
Adding URL: http://www.jhuccp.org/pr/j52/j52.pdf
exec /usr/local/bin/pdftohtml -i -noframes -stdout /tmp/asijzPVQh > /tmp/asoQiP3Wq
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Status: OK
index process finished.
aspseek@www:~$ 

Kir, can you tell me if these errors are related to pdftohtml or to aspseek?
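
A side observation that might bear on this question: the crc stored for j52.pdf in the urlword rows quoted further down, d41d8cd98f00b204e9800998ecf8427e, is the MD5 digest of zero bytes of input. If ASPseek's crc column is an MD5 of the fetched or converted content (an assumption; nothing in this thread confirms how crc is computed), that value would be consistent with the converter handing back an empty document. The comparison is easy to reproduce in Python:

```python
import hashlib

# MD5 digest of an empty byte string.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# crc values copied from the urlword rows in this thread.
CRC_J52 = "d41d8cd98f00b204e9800998ecf8427e"    # url_id 5244, j52.pdf
CRC_KWENG = "5519ee47d503f9d5a1c38d08c581c0cc"  # url_id 806, kweng.pdf


def looks_empty(crc: str) -> bool:
    """True if crc equals the MD5 of empty input (hypothetical helper)."""
    return crc.lower() == EMPTY_MD5


print(looks_empty(CRC_J52), looks_empty(CRC_KWENG))
```

Under that assumption, a query for rows whose crc equals this value could be a quick way to list documents that were fetched but yielded no text.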

I also tried this line in etc/aspseek.conf:
Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in $out.html

I got the same result as above, however.

Thanks again, for looking into this for me.

-Kevin

>>> [EMAIL PROTECTED] 01/14/03 12:16PM >>>
Sorry, I forgot to add that you should add the -ma flags to force
reindexing. Otherwise index will look at next_index_time (which was
set to last_index_time + Period) and will not index the document because
next_index_time > now().

So, try 'index -ma -T http://www.jhuccp.org/pr/j52/j52.pdf'.
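
To illustrate the scheduling rule described above, here is a rough Python sketch (not ASPseek's actual code): a URL is skipped while its next_index_time lies in the future, unless reindexing is forced. The function name and the force parameter are hypothetical stand-ins for whatever the -ma flags do internally; the timestamps are the ones from the urlword row for j52.pdf.

```python
import time
from typing import Optional


def should_index(next_index_time: int,
                 now: Optional[int] = None,
                 force: bool = False) -> bool:
    """Decide whether a URL is due for indexing (illustrative only)."""
    if force:
        return True  # stand-in for forcing a reindex with -ma
    if now is None:
        now = int(time.time())
    # Skipped while next_index_time > now, as described in the message.
    return next_index_time <= now


LAST_INDEX_TIME = 1042496187  # from the urlword row for url_id 5244
NEXT_INDEX_TIME = 1043167913

shortly_after = LAST_INDEX_TIME + 3600  # an hour after the last index run
print(should_index(NEXT_INDEX_TIME, now=shortly_after))
print(should_index(NEXT_INDEX_TIME, now=shortly_after, force=True))
```

With these values the unforced check declines to index, which matches the earlier run that printed "Status: OK" without ever invoking the converter.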

By the way, check your db with myisamchk...just in case.

KEVIN ZEMBOWER wrote:
> Thanks, again, Kir, for your offer of help.
> 
> I had already fixed the case in the link to http://www.jhuccp.org/pr/j52/J52.pdf
> from the document http://www.jhuccp.org/popreporter/2002/08-19.shtml while I was
> writing the note, but forgot to update my snippet. Sorry for the confusion.
> 
> Here's the output you asked for:
> aspseek@www:~$ sbin/index -T http://www.jhuccp.org/pr/j52/j52.pdf 
> Loading configuration from /usr/local/aspseek/etc/db.conf
> Loading configuration from /usr/local/aspseek/etc/ucharset.conf
> Loading configuration from /usr/local/aspseek/etc/stopwords.conf
> Loading configuration from /usr/local/aspseek/etc/aspseek.conf
> Adding URL: http://www.jhuccp.org/pr/j52/j52.pdf 
> Status: OK
> index process finished.
> aspseek@www:~$ 
> 
> And yet:
> mysql> select * from urlword where url like '%pdf' limit 1;
> +--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> | url_id | site_id | deleted | url                                  | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
> +--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> |   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf |      1043167913 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
> +--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> 1 row in set (0.07 sec)
> 
> mysql> select * from urlwords12 where url_id="5244";       
> Empty set (0.00 sec)
> 
> mysql> 
> 
> Just for good measure, I checked all the urlwordsNN tables for '5244' without luck.
> 
> Are there any extra diagnostics or logging I could turn on to help with this
> problem? Any other suggestions?
> 
> Thanks, again, for your help.
> 
> -Kevin
> 
> 
>>>>[EMAIL PROTECTED] 01/14/03 11:50AM >>>
>>>
>>I've got links to .pdf files in my .shtml files which seem to be indexed fine:
>>aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
>>//var/www/main/htdocs/popreporter/2002/08-19.shtml:                            | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>
> 
> 
> The first thing I notice is that the document is named J52.pdf in the link, while it
> is available as j52.pdf from your server. Notice the case!
> 
> 
>><snip>
>>
>>There are 14 rows in the urlword table which end in '.pdf':
>>mysql> select * from urlword where url like '%pdf'; 
>>+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
>>| url_id | site_id | deleted | url                                                       | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
>>+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
>>|   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf                      |      1043164839 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
>><snip>
>>14 rows in set (0.06 sec)
>>
>>The "200" in the status column indicates that it was found.
>>
>>For this first .pdf document, I computed the urlwords table name as 'urlwords12'
>>(5244 mod 16)
> 
> 
> That is the right answer, although ASPseek uses 'urlid & 15', which gives the same
> result but is much more efficient ;)
> 
> , but there's no entry in that table for this url_id:
> 
>>mysql> select * from urlwords12 where url_id="5244";
>>Empty set (0.00 sec)
>>
>>This leads me to believe that .pdf documents are being checked, but not indexed.
>>
>>When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I
>>get HTML output, so pdftohtml seems to be working okay.
>>
>>Can anyone suggest any other diagnostics that could help me solve this problem? Any
>>thoughts or comments?
>>
>>Thank you all in advance for your help.
> 
> 
> Hmm...
> 
> Try index -T http://www.jhuccp.org/pr/j52/j52.pdf and see what happens.
> 


-- 
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==

Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
        -- Satchel Paige
