I have to admit I haven't followed this thread so far, but when Natalya writes "I don't get error message, but I have never .pdf-Files in my search-List!!!", I wonder whether a simple misunderstanding is the cause of the trouble...
As far as I understand, htdig doesn't index every file in a subdirectory; it only follows the URLs it finds on web pages. So if no URL points to a PDF file, no PDF will be indexed, and therefore none will show up in the search list.
I once wanted to index PDFs and created a single PHP file specifically for that purpose: it browsed the subdirectories recursively and simply generated a page with links to all the PDF files it found.
I pointed htdig to this particular file and voilà, all of the PDF files were indexed. So maybe this is the problem: no links to the PDF files.
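The link-page idea described above can be sketched roughly as follows. This is not the original PHP file, only a minimal Python illustration of the same approach; the function name build_pdf_index and the parameters root_dir and base_url are assumptions made for this example.

```python
import html
import os
from urllib.parse import quote


def build_pdf_index(root_dir, base_url):
    """Return an HTML page with one link per .pdf file found under root_dir.

    root_dir is the directory on disk that the web server publishes at
    base_url (both are placeholders chosen for this sketch).
    """
    items = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # sort in place so the traversal order is stable
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            # Path relative to root_dir, written with forward slashes.
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            rel = rel.replace(os.sep, "/")
            href = base_url.rstrip("/") + "/" + quote(rel)
            items.append(
                '<li><a href="%s">%s</a></li>'
                % (html.escape(href, quote=True), html.escape(rel))
            )
    return (
        "<html><head><title>PDF index</title></head><body>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body></html>\n"
    )
```

The returned string would be saved as an HTML file inside the web tree, and that file's URL used as the starting point for htdig, so that every PDF is reachable through a link.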
If this point has already been cleared up in previous mails on this issue, I apologize for not having read them.
All the best! Martin [EMAIL PROTECTED]
David Adams wrote:
Thank you, that output establishes that htdig is reading a .pdf file.
The next question is: what is it doing with it? To answer that we need to see what you have in your configuration file.
David Adams Corporate Information Services Information Systems Services University of Southampton
----- Original Message ----- From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
To: "Gilles Detillieux" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, October 08, 2003 10:22 AM
Subject: Re: [htdig] PDF-SEARCH
Thank you very much for your help! I don't get error message, but I have never .pdf-Files in my search-List!!!
Here is htdig -ivvv output when start_url is a single PDF file. What is wrong???
[EMAIL PROTECTED]:~> htdig -ivvv
1:1:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
New server: intranet.panasonic.de, 80
Retrieval command for http://intranet.panasonic.de/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: intranet.panasonic.de

Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
Header line: Last-Modified: Tue, 21 Aug 2001 22:00:00 GMT
Converted Tue, 21 Aug 2001 22:00:00 GMT to Tue, 21 Aug 2001 22:00:00
Header line: ETag: "44005-e7-3b82d9e0"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 231
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 231 from document
Read a total of 231 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: # exclude help system from robots
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /manual/
Found 'disallow' line: /manual/
Robots.txt line: Disallow: /doc/
Found 'disallow' line: /doc/
Robots.txt line: Disallow: /gif/
Found 'disallow' line: /gif/
Robots.txt line: # but allow htdig to index our doc-tree
Robots.txt line: User-agent: susedig
Found 'user-agent' line: susedig
Robots.txt line: Disallow:
Robots.txt line: # disallow stress test
Robots.txt line: user-agent: stress-agent
Found 'user-agent' line: stress-agent
Robots.txt line: Disallow: /
Pattern: /manual/|/doc/|/gif/
pushed
pick: intranet.panasonic.de, # servers = 1
0:0:0:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf:
Retrieval command for http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: GET /pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: intranet.panasonic.de

Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
Header line: Last-Modified: Fri, 29 Aug 2003 11:25:19 GMT
Converted Fri, 29 Aug 2003 11:25:19 GMT to Fri, 29 Aug 2003 11:25:19
Header line: ETag: "314005-51e38-3f4f381f"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 335416
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 7736 from document
Read a total of 335416 bytes
size = 335416
pick: intranet.panasonic.de, # servers = 1
[EMAIL PROTECTED]:~>
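For anyone reading a verbose dump like the one above, the two facts that matter first are the Content-Type the server reported and the number of bytes htdig actually read. A minimal Python sketch for pulling those out of the log text follows; the helper name summarize_htdig_log is hypothetical, and the pattern assumes only the "Content-Type:" and "Read a total of N bytes" wording visible in this particular output.

```python
import re


def summarize_htdig_log(log_text):
    """Return (content_type, total_bytes) pairs, one per retrieved document.

    Relies only on two phrases seen in the output quoted above:
    "Content-Type: <type>" and "Read a total of <N> bytes".
    """
    pattern = re.compile(
        r"Content-Type:\s*(\S+).*?Read a total of (\d+) bytes",
        re.DOTALL,  # the two phrases may be on different lines
    )
    return [(ctype, int(size)) for ctype, size in pattern.findall(log_text)]
```

Such a summary only confirms that the file was fetched and how much of it was read. It says nothing about what happened to the document afterwards, which depends on the configuration file.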
According to Natalya Kolesnikova:
Maybe I am stupid, but it doesn't work for me! Can somebody help me? I have tried with acroread and with the external parser xpdf, but it doesn't work!!!!
I need the Installation Guide!!! :)))
See http://www.htdig.org/FAQ.html#q4.9
That is the installation guide for PDF indexing. If you've carefully read and implemented everything recommended there, and checked out FAQs 5.2 and 5.37 as David recommended (twice), then please provide more details, such as what error messages you get, or give us an excerpt of htdig -ivvv output when start_url is set to point to just one single PDF file.
There are dozens of potential points of failure in this process, so simply saying "it doesn't work" gives us no information that can help pinpoint which point of failure is the one that needs to be addressed.
Also, make sure you have links in your HTML files to all PDF files you want to index. (See http://www.htdig.org/FAQ.html#q5.25)
-- Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ ht://Dig general mailing list: <[EMAIL PROTECTED]> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html List information (subscribe/unsubscribe, etc.) https://lists.sourceforge.net/lists/listinfo/htdig-general

