Hi Jim,
It turns out it wasn't the problem I thought it was.
I did some more playing around and found out what the problem is. The
match was not an _exact_ match for the phrase I was looking for. I was
looking for a three-word phrase; two words matched and one was just
similar. In those conditions it seems an excerpt isn't shown for the near
phrase match. I didn't intend to look for a fuzzy match; I just didn't
remember the phrase correctly when I searched for it as a test.
When I type in an exact match for something at the end of the document it all works fine.
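To illustrate what I think is going on, here is a rough Python sketch. It is not ht://Dig's actual code; the function name find_excerpt and the context parameter are made up for the example. It only assumes that the excerpt is built by locating the exact phrase in the stored text, so a near match finds nothing to show.

```python
def find_excerpt(head, phrase, context=40):
    """Return text around an exact, case-insensitive phrase match, else None.

    Hypothetical helper for illustration only; not part of ht://Dig.
    """
    pos = head.lower().find(phrase.lower())
    if pos == -1:
        # No exact match in the stored text, so there is nothing to excerpt.
        return None
    start = max(0, pos - context)
    end = min(len(head), pos + len(phrase) + context)
    return head[start:end]


head = "The report ends with a note on quarterly sales figures."
exact = find_excerpt(head, "quarterly sales figures")  # phrase typed correctly
near = find_excerpt(head, "quarterly sale figures")    # one word only similar
```

Here exact holds the surrounding text while near is None, which is the same pattern I saw: hits are listed, but no excerpt is shown for the misremembered phrase.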
Thanks for your suggestions all the same.
Regards, SimonB
Jim wrote:
On Sat, 23 Oct 2004, Simon Blandford wrote:
I installed the CVS version of Htdig on 15-Oct-2004. I have a batch of PDF files I am indexing that are about 30kB each. Everything seems to work OK except if I search for a phrase that I know is right at the end of one of the documents I get the message "None of the search words were found in the top of this document." for each of the hits. I have increased max_head_length to 100kB, 1MB, 10MB but whatever I do it just won't work. I have also tried reducing max_head_length down to a low value and searching for phrases in the middle of a document to check it is doing anything at all. It is.
Sounds like you probably have this covered already, but just to be thorough... Are you 100% certain that the databases are being rebuilt from scratch after changing the value of max_head_length? The attribute applies only to htdig and a failure to reindex all relevant documents would result in the type of problem you are seeing.
If you are sure that the databases are being rebuilt, you might try running the htdump program and taking a look at the document database. The text used for excerpts will be part of the dump and allow you to see if the text you are looking for is at least making it into the database.
The only other thing I can think of to suggest at the moment is that you try running one of the PDFs through the parser outside of htdig to see what sort of output is being generated. If, for example, megabytes of garbage were being generated, that might push the content you are interested in beyond the max_head_length limit. You might also try indexing a couple of the PDFs with a lot of -v options (the more you add, the more verbose the output). It might be that the output will contain something that will point you in the right direction.
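As a rough way to check the parser output, something like the following Python sketch might help. It assumes you have the parser's text output as a string; the helper names phrase_offset and within_head are made up for the example, and it counts characters rather than bytes, so treat the result as approximate.

```python
def phrase_offset(text, phrase):
    """Return the offset of the first case-insensitive match, or -1 if absent."""
    return text.lower().find(phrase.lower())


def within_head(text, phrase, max_head_length):
    """True if the whole phrase lies inside the first max_head_length characters.

    Hypothetical check for illustration; htdig applies its own limit internally.
    """
    pos = phrase_offset(text, phrase)
    return pos != -1 and pos + len(phrase) <= max_head_length


# Stand-in for parser output: 5000 characters of filler, then the real content.
sample = "x" * 5000 + " the closing phrase"

too_small = within_head(sample, "closing phrase", 1000)     # limit cuts it off
big_enough = within_head(sample, "closing phrase", 100000)  # limit covers it
```

If the offset comes out far larger than the size of the original document, that would suggest the parser is emitting extra material ahead of the text you care about.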
Jim