According to Gregory McCann:
> After running htDig and htMerge as a daily cron job for quite a while
> with no problems, I started getting the following error messages...
>
> !! Error: Copying of text from this document is not allowed.
> !! Error (0): PDF file is damaged - attempting to reconstruct xref table...
> !! Error: Top-level pages object is wrong type (null)
> !! Error: Couldn't read page catalog
>
> Okay, so it's having problems with at least one PDF file - but which
> one? I have hundreds of PDF files on my site and this doesn't give me
> a clue where to begin looking. Surely at the time these error messages
> are generated, the program knows the name of the file causing the error.
> Couldn't that filename be included in the error message?
>
> Am I missing something here? Is there some way I can tell from this
> which file needs to be fixed?

The error messages come from pdftotext. Even if it did show the name of
the file it was trying to read at the time, that wouldn't be helpful,
because it's reading the PDF from a temporary file at that point. If you
run "htdig -v", you'll see what URL it's working on at the time the error
occurs.

You may not even have to do that, though. Just find the biggest PDF you
have, and set max_doc_size to something larger than its size. The errors
above occur because the PDF was truncated at max_doc_size before it was
passed to pdftotext.

See also http://www.htdig.org/FAQ.html#q5.2

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
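
As a rough illustration of the "find the biggest PDF" step, here is a minimal Python sketch, assuming the PDFs are static files under a document root you can read directly. The function names and the 25% headroom are made up for this example and are not part of ht://Dig.

```python
import os

def largest_pdf(root):
    """Return (size_in_bytes, path) for the largest .pdf under root, or None."""
    best = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if best is None or size > best[0]:
                    best = (size, path)
    return best

def suggested_max_doc_size(size):
    """Add 25% headroom (an arbitrary choice) to the largest size found."""
    return size + size // 4
```

Calling largest_pdf() on your document root and passing the resulting size to suggested_max_doc_size() gives a number you could put in htdig.conf as "max_doc_size: <value>", on the assumption that the value is given in bytes. If the PDFs are generated dynamically, the size on disk won't tell you anything, and "htdig -v" is the better route.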

