Re: [htdig] Verifying PDF indexed documents

Gilles Detillieux Thu, 04 Oct 2001 10:21:43 -0700

According to Curtis J. Peredina:
> Thank you for your reply. I'm a bit new to htdig (looking to replace
> Verity) so I'm still catching up on the documentation.
> 
> This is what I get with the -vvvv
> 
> Tag: A HREF="/customers/wsipc.pdf">, matched 2
> A tag: pos = 2, position = ="/customers/wsipc.pdf">
> word: Washington@640
> word: School@640
> word: public@641
> word: schools@641
> word: adopt@641
> word: ASP@642
> word: technology@642
> word: from@643
> word: ASPen@643
> href: http://www.progress.com/customers/wsipc.pdf (Washington School public schools adopt ASP technology from ASPen ...)
> resolving 'http://www.progress.com/customers/wsipc.pdf'
> 
> 
> I can't find any indication of a segfault in the output, and it seems as
> if it's grabbing words.

Yeah, but those words are from the document that links to the PDF,
not from the PDF itself.  You're still looking at the wrong section of the
debugging output.  When htdig encounters links to other documents,
all it does is queue them up for later retrieval.  That's what those
"pushing" messages were all about.  It may be quite a lot later that it
actually pops one of these URLs off the queue and fetches the document,
depending on how much was in the queue already.  Find the part where htdig
actually fetches a PDF, and then look to see what it does after that.
Even if acroread isn't segfaulting, there may be other clues there.
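
On a large log, one way to find that part is a small filter that skips
the link-discovery lines and shows what follows every other mention of
the PDF.  The following is only a sketch in Python.  The prefixes it
skips are the ones seen in the excerpts in this thread, and it assumes
(this is not confirmed here) that the lines htdig prints when it really
fetches a document also contain the file name.  The function name and
the log file name are made up for illustration.

```python
# Prefixes of lines that only record a link being found and queued,
# taken from the log excerpts quoted in this thread.
DISCOVERY_PREFIXES = ("pushing", "href:", "resolving", "Tag:", "A tag:", "+A tag:")


def fetch_context(log_lines, needle, after=15):
    """Return (line_number, context_lines) for each line that mentions
    `needle` and is not one of the known link-discovery lines.

    `after` is how many following lines to keep as context.
    """
    hits = []
    for index, line in enumerate(log_lines):
        stripped = line.strip()
        if needle not in stripped:
            continue
        if stripped.startswith(DISCOVERY_PREFIXES):
            continue  # the link was only queued here, not fetched
        hits.append((index + 1, log_lines[index:index + 1 + after]))
    return hits


# Hypothetical usage, assuming the dig output was saved to "htdig.log":
#
#   with open("htdig.log", errors="replace") as handle:
#       lines = handle.read().splitlines()
#   for number, context in fetch_context(lines, "wsipc.pdf"):
#       print("--- line", number)
#       print("\n".join(context))
```

If this prints nothing, either the PDF was never fetched during that
run or the assumption about the fetch lines is wrong, and searching the
log by hand is the fallback.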

> Gilles Detillieux wrote:
> > According to Curtis J. Peredina:
> > > OS: Solaris 2.7
> > >
> > > Latest htdig
> > 
> > What's the latest?  3.1.5?  3.2.0b3?  3.1.6 or 3.2.0b4 development
> > snapshot?  Based on what you wrote below, I'd guess 3.1.5.

So, did I guess right?

> > > I'm running the dig, and I have the correct pdf_parser parameter with the
> > > path to acroread.
> > >
> > > I'm also running with -vv to a logfile.
> > >
> > > I can't seem to search any PDF documents; they are not being displayed.
> > > Is there any way to verify they are being indexed?
> > >
> > > Here's a log excerpt:
> > >
> > >    pushing http://www.z.com/success/index.htm
> > > +A tag: pos = 2, position = ="/products/pavail.pdf">
> > >
> > >    pushing http://www.z.com/products/pavail.pdf
> > > +A tag: pos = 2, position = ="/products/lifecycle.htm">
> > 
> > That says what htdig does when it finds a link to a PDF, but not a whole
> > lot more.  That it's pushing the link says the file isn't being excluded
> > by bad_extensions or exclude_urls, but what would be more informative is
> > what htdig does when it actually attempts to fetch and index the document.
> > Try -vvv and look for errors there.
> > 
> > You didn't mention which version of acroread you're using,
> > but if it's version 4, I'll bet it's crashing on you.  See
> > http://www.htdig.org/FAQ.html#q5.2, http://www.htdig.org/FAQ.html#q4.9
> > and http://www.htdig.org/FAQ.html#q1.13

I think most, if not all, of the reports of acroread 4 crashing were on
Linux systems, so I don't know if the Solaris version was more solid.
What version of acroread do you have?  Can it convert some of your PDFs
to PostScript?  E.g.: "acroread -toPostScript wsipc.pdf"
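
To run that check over several files and see how the converter exits
each time, something like the following Python sketch could be used.
The command is the one quoted above.  The function name, the timeout
and the wording of the results are assumptions made for illustration.

```python
import subprocess


def try_convert(pdf_path, command=("acroread", "-toPostScript"), timeout=60):
    """Run the converter on one PDF and describe how the process ended."""
    try:
        result = subprocess.run(
            [*command, pdf_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return "converter not found"
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode < 0:
        # On POSIX systems a negative return code -N means the process
        # was terminated by signal N (11 is SIGSEGV, a segfault).
        return f"killed by signal {-result.returncode}"
    if result.returncode != 0:
        return f"exit status {result.returncode}"
    return "ok"


# Hypothetical usage:
#
#   for name in ["wsipc.pdf", "pavail.pdf"]:
#       print(name, "->", try_convert(name))
```

A result of "ok" only says the process exited cleanly; it is still worth
looking at the PostScript output to see whether it contains any text.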

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
