According to Ted Stresen-Reuter:
> 1.
> On our intranet we have some pdf files that were made in adobe acrobat. The
> files contain hyperlinks to other files. My guess is that the pdf2html (or
> is it pdf2text) converter doesn't know how to follow links. Does anyone know
> of a product that does or am I relegated to listing each pdf individually if
> I want it to be indexed?

The usual external parser scripts make use of pdftotext, which comes with
the xpdf package. It only extracts plain text from the PDF documents.
I've been meaning to try pdftohtml (http://pdftohtml.sourceforge.net/)
but haven't yet had the chance. I don't know if it will extract hypertext
links from the PDFs, but it's worth a try. If you do try it, please let us
know how it goes. You may want to retrofit this tool into doc2html.pl, so
you get all the wrapper script handling of arguments and such.

If pdftohtml doesn't do it, I've found the following trick seems to find
links in PDFs, but without the description text, so you could try working
this into an external converter script:

    strings file.pdf | sed -n 's|^/URI (\(.*\)).*|<link href="\1">|p'

> 2.
> Our intranet is sprinkled with links back to the firm directory. For
> example, on each department's home page is a list of the staff that works in
> that department and a link back to each persons profile in the firm
> directory. Likewise, when viewing an individual's profile in the firm
> directory, you see a list of other members of the same department with links
> to their individual profiles as well. When I conduct a search on
> 'technology', expecting to see the Information Technology Home Page listed
> first (it is the title of the page, has Information Technology in the
> description and keywords and has an h1 tag at what is essentially the start
> of the page) and yet it appears at the end of the list with only one star.
> Each individual, however, is listed at the start of the list and with 5
> stars. Is this because there are far more pages that point to each
> individual's profile than there are that point to the Information Technology
> Home Page and if so, what do the developers of htdig recommend changing so
> that the home page comes up first?

Try lowering the value of the backlink_factor attribute (see
http://www.htdig.org/attrs.html#backlink_factor) to see if that helps
(no need to reindex).

Also, if the word "technology" appears in the link description text for
the links to any of the individuals' pages, that will greatly boost their
score if description_factor is still at the default value. If that's the
case, you can lower this factor too (you'll have to reindex if using the
3.1.x series), or change the descriptions and reindex.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
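
For anyone who wants to work the PDF link trick from the first answer into an external converter script, here is a minimal sketch of the same idea in Python. It is an illustration only, not part of ht://Dig or doc2html.pl. The names URI_PATTERN, extract_uris and as_link_tags are made up for this example. It assumes the link actions are stored uncompressed in the file, which is the same limitation the strings/sed approach has, and like that approach it does not recover the description text. Unlike the sed one-liner, it matches /URI anywhere rather than only at the start of a line, and it HTML-escapes the targets.

```python
import html
import re

# Hypothetical pattern: matches an uncompressed PDF link action such as
#   /URI (http://example.com/)
# and tolerates backslash escapes inside the parenthesised string.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def extract_uris(pdf_bytes):
    """Return link targets found in raw PDF bytes, in order, without duplicates."""
    found = []
    for match in URI_PATTERN.finditer(pdf_bytes):
        # Undo PDF string escapes for parentheses and backslashes only.
        raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
        uri = raw.decode("latin-1")
        if uri not in found:
            found.append(uri)
    return found


def as_link_tags(uris):
    """Format each URI as a <link href="..."> line, HTML-escaping the target."""
    return ['<link href="%s">' % html.escape(u, quote=True) for u in uris]
```

A converter script could read the PDF with open(path, "rb"), pass the bytes to extract_uris, and print the as_link_tags output alongside the text it gets from pdftotext.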

