Nutch has plugins for PDF, MS Word, JavaScript, HTML, plain text, and RSS. You can enable them in nutch-site.xml by copying the relevant section from nutch-default.xml and modifying it:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|js)|index-basic|query-(basic|site|url)</value>
  ...
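To make the role of the parse-* plugins concrete, here is a minimal sketch of what a parser produces: plain text for indexing, a list of outlinks, and metadata. It is only an illustration. ParseSketch, ParseResult, and parseHtml are hypothetical names, not the real Nutch classes or plugin interfaces, and the regex-based HTML handling is far cruder than what a real parser does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in types for illustration; NOT the Nutch plugin API.
public class ParseSketch {

    /** What a parser hands on: plain text, outgoing links, metadata. */
    static final class ParseResult {
        final String text;
        final List<String> outlinks;
        final Map<String, String> metadata;

        ParseResult(String text, List<String> outlinks, Map<String, String> metadata) {
            this.text = text;
            this.outlinks = outlinks;
            this.metadata = metadata;
        }
    }

    // Deliberately naive patterns; real HTML needs a real parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
        Pattern.compile("<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"",
                        Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static ParseResult parseHtml(String html) {
        // 1. Outlinks: every href value found in the markup.
        List<String> outlinks = new ArrayList<>();
        Matcher href = HREF.matcher(html);
        while (href.find()) {
            outlinks.add(href.group(1));
        }

        // 2. Metadata: <meta name="..." content="..."> pairs, keys lower-cased.
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            metadata.put(meta.group(1).toLowerCase(), meta.group(2));
        }

        // 3. Plain text: drop tags, collapse whitespace.
        String text = TAG.matcher(html).replaceAll(" ")
                         .replaceAll("\\s+", " ").trim();

        return new ParseResult(text, outlinks, metadata);
    }

    public static void main(String[] args) {
        String html =
            "<html><head><meta name=\"Author\" content=\"Jane Doe\"></head>"
            + "<body><p>Quarterly report</p>"
            + "<a href=\"http://intranet/next.pdf\">next</a></body></html>";
        ParseResult r = parseHtml(html);
        System.out.println(r.text);      // text that would be indexed
        System.out.println(r.outlinks);  // links that would be followed
        System.out.println(r.metadata);  // fields such as author
    }
}
```

A parser for another format (PDF, Word, Excel) would fill the same three slots from a different input. Making a field such as author searchable is then a matter for the indexing side.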
The parsers are selected by this part of the plugin.includes value: parse-(text|html|pdf|msword|js)

You can also write your own plugins (for Excel or PowerPoint, for example). As I understand it, the main purpose of a parser is to extract plain text for indexing, the Outlink[] array, and metadata. If your PDFs carry Author metadata, it should probably work...

I am currently working on Anchor-specific processing (indexing, analyzing). You also have access to the WebDB database, with its Link objects and the Anchor text of the links pointing to a Page. Check WebDBReader and WebDBAnchors. HTDIG can't do this...

-----Original Message-----
From: Valmir Macário [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 3:18 PM
To: [email protected]
Subject: Nuch capability

Can someone tell me whether I can index other types of documents, such as doc, pdf, ppt, xls...? And can I retrieve them from an intranet repository by characteristics that I choose, such as the author? I am unsure whether to use Nutch or htdig for this purpose. Can someone help me?
