Nutch has parse plugins for PDF, MS Word, JavaScript, HTML, plain text, and RSS.
You can enable them in nutch-site.xml by copying the relevant property
from nutch-default.xml and modifying it:
<property>
  <name>plugin.includes</name>
 
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|js)|index-basic|query-(basic|site|url)</value>
...

parse-(text|html|pdf|msword|js)
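
The value is a regular expression over plugin ids. Here is a minimal Java sketch of how such an expression selects plugins, assuming each plugin id is tested against the whole expression. The class name PluginIncludesDemo and the ids parse-msexcel and parse-mspowerpoint are made up for illustration; this is not the Nutch plugin loader itself.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {
    // Value copied from the plugin.includes property quoted above.
    static final String PLUGIN_INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|"
        + "parse-(text|html|pdf|msword|js)|index-basic|query-(basic|site|url)";

    // Assumption: a plugin is enabled when its id matches the whole expression.
    static boolean isEnabled(String pluginId) {
        return Pattern.matches(PLUGIN_INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        // The last two ids are hypothetical, used only to show a non-match.
        String[] ids = {"parse-pdf", "parse-msword", "index-basic",
                        "parse-msexcel", "parse-mspowerpoint"};
        for (String id : ids) {
            System.out.println(id + " -> " + isEnabled(id));
        }
    }
}
```

Under that assumption, enabling a new parse plugin means adding its suffix to the parse-(...) alternation.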

 
You may also write your own plugins (for example, for Excel or PowerPoint).
 
As I understand it, the main purpose of a parser is to extract plain text for
indexing, the outlinks (Outlink[]), and metadata. If your PDFs carry Author
metadata, this should probably work...
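
To make those three outputs concrete, here is a toy Java sketch that pulls text, outlinks, and metadata out of a small HTML string. ToyParser and ParseResult are hypothetical names, and the regex-based extraction is a simplification for illustration only; it does not use the Nutch parser API or reflect how the real plugins work.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyParser {
    // Hypothetical container for what a parser hands to the indexer.
    static final class ParseResult {
        final String text;                  // plain text for indexing
        final List<String> outlinks;        // URLs the document links to
        final Map<String, String> metadata; // e.g. author
        ParseResult(String text, List<String> outlinks, Map<String, String> metadata) {
            this.text = text;
            this.outlinks = outlinks;
            this.metadata = metadata;
        }
    }

    private static final Pattern LINK =
        Pattern.compile("<a\\s+href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
        Pattern.compile("<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"",
                        Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static ParseResult parse(String html) {
        // Collect the href of every anchor tag as an outlink.
        List<String> outlinks = new ArrayList<>();
        Matcher link = LINK.matcher(html);
        while (link.find()) {
            outlinks.add(link.group(1));
        }
        // Collect <meta name=... content=...> pairs, keyed by lower-cased name.
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            metadata.put(meta.group(1).toLowerCase(Locale.ROOT), meta.group(2));
        }
        // Strip tags and collapse whitespace to get the indexable text.
        String text = TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
        return new ParseResult(text, outlinks, metadata);
    }

    public static void main(String[] args) {
        ParseResult r = parse("<html><head><meta name=\"Author\" content=\"Jane Doe\"></head>"
            + "<body><p>Hello <a href=\"http://example.com/a.pdf\">report</a></p></body></html>");
        System.out.println(r.text);
        System.out.println(r.outlinks);
        System.out.println(r.metadata);
    }
}
```

A PDF or Word parser would produce the same three outputs from a different input format; whether the author field can be searched then depends on it being indexed.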
 
I am currently working on anchor-specific processing (indexing,
analysis). You also have access to the WebDB database, with its Link objects
and the anchor text of the links pointing to a Page.
 
Check WebDBReader, WebDBAnchors.

HTDIG can't do this...



-----Original Message-----
From: Valmir Macário [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 3:18 PM
To: [email protected]
Subject: Nuch capability


Can someone tell me whether I can index other types of documents, such as doc,
pdf, ppt, xls...? And can I retrieve them from an intranet repository by
characteristics that I choose, such as the author? I am unsure whether to use
Nutch or htdig for this purpose. Can someone help me?

