Hi Payo,

You need to add the right plugins to your Nutch configuration file. Here is an extract from my installation:
NUTCH_HOME\conf\nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|ontology|protocol-ftp|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|rtf|msword|js|mspowerpoint|msexcel|oo|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-lucene|scoring-opic</value>
  </property>
  ...

Using the above configuration, I am able to index text, HTML, PDF, Excel, etc. I am not sure about XML; I think there is already an enhancement request for this in JIRA.

I hope this helps,
Sergio

----- Original Message ----
From: payo <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, 19 October, 2007 4:16:20 PM
Subject: Re: Indexing documents

Goethe wrote:
>
> payo wrote:
>>
>> Hi
>>
>> my questions are
>>
>> 1.- Nutch can index documents PDF, HTML and XML?
>>
>> 2.- Nutch can index remote documents?
>>
>> thanks
>>
>
> Yes to both questions, and for the first question Nutch already comes with
> the plugins necessary to index those file types.
>

where i can obtain information on this?
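
P.S. If you want a quick way to see which parsers a given plugin.includes value would turn on, here is a rough Python sketch. It assumes, as far as I know, that Nutch treats plugin.includes as a regular expression that must match the whole plugin id. The function names, the closed-off sample file, and the "parse-xml" id are only illustrative; they are not part of Nutch.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the nutch-site.xml extract, closed off so it parses on its own.
# A real file will usually contain more properties (assumption for illustration).
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|ontology|protocol-ftp|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|rtf|msword|js|mspowerpoint|msexcel|oo|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-lucene|scoring-opic</value>
  </property>
</configuration>"""


def plugin_includes(site_xml: str) -> str:
    """Return the plugin.includes value found in a nutch-site.xml document."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    raise KeyError("plugin.includes is not set in this file")


def is_enabled(plugin_id: str, includes: str) -> bool:
    """Check a plugin id against the pattern (assumes whole-id matching)."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    includes = plugin_includes(SAMPLE_SITE_XML)
    # "parse-xml" is a hypothetical id, used only to show a non-matching case.
    for plugin_id in ("parse-pdf", "parse-msexcel", "parse-xml"):
        print(plugin_id, is_enabled(plugin_id, includes))
```

To check your own installation, read the contents of your nutch-site.xml into a string and pass that to plugin_includes instead of the sample.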
