Hi Benny,
Nutch is a generic web search engine with distributed file system support (NFS, for very large crawls and indexes) and a plugin framework. Because it has plugins, you can probably design and use a "FILE" plugin instead of "PROTOCOL-HTTP".

However, what about hyperlinks and anchors? Indexing the presentation layer of someone else's site is difficult and fuzzy. HTML carries formatting, and about 95% of the extracted plain text (select options, header, footer, menu, reviews, ...) does not really need to be indexed.

If you need to index local files, the best approach is to use Lucene directly, possibly with the org.apache.nutch.searcher package as a web front-end (if you really need a web front-end), especially if you have access to the data layer and can bypass presentation such as HTML. For all intranet-related tasks, use Lucene.

If you have a small amount of HTML, you can index your web server directly via HTTP without a performance impact; it's easy. But without any filtering logic, you will index everything.

Regards,
Fuad

-----Original Message-----
From: Benny [mailto:[EMAIL PROTECTED]
Sent: Monday, August 22, 2005 2:54 PM
To: [email protected]
Subject: Index local file.

Hi,

Can someone give me some hints on how to index local files? I have a lot of plain HTML files (more than 50K pages, around 2-3k/page). I would prefer not to put them on a web server and index them by URL. I'd like Nutch to index them from the local hard disk. Is that possible? If it is, what kind of URL do I need to inject into the db? For example, with a web server we use http://domain/file.html. What is the format for a file on the local hard disk? I assume it is no longer "http", so what is the protocol supposed to be? These files are still in plain HTML format.

Benny
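
On the question above about what a URL for a local file looks like: the standard scheme is "file:" rather than "http:". The sketch below is only an illustration, using nothing but the Java standard library, of how a directory of HTML pages could be turned into a list of file: URLs. The class name LocalSeedList and the method fileUrls are hypothetical, and whether a given Nutch setup accepts such URLs depends on having a file protocol plugin in place, which is an assumption here and not something confirmed in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: not part of Nutch or Lucene.
class LocalSeedList {

    // Walk a directory tree and return a file: URL for every .html/.htm file.
    static List<String> fileUrls(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".htm");
                })
                // Path.toUri() produces an absolute URI with the "file" scheme.
                .map(p -> p.toAbsolutePath().normalize().toUri().toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to scan is the first argument, else the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String url : fileUrls(root)) {
            System.out.println(url);
        }
    }
}
```

Running it against the directory that holds the pages prints one file: URL per line, which could then be saved as a plain-text URL list. Note that this only answers the URL format question; it does nothing about the filtering concern raised in the reply, so everything in each page would still be indexed.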
