Nutch version 0.5 is now available for download from:

  http://www.nutch.org/release/

Note that several file formats have changed incompatibly. If you are already using Nutch, it is highly recommended that you recreate all Nutch data from scratch.


There are many new features in this release, including:

1. Support for plugins, including:

a. Protocols for fetching documents. HTTP, FTP, and file protocol implementations are included.

b. Parsers for different document formats. HTML, plain text, Microsoft Word and PDF parser implementations are included.

c. HTML parse filters, which can extract metadata from the parsed HTML for subsequent indexing.

d. Indexing and query filters. These can be used to add new fields to the index and to search them. The previous default indexing and query behaviour is now provided through the index-basic and query-basic plugins. An indexing filter and a query filter are also provided which implement site search, using the "site:" query keyword.

2. Automatic language identification.

The languageidentifier plugin defines an HTML parser filter which analyzes pages to determine which language they are written in. Indexing and query filters are then provided to permit search by language code. For example, adding "lang:de" to a query constrains results to those in German.

3. Creative Commons license metadata searching.

The creativecommons plugin defines an HTML parse filter that extracts Creative Commons license data from pages. It also defines indexing and query filters to permit search of the extracted metadata.

4. Excessive site hit elimination.

By default, a maximum of two hits from a site are displayed. When a site has more hits than this limit, its displayed hits have a "more hits from site" link.
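
The per-site limit described above can be sketched roughly as follows. This is an illustrative Python sketch, not Nutch's own (Java) code: the function name limit_hits_per_site, the max_per_site parameter, and the choice to treat the URL's host name as the "site" are all assumptions made for the example.

```python
from urllib.parse import urlparse


def limit_hits_per_site(hits, max_per_site=2):
    """Keep at most max_per_site hits per site, preserving rank order.

    hits is a list of result URLs, best first (an assumed input shape).
    Returns (shown, sites_with_more): the hits to display, and the set of
    sites whose displayed hits would get a "more hits from site" link.
    """
    counts = {}              # site -> number of hits seen so far
    shown = []               # hits that survive the limit
    sites_with_more = set()  # sites that had hits suppressed

    for url in hits:
        # Assumption: a "site" is identified by the URL's host name.
        site = urlparse(url).hostname or ""
        counts[site] = counts.get(site, 0) + 1
        if counts[site] <= max_per_site:
            shown.append(url)
        else:
            sites_with_more.add(site)

    return shown, sites_with_more
```

With max_per_site left at 2, a third hit from the same host is dropped from the displayed list and that host is recorded, so the two hits that remain can be rendered with the extra link.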

5. Segment merging tool.

A tool has been added to merge multiple segment directories into a single segment directory.


Numerous other changes were made and bugs were fixed. More details can be found in http://www.nutch.org/CHANGES.txt.


Enjoy!

Doug




