Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features

The comment on the change is:
Removed descriptions about charset as they were not accurate

------------------------------------------------------------------------------
   *How does the search engine handle punctuation and special characters? (and what's configurable?)
   *Which document formats are supported?
    * Guessing from the names of the available parser plugins, this is probably it. However, only the plain text and HTML are enabled by default. Edit conf/nutch-site.xml and change the value of plugin.includes property to include the plugins for the document types that you want Nutch to handle:
-     * Plain Text (in a fixed preconfigured charset only) (plugin: parse-text)
-     * HTML (in most any charsets) (parse-html)
+     * Plain Text (plugin: parse-text)
+     * HTML (parse-html)
      * JavaScript (for extracting links only?) (parse-js)
      * Microsoft Power Point, the .ppt file (parse-mspowerpoint)
      * Microsoft Word, the .doc file (parse-msword)
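
For illustration only, the plugin.includes change described in the page text might look roughly like the fragment below. This is a sketch, not the documented configuration: it assumes the property value is a regular expression matched against plugin names, and every entry other than the parse-* names listed above (protocol-http, urlfilter-regex, index-basic, query-*) is an assumed placeholder. Keep whatever non-parser entries your own installation already lists and only extend the parse-(...) group.

```xml
<!-- Hypothetical fragment for conf/nutch-site.xml. The value is an assumed
     example and is not copied from any particular Nutch release. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: the value is a regular expression over plugin names.
       The parse-(...) group extends the default parse-text and parse-html
       with parse-js, parse-msword and parse-mspowerpoint from the list above.
       All other entries are placeholders; reuse the ones already configured. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
</property>
```

The property element would sit inside the existing root element of nutch-site.xml; the name of that root element depends on the Nutch version, so it is left out here.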
