The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features

------------------------------------------------------------------------------
   *How does the search engine handle punctuation and special characters? (and what's configurable?)
   *Which document formats are supported?
+   * Guessing from the names of the available parser plugins, this is probably the list:
+    * Plain text (in one fixed, preconfigured charset only)
+    * HTML (in almost any charset)
+    * JavaScript (for extracting links only?)
+    * Microsoft PowerPoint (.ppt files)
+    * Microsoft Word (.doc files)
+    * Adobe PDF
+    * RSS
+    * RTF
+    * MP3 (?) Is there any text in an MP3 file?
+    * ZIP (?) This seems to expand a ZIP archive of plain-text files and return the concatenated text.
+
   *What post-coordination options are available? (hey Karen, what does this mean?)
   *How easy is Nutch to configure?
