Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Features" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Features?action=diff&rev1=20&rev2=21

  == Questions and Answers ==
- *What kind of searches does Nutch support? (quoted, nested, truncation, wildcarding [and where], Boolean),
- * "...." (phrase search?), + (what is this for?), - (negation) and fieldname:term. No "AND" or "OR". The and-logic is implied.
-
- *Is stemming an option?
- * According to the [[http://www.lucenebook.com/|Lucene in Action]] book: "Nutch does not use stemming or term aliasing of any kind. Search engines have not historically done much stemming, but it is a question that comes up regularly." -- page 329
-
- *What kind of stemming does Nutch use? (and can you add exceptions/changes?)
- * See previous answer :)
-
- *Does Nutch support Boolean operators? (can you use Google-like plus or minus or are you stuck with 1990s terms?)
- * No
-
  *How does the search engine handle punctuation and special characters? (and what's configurable?)
  * They are treated like a space.
  *Which document formats are supported?
- * Guessing from the names of the available parser plugins, this is probably it. However, only the plain text and HTML are enabled by default. Edit conf/nutch-site.xml and change the value of plugin.includes property to include the plugins for the document types that you want Nutch to handle:
+ * This is directly linked to the available parser plugins mentioned above, however only some are enabled by default as most of the parsing is now delegated to Tika in an attempt to clean up the Nutch codebase. Edit conf/nutch-site.xml and change the value of plugin.includes property to include the plugins for the document types that you want Nutch to handle. Additionally have a look at conf/parse-plugins.xml for more details of plugin implementations.
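
For illustration, the conf/nutch-site.xml edit described in the added answer might look roughly like the fragment below. This is only a sketch: the property name plugin.includes and the parser plugin names (parse-html, parse-tika, parse-js, parse-zip, feed) are taken from the page, while every other entry in the value is an assumed placeholder and should be copied from the plugin.includes default in your own conf/nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the plugin.includes default from conf/nutch-default.xml. -->
  <!-- The value is a regular expression matched against plugin ids. -->
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: all non-parser entries below are placeholders; keep -->
    <!-- whatever your nutch-default.xml already lists for them. -->
    <!-- parse-(html|tika|js|zip) and feed enable the parsers named on the page. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip)|feed|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Listing only the parsers you need keeps the set of loaded plugins small; which plugin handles which content type is then decided by conf/parse-plugins.xml, as the answer notes.
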
To recap:
- * Plain Text (plugin: parse-text)
+ * Plain Text (plugin: tika)
- * HTML (parse-html)
+ * HTML/XHTML+XML (parse-html/tika)
- * XML (parse-xml) uses XPath and namespaces to do the mapping between XML elements and Lucene fields.
+ * XML (parse-Tika/feed) uses XPath and namespaces to do the mapping between XML elements and index fields.
  * Java``Script (for extracting links only?) (parse-js)
- * OpenOfice.org ODF (parse-oo) parses Open Office and Star Office documents.
+ * OpenOfice.org ODF (parse-tika) parses Open Office and Star Office documents.
- * Microsoft Power Point, the .ppt file (parse-mspowerpoint)
+ * Microsoft Power Point, the .ppt file (parse-tika)
- * Microsoft Word, the .doc file (parse-msword)
+ * Microsoft Word, the .doc file (parse-tika)
- * Adobe PDF (parse-pdf)
+ * Adobe PDF (parse-tika)
- * RSS (parse-rss)
+ * RSS (parse-feed/tika)
- * RTF (parse-rtf)
+ * RTF (parse-tika)
- * MP3 (?) Is there any text in MP3? (parse-mp3) (JR: Sure, the mp3 itself contains the ID3v1 or ID3v2 tags which contain song information like
+ * MP3 (parse-tika) The mp3 itself contains the ID3v1 or ID3v2 tags which contain metadata song information like
- title, artist, album, comments, etc. The useful information needed to search mp3s)
+ title, artist, album, comments, etc. The useful information needed to search mp3s
- * ZIP (?) This seems to expand the zip of plain text files and return the concatenated text. (parse-zip)
+ * ZIP (parse-zip) This seems to expand the zip of plain text files and return the concatenated text.
- == Questions without Answers ==

