I'm attempting to crawl a single Samba-mounted share. During testing, I'm crawling like this:
./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20

I'm using Luke 0.6 to query and analyze the index.

PROBLEMS

1.) Search by file type doesn't work.
I expected that a search by file type (e.g. "type:pdf") would return a list of files on the local filesystem, but it does not.

2.) Invalid Nutch file type detection.
I see the following in the hadoop.log:
---------
2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/acta.zip of type application/zip
2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at 1024000 bytes. Parser can't handle incomplete pdf file.
---------
acta.zip is a .zip file, not a .pdf, so I have no idea why this happens. Oddly, the truncation is reported at 1024000 bytes even though file.content.limit is set to 102400000 (see ENVIRONMENT below).

3.) Why is the TextParser mapped to application/pdf, and what does that have to do with indexing a .txt file? (See the parse-plugins.xml sketch below.)
---------
2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching file:/mnt/bobdocs/popis-vg-procisceni.txt
2006-08-30 15:12:02,916 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType application/pdf via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/pdf
---------

4.) Some .doc files can't be indexed, although I can open them in OpenOffice 2 with no problems.
---------
2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/cards2005.doc of type application/msword
2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as micrsosoft document. java.lang.StringIndexOutOfBoundsException: String index out of range: -1024
---------

5.) MoreIndexingFilter doesn't seem to work.
The relevant part of the hadoop.log file:
---------
2006-08-30 15:13:40,235 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty
---------
This happens with other file types as well:
---------
2006-08-30 15:13:54,697 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty
---------

6.) At the moment I'm crawling the same directory (/mnt/bobdocs); the crawl process seems to be stuck in an infinite loop, and I have no way of knowing what's going on because the log isn't flushed to disk until the process finishes (see the log4j sketch below).

ENVIRONMENT

My (relevant) crawl settings are:
---------
<name>db.max.anchor.length</name> <value>511</value>
<name>db.max.outlinks.per.page</name> <value>-1</value>
<name>fetcher.server.delay</name> <value>0</value>
<name>fetcher.threads.fetch</name> <value>5</value>
<name>fetcher.verbose</name> <value>true</value>
<name>file.content.limit</name> <value>102400000</value>
<name>parser.character.encoding.default</name> <value>iso8859-2</value>
<name>indexer.max.title.length</name> <value>511</value>
<name>indexer.mergeFactor</name> <value>5</value>
<name>indexer.minMergeDocs</name> <value>5</value>
<name>plugin.includes</name> <value>nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
<name>searcher.max.hits</name> <value>100</value>
---------
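For reference, each name/value pair above lives in a standard <property> block in my nutch-site.xml. A minimal sketch for one of the values listed (the description text here is my own paraphrase, not the stock one shipped with Nutch):
---------
<!-- nutch-site.xml: one property block per setting; the description
     below is paraphrased, not the stock Nutch text -->
<property>
  <name>file.content.limit</name>
  <value>102400000</value>
  <description>Maximum number of bytes fetched per file; longer content
  is truncated. A negative value should disable truncation.</description>
</property>
---------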
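Regarding problem 3: the mapping the warning complains about comes from conf/parse-plugins.xml. A sketch of what I'd expect a sane mapping to look like, assuming the Nutch 0.8 file format and the stock plugin ids:
---------
<parse-plugins>
  <!-- each mimeType element maps a content type to the parser
       plugin(s) that should handle it -->
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
  <mimeType name="text/plain">
    <plugin id="parse-text" />
  </mimeType>
</parse-plugins>
---------
Given that the warning says TextParser was mapped to application/pdf via parse-plugins.xml, checking whether the application/pdf entry in that file lists parse-text would presumably confirm where the bogus mapping comes from.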
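Regarding problem 6 and the second suggestion below: log flushing should be controllable from conf/log4j.properties. A sketch, assuming the stock file with a DRFA (DailyRollingFileAppender) file appender; ImmediateFlush and BufferedIO are standard log4j 1.x FileAppender options:
---------
# force the file appender to flush on every log event
log4j.appender.DRFA.ImmediateFlush=true
# make sure the appender doesn't buffer writes
log4j.appender.DRFA.BufferedIO=false
---------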
SUGGESTIONS

Add configuration options to the nutch-*.xml files that:
* allow search by date or extension alone (with no other criteria)
* always flush the log to disk (on every log addition; see the log4j sketch above)

TIA, t.n.a.