I'm attempting to crawl a single Samba-mounted share. During testing,
I'm crawling like this:

./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
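
where the urls directory holds a single seed file with one line pointing
at the mount, something like this (reconstructed from memory, not pasted
verbatim):

---------
file:/mnt/bobdocs/
---------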

I'm using Luke 0.6 to query and analyze the index.

PROBLEMS

1.) search by file type doesn't work
I expected a search for "file type:pdf" to return a list of files on
the local filesystem, but it does not.
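
If I read the index-more plugin correctly, the field it writes is named
"type", so the raw Lucene queries to try in Luke would be something like
the following (the field name is my assumption from the plugin, not
something I've verified):

---------
type:pdf
type:application/pdf
---------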

2.) wrong parser selected for a correctly detected file type
I see the following in the hadoop.log:
-----------
2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
parse content file:/mnt/bobdocs/acta.zip of type application/zip
2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
1024000 bytes. Parser can't handle incomplete pdf file.
-----------
acta.zip is a .zip file, not a .pdf (and the first warning shows it was
correctly detected as application/zip), so I have no idea why the PDF
parser is the one complaining. Oddly, the log also reports truncation
at 1024000 bytes, even though file.content.limit is set to 102400000
below.

3.) Why is the TextParser mapped to application/pdf, and what does that
have to do with indexing a .txt file?
---------
2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
file:/mnt/bobdocs/popis-vg-procisceni.txt
2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
contentType application/pdf via parse-plugins.xml, but its plugin.xml
file does not claim to support contentType: application/pdf
---------
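
For reference, I'd expect the relevant mappings in
conf/parse-plugins.xml to look roughly like this (paraphrased from
memory, not copied from my file):

---------
<parse-plugins>
  <!-- each mimeType maps a content type to the plugin that parses it -->
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
  <mimeType name="text/plain">
    <plugin id="parse-text" />
  </mimeType>
</parse-plugins>
---------

so I can't see where TextParser would pick up an application/pdf
mapping.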

4.) Some .doc files can't be indexed, although I can open them in
OpenOffice 2 with no problems
---------
2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
parse content file:/mnt/bobdocs/cards2005.doc of type
application/msword
2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
index out of range: -1024
---------

5.) MoreIndexingFilter doesn't seem to work
If the filter fails like this, the type field presumably never gets
written, which would also explain problem 1. The relevant part of the
hadoop.log file:
---------
2006-08-30 15:13:40,235 WARN  more.MoreIndexingFilter -
file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException:
The type can not be null or empty
---------
This happens with other file types as well:
---------
2006-08-30 15:13:54,697 WARN  more.MoreIndexingFilter -
file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException:
The type can not be null or empty
---------

6.) At the moment I'm re-crawling the same directory (/mnt/bobdocs).
The crawl process seems to be stuck in an infinite loop, and I have no
way of knowing what's going on, because the log isn't flushed until the
process finishes.
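
As a workaround I've been looking at forcing the log appender to flush
after every event. Assuming the stock conf/log4j.properties names the
rolling file appender DRFA (adjust the name if your copy differs), that
would be:

---------
# conf/log4j.properties - flush every event straight to hadoop.log
log4j.appender.DRFA.ImmediateFlush=true
log4j.appender.DRFA.BufferedIO=false
---------

Both are standard log4j FileAppender options; ImmediateFlush is supposed
to default to true anyway, so if this changes nothing, the buffering
must be happening somewhere else.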


ENVIRONMENT

My (relevant) crawl settings are:

---------
 <name>db.max.anchor.length</name>
 <value>511</value>

 <name>db.max.outlinks.per.page</name>
 <value>-1</value>

 <name>fetcher.server.delay</name>
 <value>0</value>

 <name>fetcher.threads.fetch</name>
 <value>5</value>

 <name>fetcher.verbose</name>
 <value>true</value>

 <name>file.content.limit</name>
 <value>102400000</value>

 <name>parser.character.encoding.default</name>
 <value>iso8859-2</value>

 <name>indexer.max.title.length</name>
 <value>511</value>

 <name>indexer.mergeFactor</name>
 <value>5</value>

 <name>indexer.minMergeDocs</name>
 <value>5</value>

 <name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>

 <name>searcher.max.hits</name>
 <value>100</value>
---------
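
Each pair above actually sits inside a <property> element in my
nutch-site.xml; I stripped the wrappers for brevity. In full form an
entry reads:

---------
<property>
  <name>file.content.limit</name>
  <value>102400000</value>
</property>
---------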


MISC. SUGGESTIONS

Add the following as configuration options in the nutch-*.xml files:
* allow searching by date or extension alone, with no other criteria
* always flush the log to disk, at every log addition (see the sketch
below).
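
For the second one, I'm imagining something along these lines (a
hypothetical property, not one that exists today):

---------
<property>
  <!-- hypothetical: proposed option, not currently in nutch-default.xml -->
  <name>log.immediate.flush</name>
  <value>true</value>
  <description>Flush the log file to disk after every log event.</description>
</property>
---------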

TIA,
t.n.a.
