Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

Tomi NA Thu, 31 Aug 2006 02:28:00 -0700

On 8/30/06, Chris Mattmann <[EMAIL PROTECTED]> wrote:

Hi there Tomi,

On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:

> I'm attempting to crawl a single samba mounted share. During testing,
> I'm crawling like this:
>
> ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
>
> I'm using luke 0.6 to query and analyze the index.
>
> PROBLEMS
>
> 1.) search by file type doesn't work
> I expected that a search "file type:pdf" would have returned a list of
> files on the local filesystem, but it does not.

I believe that the keyword is "type", so your query should be "type:pdf"
(without the quotes). I'm not certain, but I believe you may also have to
give the fully qualified mimeType, as in "application/pdf", so you should
experiment with both forms.


I should have emphasized that the string I queried with is without the
quotes. The "file" keyword was used because all the entries are
accessible via "file://"-type links and so searching only for "file"
would return all files. Filtering by type would then return all files
of the given type.
I tried the following query:
url:file type:application/pdf
but it seems I get the same set of hits regardless of what I use as
type, so if I search for "url:file type:application/pdf" I get the
same results as searching for "url:file type:whatever".

Additionally, in order for the mimeTypes to be indexed properly, you need to
have the index-more plugin enabled. Check your
$NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes"
and make sure that the index-more plugin is enabled there.


I listed my nutch-site settings at the end of my mail: the index-more
plugin is enabled.
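
For anyone who wants to double-check this on their own setup, here is a
minimal sketch in Python that reads a nutch-site.xml and tests whether a
plugin id is covered by "plugin.includes". It assumes the property value is
a regular expression matched against the whole plugin id; the function name
plugin_enabled and the sample value are hypothetical, not taken from my
configuration. If the property is missing from nutch-site.xml, the value in
nutch-default.xml applies, which this sketch does not read.

```python
import re
import xml.etree.ElementTree as ET


def plugin_enabled(site_xml_text, plugin_id):
    """Return True if plugin_id matches the plugin.includes regex (assumed full match)."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            pattern = prop.findtext("value", "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    return False  # property not set in this file


# Hypothetical sample value, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|parse-(text|pdf|msword)|index-(basic|more)</value>
  </property>
</configuration>"""

print(plugin_enabled(SAMPLE, "index-more"))
```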

> 2.) invalid nutch file type detection
> I see the following in the hadoop.log:
> -----------
> 2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/acta.zip of type application/zip
> 2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> 1024000 bytes. Parser can't handle incomplete pdf file.
> -----------
> acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

This may result from the contentType returned by the web server for
"acta.zip". Check the web server that the file is hosted on, and see what
the server responds for the contentType for that file.

Additionally, you may want to check if magic is enabled for mimeTypes. This
allows the mimeType to be detected by comparing known byte signatures
("magic numbers") with the beginning of each file.


I have mime.type.magic set to true. The files I index are served via
samba over the LAN rather than via a web server, so no, it's not a
problem of contentType.
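
To rule out a mislabeled file independently of Nutch, the leading bytes can
be checked by hand. The following is a minimal sketch in Python of
magic-number detection in general, not Nutch's implementation; the names
SIGNATURES, sniff_mime and sniff_file are hypothetical, and the table lists
only the three formats mentioned in this thread.

```python
# Well-known leading byte signatures; a real detector knows many more.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    # OLE2 container: used by legacy .doc, but also by .xls and .ppt.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
]


def sniff_mime(head):
    """Return the mime type whose signature prefixes `head`, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def sniff_file(path, nbytes=8):
    """Read the first `nbytes` bytes of a file and sniff its type."""
    with open(path, "rb") as f:
        return sniff_mime(f.read(nbytes))
```

Calling sniff_file on a file such as /mnt/bobdocs/acta.zip would show
whether its content really starts with the zip signature or with "%PDF-".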

> 3.) Why is the TextParser mapped to application/pdf and what does that
> have to do with indexing a .txt file?
> ---------
> 2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
> file:/mnt/bobdocs/popis-vg-procisceni.txt
> 2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> contentType application/pdf via parse-plugins.xml, but its plugin.xml
> file does not claim to support contentType: application/pdf
> ---------

The TextParser *was* enabled as a last-resort means of extracting
...


I understand, thanks. Still don't know what threw the pdf-parser off, though.

> 4.) Some .doc files can't be indexed, although I can open them via
> openoffice 2 with no problems
> ---------
> 2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/cards2005.doc of type
> application/msword
> 2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> index out of range: -1024
> ---------

What version of MS Word were you trying to index? I believe that the POI
library used by the word parser can only handle certain versions of MS Word
documents, although I'm not positive about this.


Oh, so POI doesn't use the same technology OO.org uses to access MS
Office created docs? That's a shame... :(
So, does anyone know which Word versions it supports?

As for 5 and 6 I'm not entirely sure about those problems. I wish you luck
in solving both of them though, and hope what I said above helps you out.


Thanks for the effort, Chris. I know a little more, but still have a
long way to go.
Does anyone else know anything about the unsolved problems I'm facing?

t.n.a.
