Re: [Nutch-general] intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

Tomi NA Thu, 31 Aug 2006 02:28:51 -0700

On 8/30/06, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> Hi there Tomi,
>
>
> On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:
>
> > I'm attempting to crawl a single samba mounted share. During testing,
> > I'm crawling like this:
> >
> > ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
> >
> > I'm using luke 0.6 to query and analyze the index.
> >
> > PROBLEMS
> >
> > 1.) search by file type doesn't work
> > I expected that a search "file type:pdf" would have returned a list of
> > files on the local filesystem, but it does not.
>
> I believe that the keyword is "type", so your query should be "type:pdf"
> (without the quotes). I'm not positive about this either, but I believe you
> have to give the fully qualified mimeType, as in "application/pdf". Not
> definitely sure about that though so you should experiment.


I should have made clear that the query I used did not include the
quotes. I used the "file" keyword because all the entries are
reachable via "file://"-type links, so searching for "file" alone
should return all files. Filtering by type should then narrow that
down to the files of the given type.
I tried the following query:
url:file type:application/pdf
but it seems I get the same set of hits regardless of what I use as
type, so if I search for "url:file type:application/pdf" I get the
same results as searching for "url:file type:whatever".

> Additionally, in order for the mimeTypes to be indexed properly, you need to
> have the index-more plugin enabled. Check your
> $NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes"
> and make sure that the index-more plugin is enabled there.

I listed my nutch-site settings at the end of my mail: the index-more
plugin is enabled.
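
For anyone who wants to double-check this on their own setup, here is a
minimal sketch in Python. It assumes that plugin.includes holds a regular
expression that a plugin id has to match in full, and that the simple
patterns typically used there behave the same in Python's re module as in
Java. The SAMPLE text and the function name plugin_enabled are made up for
illustration; they are not my actual settings.

```python
import re
import xml.etree.ElementTree as ET


def plugin_enabled(site_xml_text, plugin_id):
    """Return True if plugin_id fully matches the plugin.includes regex."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            pattern = prop.findtext("value", "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    # Property not overridden in this file, so nothing to match against.
    return False


# Hypothetical nutch-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)</value>
  </property>
</configuration>"""

print(plugin_enabled(SAMPLE, "index-more"))
```

To run it against a real file, read $NUTCH_HOME/conf/nutch-site.xml into a
string and pass that instead of SAMPLE. Other plugin ids can be checked the
same way.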

> > 2.) invalid nutch file type detection
> > I see the following in the hadoop.log:
> > -----------
> > 2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
> > parse content file:/mnt/bobdocs/acta.zip of type application/zip
> > 2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
> > file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> > 1024000 bytes. Parser can't handle incomplete pdf file.
> > -----------
> > acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.
>
> This may result from the contentType returned by the web server for
> "acta.zip". Check the web server that the file is hosted on, and see what
> the server responds for the contentType for that file.
>
> Additionally, you may want to check if magic is enabled for mimeTypes. This
> allows the mimeType to be sensed through the use of hex codes compared with
> the beginning of each file.

I have mime.type.magic set to true. The files I index are served via
samba over the LAN rather than via a web server, so no, it's not a
problem of contentType.
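
To rule out the files themselves, the leading bytes can be inspected by
hand. The sketch below is not how Nutch does its magic detection; it only
compares the start of a file against three well-known signatures (PDF, ZIP
and the OLE2 container used by older MS Office files). The names SIGNATURES,
sniff and sniff_file, and the short labels they return, are my own.

```python
# Leading-byte signatures for three common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    # OLE2 compound file: the container used by .doc, .xls and .ppt.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]


def sniff(header):
    """Return a short label for the format whose signature starts header."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path, n=8):
    """Read the first n bytes of path and classify them with sniff()."""
    with open(path, "rb") as f:
        return sniff(f.read(n))
```

Calling sniff_file("/mnt/bobdocs/acta.zip") would show whether the file
really starts with the ZIP signature, and the same call on a .doc file
shows whether it is an OLE2 container at all.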

> > 3.) Why is the TextParser mapped to application/pdf and what does that
> > have to do with indexing a .txt file?
> > ---------
> > 2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
> > file:/mnt/bobdocs/popis-vg-procisceni.txt
> > 2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
> > ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> > contentType application/pdf via parse-plugins.xml, but its plugin.xml
> > file does not claim to support contentType: application/pdf
> > ---------
>
> The TextParser * was * enabled as a last resort sort of means of extracting
> ...

I understand, thanks. Still don't know what threw the pdf-parser off, though.
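
One thing the log in problem 2 does state is that content was truncated at
1024000 bytes. That does not explain the pdf mention, but it makes it worth
knowing which files on the share are larger than that. A minimal sketch,
assuming the share is mounted locally; LIMIT is the number from the log
line and the function name oversized is my own:

```python
import os

LIMIT = 1024000  # byte count taken from the "Content truncated at" log line


def oversized(root, limit=LIMIT):
    """Yield (path, size) for files under root that are larger than limit."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if size > limit:
                yield path, size
```

Running list(oversized("/mnt/bobdocs")) gives the files that would be cut
off at that size.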

> > 4.) Some .doc files can't be indexed, although I can open them via
> > openoffice 2 with no problems
> > ---------
> > 2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
> > parse content file:/mnt/bobdocs/cards2005.doc of type
> > application/msword
> > 2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
> > file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> > > micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> > > index out of range: -1024
> > ---------
>
> What version of MS Word were you trying to index? I believe that the POI
> library used by the word parser can only handle certain versions of MS Word
> documents, although I'm not positive about this.

Oh, so POI doesn't use the same technology OO.org uses to read docs
created in MS Office? That's a shame... :(
So, does anyone know which Word versions it supports?

> As for 5 and 6 I'm not entirely sure about those problems. I wish you luck
> in solving both of them though, and hope what I said above helps you out.

Thanks for the effort, Chris. I know a little more, but still have a
long way to go.
Does anyone else know anything about the unsolved problems I'm facing?

t.n.a.

