On 8/30/06, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> Hi there Tomi,
>
> On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:
>
> > I'm attempting to crawl a single samba mounted share. During testing,
> > I'm crawling like this:
> >
> > ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
> >
> > I'm using luke 0.6 to query and analyze the index.
> >
> > PROBLEMS
> >
> > 1.) search by file type doesn't work
> > I expected that a search "file type:pdf" would have returned a list of
> > files on the local filesystem, but it does not.
>
> I believe that the keyword is "type", so your query should be "type:pdf"
> (without the quotes). I'm not positive about this either, but I believe you
> have to give the fully qualified mimeType, as in "application/pdf". Not
> definitely sure about that though so you should experiment.

I should have emphasized that the string I queried with is without the
quotes. The "file" keyword was used because all the entries are
accessible via "file://"-type links and so searching only for "file"
would return all files. Filtering by type would then return all files
of the given type.
I tried the following query:

url:file type:application/pdf

but it seems I get the same set of hits regardless of what I use as
type, so if I search for "url:file type:application/pdf" I get the same
results as searching for "url:file type:whatever".

> Additionally, in order for the mimeTypes to be indexed properly, you need to
> have the index-more plugin enabled. Check your
> $NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes"
> and make sure that the index-more plugin is enabled there.

I listed my nutch-site settings at the end of my mail: the index-more
plugin is enabled.

> > 2.) invalid nutch file type detection
> > I see the following in the hadoop.log:
> > -----------
> > 2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully
> > parse content file:/mnt/bobdocs/acta.zip of type application/zip
> > 2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing:
> > file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> > 1024000 bytes. Parser can't handle incomplete pdf file.
> > -----------
> > acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.
>
> This may result from the contentType returned by the web server for
> "acta.zip". Check the web server that the file is hosted on, and see what
> the server responds for the contentType for that file.
>
> Additionally, you may want to check if magic is enabled for mimeTypes. This
> allows the mimeType to be sensed through the use of hex codes compared with
> the beginning of each file.

I have mime.type.magic set to true. The files I index are served via
samba over the LAN rather than via a web server, so no, it's not a
problem of contentType.

> > 3.) Why is the TextParser mapped to application/pdf and what does that
> > have to do with indexing a .txt file?
> > ---------
> > 2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching
> > file:/mnt/bobdocs/popis-vg-procisceni.txt
> > 2006-08-30 15:12:02,916 WARN parse.ParserFactory -
> > ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> > contentType application/pdf via parse-plugins.xml, but its plugin.xml
> > file does not claim to support contentType: application/pdf
> > ---------
>
> The TextParser *was* enabled as a last resort sort of means of extracting
> ...

I understand, thanks. Still don't know what threw the pdf-parser off, though.

> > 4.) Some .doc files can't be indexed, although I can open them via
> > openoffice 2 with no problems
> > ---------
> > 2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully
> > parse content file:/mnt/bobdocs/cards2005.doc of type
> > application/msword
> > 2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing:
> > file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> > micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> > index out of range: -1024
> > ---------
>
> What version of MS Word were you trying to index? I believe that the POI
> library used by the word parser can only handle certain versions of MS Word
> documents, although I'm not positive about this.

Oh, so POI doesn't use the same technology OO.org uses to access MS
Office created docs? That's a shame... :(
So, does anyone know which Word versions it supports?

> As for 5 and 6 I'm not entirely sure about those problems. I wish you luck
> in solving both of them though, and hope what I said above helps you out.

Thanks for the effort, Chris. I know a little more, but still have a
long way to go.
Does anyone else know anything about the unsolved problems I'm facing?

t.n.a.
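
A note on problems 2 and 4: since the files are read through file:// URLs
and no server-supplied contentType is involved, one way to narrow things
down is to check independently what the leading bytes of each file look
like. The following is only a minimal standalone sketch in Python, not
Nutch's own detection code. It assumes the well-known signatures "%PDF-"
for PDF, "PK\x03\x04" for ZIP, and the OLE2 header used by legacy MS
Office files; the function name sniff_type and the labels it returns are
hypothetical.

```python
import sys

# Well-known leading byte signatures. These are general file-format facts,
# NOT values copied from Nutch's mime-type configuration.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound document"),
]


def sniff_type(path, default="unknown"):
    """Guess a file's type from its first bytes (hypothetical helper)."""
    with open(path, "rb") as fh:
        head = fh.read(8)  # the longest signature above is 8 bytes
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return default


if __name__ == "__main__":
    # Print one "<path> -> <guessed type>" line per command-line argument.
    for name in sys.argv[1:]:
        print(name, "->", sniff_type(name))
```

Saved under a name of your choosing and run against
/mnt/bobdocs/acta.zip and /mnt/bobdocs/cards2005.doc, this would show
whether the bytes on disk agree with the file extensions. It says nothing
about why the pdf parser was selected; it only rules out mislabeled files.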
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
