Hi there Tomi,
On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:

> I'm attempting to crawl a single samba mounted share. During testing,
> I'm crawling like this:
>
> ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
>
> I'm using luke 0.6 to query and analyze the index.
>
> PROBLEMS
>
> 1.) search by file type doesn't work
> I expected that a search "file type:pdf" would have returned a list of
> files on the local filesystem, but it does not.

I believe that the keyword is "type", so your query should be "type:pdf"
(without the quotes). I'm not positive about this, but I believe you also
have to give the fully qualified mimeType, as in "application/pdf". I'm
not definitely sure about that though, so you should experiment.

Additionally, in order for the mimeTypes to be indexed properly, you need
to have the index-more plugin enabled. Check your
$NUTCH_HOME/conf/nutch-site.xml, look for the property "plugin.includes",
and make sure that the index-more plugin is enabled there.

> 2.) invalid nutch file type detection
> I see the following in the hadoop.log:
> -----------
> 2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/acta.zip of type application/zip
> 2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> 1024000 bytes. Parser can't handle incomplete pdf file.
> -----------
> acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

This may result from the contentType returned by the web server for
"acta.zip". Check the web server that the file is hosted on, and see what
contentType the server responds with for that file. Additionally, you may
want to check whether magic is enabled for mimeTypes. This allows the
mimeType to be sensed by comparing hex codes against the beginning of
each file.

> 3.) Why is the TextParser mapped to application/pdf and what has that
> have to do with indexing a .txt file?
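Coming back to problems 1 and 2 for a second: the relevant nutch-site.xml
entries would look roughly like the sketch below. The plugin list here is
trimmed down for illustration, and I'm going from memory on the
mime.type.magic property name, so verify both against your
nutch-default.xml before relying on them.

```xml
<!-- index-more must appear in plugin.includes for mime types to be
     indexed; this value is abbreviated for illustration -->
<property>
  <name>plugin.includes</name>
  <value>protocol-(file|http)|urlfilter-regex|parse-(text|html|pdf)|index-(basic|more)|query-(basic|more)</value>
</property>

<!-- enable magic-number (byte signature) mime type detection, so the
     type is sensed from file contents rather than trusted blindly -->
<property>
  <name>mime.type.magic</name>
  <value>true</value>
</property>
```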
> ---------
> 2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching
> file:/mnt/bobdocs/popis-vg-procisceni.txt
> 2006-08-30 15:12:02,916 WARN parse.ParserFactory -
> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> contentType application/pdf via parse-plugins.xml, but its plugin.xml
> file does not claim to support contentType: application/pdf
> ---------

The TextParser *was* enabled as a last-resort means of extracting *some*
content from a PDF file, that is, if the parse-pdf plugin wasn't enabled,
or if it failed for some reason. Since parse-text is the second option for
parsing PDF files, there most likely was some sort of error in the
original PDF parser. The way that the ParserFactory works now is that it
iterates through a preference list of parsers (specified in
$NUTCH_HOME/conf/parse-plugins.xml) and tries to parse the underlying
content. The first successful parse is returned to the Fetcher.

> 4.) Some .doc files can't be indexed, although I can open them via
> openoffice 2 with no problems
> ---------
> 2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/cards2005.doc of type
> application/msword
> 2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> index out of range: -1024
> ---------

What version of MS Word were you trying to index? I believe that the POI
library used by the word parser can only handle certain versions of MS
Word documents, although I'm not positive about this.

As for 5 and 6, I'm not entirely sure about those problems. I wish you
luck in solving both of them, though, and hope what I said above helps
you out. Thanks!

Cheers,
  Chris

> 5.)
> MoreIndexingFilter doesn't seem to work
> The relevant part of the hadoop.log file:
> ---------
> 2006-08-30 15:13:40,235 WARN more.MoreIndexingFilter -
> file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException:
> The type can not be null or empty
> ---------
> This happens with other file types, as well:
> ---------
> 2006-08-30 15:13:54,697 WARN more.MoreIndexingFilter -
> file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException:
> The type can not be null or empty
> ---------
>
> 6.) At the moment, I'm crawling the same directory (/mnt/bobdocs); the
> crawl process seems to be stuck in an infinite loop and I have no way
> of knowing what's going on, as the .log isn't flushed until the process
> finishes.
>
> ENVIRONMENT
>
> logs/hadoop.log inspection reveals things like this:
>
> My (relevant) crawl settings are:
>
> ---------
> <name>db.max.anchor.length</name>
> <value>511</value>
>
> <name>db.max.outlinks.per.page</name>
> <value>-1</value>
>
> <name>fetcher.server.delay</name>
> <value>0</value>
>
> <name>fetcher.threads.fetch</name>
> <value>5</value>
>
> <name>fetcher.verbose</name>
> <value>true</value>
>
> <name>file.content.limit</name>
> <value>102400000</value>
>
> <name>parser.character.encoding.default</name>
> <value>iso8859-2</value>
>
> <name>indexer.max.title.length</name>
> <value>511</value>
>
> <name>indexer.mergeFactor</name>
> <value>5</value>
>
> <name>indexer.minMergeDocs</name>
> <value>5</value>
>
> <name>plugin.includes</name>
> <value>nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
>
> <name>searcher.max.hits</name>
> <value>100</value>
> ---------
>
> MISC.
> SUGGESTIONS
>
> Add the following configuration options to the nutch-*.xml files:
> * allow search by date or extension (with no other criteria)
> * always flush log to disk (at every log addition).
>
> TIA,
> t.n.a.

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory
Pasadena, CA
Office: 171-266B
Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
