Hi list,

I am new to Nutch (and to running Nutch integrated with Solr), and my English is not the best :/

At my company I have the task of replacing the MediaWiki search function with a Solr (plus Nutch) search function. In addition, I have to write a small Java program that reads ONE configuration file (XML) which configures everything necessary (Solr server address, Nutch path, regex-urlfilter, seeds, paths, nutch-site.xml and so on) by overwriting the Solr and Nutch configuration files. The Solr server shall index the configured intranet paths (file, SMB shares, SVN, ...) and hold the index, and Nutch shall crawl the configured websites (HTML, PDF, DOC, ...) and index them into the Solr server.
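To illustrate what I mean, the single configuration file could look roughly like this. All element names and paths are only my own idea (nothing official); my program would translate them into the real Solr/Nutch config files:

<!-- sketch only: every element name and path here is my own invention -->
<search-config>
  <solr-server-address>http://127.0.0.1:8983/solr/</solr-server-address>
  <nutch-path>/opt/nutch</nutch-path>
  <!-- copied verbatim into urls/seed.txt -->
  <seeds>
    <seed>http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/</seed>
  </seeds>
  <!-- copied verbatim into conf/regex-urlfilter.txt -->
  <url-filters>
    <rule>+^http://([a-z0-9]*\.)*ctan.org/</rule>
    <rule>+.*</rule>
  </url-filters>
  <!-- each entry overwrites a property in conf/nutch-site.xml -->
  <nutch-site>
    <property name="http.agent.name" value="name"/>
  </nutch-site>
  <!-- intranet paths (file, smb, svn, ...) that Solr itself shall index -->
  <intranet-paths>
    <path>file:///srv/share</path>
  </intranet-paths>
</search-config>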
Currently I use the whole-web crawl script shown at the end of this mail. Indexing plain-text and HTML websites into Solr is not a problem, but when I crawl a website that includes PDF files, something goes wrong for me. My example configuration for crawling such a website (with PDFs and so on) is also below.

After the website is crawled, stored and indexed, I send a query to Solr for *ecv.pdf (http://127.0.0.1:8983/solr/select/?q=*ecv.pdf&qt=standard), but nothing is found :( When I query for *ctan* (http://127.0.0.1:8983/solr/select/?q=*ctan*&qt=standard) instead, it matches and I get, among others, this hit (the highlighted output contains the URL and a fragsize of 600 characters):

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/
<em>http://www.ctan.org</em>/. Installation ------------ After unzipping the package call: $ latex ecv.ins This will be extract some files (ecv.cls, ecvNLS.sty...). Create a directory beneath your TeX installation preferably create tex/latex/ecv and copy all files of the package and the extracted files to that directory. Call: mktexlsr Templates --------- In template.zip document templates for a german and an english CV can be found. Just unzip the template.zip, cd to template and run make to get the pdf. /Bernd Haberstumpf, <po...@kabatrinker.de>, 2007-01-06 /Christoph Neumann, <c.p

But I would like to get the PDFs themselves as regular Solr results, like this hit from a local test with one of my own PDFs (a German bachelor thesis; the snippet is its extracted table of contents):

/Users/USER_NAME/home/studies/semester-vi/bachelor/bachelorthesis/main.pdf
Symbolverzeichnis VIII Tabellenverzeichnis IX Listingverzeichnis X 1 Einleitung 1 2 Kapitel 1 2 3 Griechische Symbole 2 4 Kapitel 2 3 5 Zusammenfassung und Schlusswort 4 Anhang XI Literaturverzeichnis XII Glossar XIII V Abbildungsverzeichnis VI Abkürzungsverzeichnis AD . . . . . . . . . . . . . . . . . . . . Active Directory CD . . . . . . . . . . . . . . . . . . . . Compact Disc MS . . . . . . . . . . . . . . . . . . . . Microsoft VII Symbolverzeichnis . . . . . . . . . . . . . . . . . . . . . . Eine beliebige Zahl, mit der der nachfol- gende Ausdruck multipliziert wird. ? . . . . . . . . . . . . . . . . . . . . . . Ein beliebiger Winkel. . . . . . . . . . . . . . . . . . . . . . . Die Kreiszahl. VIII Tabellenverzeichnis IX

This means that only the website itself is indexed and not the PDFs :( although I explicitly told Nutch in nutch-site.xml to parse PDFs. Can anyone help me?

Best regards
Marcel :)

urls/seed.txt:

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/

regex-urlfilter.txt:

-^(https|telnet|file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|wmf|mpg|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[...@=]
+^http://([a-z0-9]*\.)*ctan.org/tex-archive/macros/latex/contrib/ecv/
+.*
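If I understand regex-urlfilter.txt correctly, the rules are checked from top to bottom and the first matching rule decides, so PDF URLs should be accepted (only the listed image/movie extensions are rejected). I think one can feed a URL through all activated filters on the command line, roughly like this (I took the URLFilterChecker class name from the Nutch sources and ecv.pdf is only an assumed file name, so please correct me if this is wrong for my version):

# prints "+<url>" when the URL passes all activated filters, "-<url>" when rejected
# (ecv.pdf is an assumed example file on the seed site)
echo "http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/ecv.pdf" \
  | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined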
nutch-site.xml:

<configuration>
  <property>
    <name>http.robots.agents</name>
    <value>robots</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>name</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>desc</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>url</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>email</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>30</value>
    <description>The maximum number of urls per host in a single
    fetchlist. -1 if unlimited.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In any
    case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP, and
    basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems
    with the underlying commons-httpclient library.</description>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
    <description>If true, when adding new links to a page, links from the
    same host are ignored. This is an effective way to limit the size of
    the link database, keeping only the highest quality links.
    </description>
  </property>
</configuration>
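With parse-pdf activated in plugin.includes like this, I expected the PDFs to be parsed. To test the parser in isolation, I believe a single URL can be checked directly; roughly like this (the parsechecker command is what I found in the Nutch wiki, I am not sure it is available in every version, and ecv.pdf is again only an assumed file name):

# fetches and parses a single URL with the currently configured plugins;
# -dumpText also prints the plain text the parser extracted
# (ecv.pdf is an assumed example file on the seed site)
bin/nutch parsechecker -dumpText http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/ecv.pdf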
the crawl script:

#!/bin/bash
clear
echo "*********************************************"
echo "* This script crawls whole-web sites and   *"
echo "* indexes them into Apache Solr.            *"
echo "*********************************************"

# define functions
function functionPrintMessage {
    msgType=$1
    msgBody=$2
    echo "[$msgType]:$msgBody"
}

function functionPlayBeep {
    echo -ne "\a"
}

# set and initiate basic variables
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
crawlSizeBefore=""
crawlSizeAfter=""
generateDepth=3
directoryUrls="urls"
directoryCrawl="crawl"
directoryCrawldb="$directoryCrawl/crawldb"
directoryLinkdb="$directoryCrawl/linkdb"
directorySegments="$directoryCrawl/segments"
directoryLogs="logs"
binaryNutch="bin/nutch"
urlToSolr="http://127.0.0.1:8983/solr/"
logHadoop="$directoryLogs/hadoop.log"
tmpSegment=""
tstampBefore=""
tstampAfter=""
tstampDiff=""

# remove the old log and make sure the crawl directory exists
if [ -e $logHadoop ]; then
    rm $logHadoop
fi
if [ ! -e $directoryCrawl/ ]; then
    functionPrintMessage "STDERR" "THE $directoryCrawl FOLDER DOESN'T EXIST!"
    mkdir $directoryCrawl/
    functionPrintMessage "STDOUT" "I'VE CREATED IT FOR YOU :P"
fi

# begin crawling
tstampBefore=$(date +%s)
functionPrintMessage "NUTCH" "INJECTING"
echo
sleep 5
$binaryNutch inject $directoryCrawldb $directoryUrls
echo
functionPrintMessage "STDOUT" "START GENERATING UNTIL DEPTH $generateDepth"
for (( i=0; i<$generateDepth; i++ )); do
    echo
    functionPlayBeep
    functionPrintMessage "STDOUT" "CURRENT DEPTH IS `expr $i + 1`, `expr $generateDepth - $i - 1` ITERATIONS REMAIN."
    echo
    ls -al $directoryCrawl/
    crawlSizeBefore=`du -hs $directoryCrawl`
    echo
    functionPrintMessage "NUTCH" "GENERATING"
    echo
    sleep 5
    $binaryNutch generate $directoryCrawldb $directorySegments -stats -adddays 0 -topN 1000
    echo
    ls -al $directoryCrawl/
    crawlSizeAfter=`du -hs $directoryCrawl`
    echo
    functionPrintMessage "NUTCH" "FETCHING"
    echo
    sleep 5
    # fetch the newest segment without parsing; parsing is a separate step below
    export tmpSegment=$directorySegments/`ls -tr $directorySegments | tail -1`
    $binaryNutch fetch $tmpSegment -noParsing
    echo
    functionPrintMessage "NUTCH" "PARSING"
    echo
    sleep 5
    $binaryNutch parse $tmpSegment
    functionPrintMessage "NUTCH" "UPDATING"
    echo
    sleep 5
    $binaryNutch updatedb $directoryCrawldb $tmpSegment -filter -normalize
done
echo
functionPrintMessage "NUTCH" "INVERTING"
echo
sleep 5
$binaryNutch invertlinks $directoryLinkdb -dir $directorySegments
echo
functionPrintMessage "NUTCH" "INDEXING"
echo
sleep 5
$binaryNutch solrindex $urlToSolr $directoryCrawldb $directoryLinkdb $directorySegments/*
tstampAfter=$(date +%s)
tstampDiff=$(( $tstampAfter - $tstampBefore ))
echo
functionPrintMessage "STDOUT" "CRAWLING FINISHED AFTER $tstampDiff SECONDS (`expr $tstampDiff / 60` MINUTES)!"
echo
exit 0
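PS: To narrow the problem down, I will also ask Solr directly whether any PDF document reached the index at all. Something like this should work if I read the Nutch schema.xml correctly (I assume the type field is filled by the index-more plugin):

# lists URLs and titles of all documents indexed with a PDF content type
# ("type" should be written by the index-more plugin, if I understand schema.xml right)
curl "http://127.0.0.1:8983/solr/select/?q=type:application%2Fpdf&fl=url,title&rows=10&qt=standard"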