Hi list, does really no one have an idea what is wrong in my settings or in my way of thinking? Regards, Marcel
On 01.04.2010 at 14:18, toocrazym...@gmx.de wrote:

Hi list,

I am new to Nutch (and to Solr with Nutch integrated), and my English is not the best :/ In my company I have the task of replacing the MediaWiki search function with a Solr (plus Nutch) search function. In addition, I have to write a small Java program that reads ONE configuration file (XML) which configures everything necessary (Solr server address, Nutch path, regex-urlfilter, seeds, paths, nutch-site.xml and so on) by overwriting the Solr and Nutch configurations. The Solr server shall index the configured intranet paths (file, SMB shares, SVN, ...) and hold the index, and Nutch shall crawl the configured websites (HTML, PDF, DOC, ...) and index them into the Solr server. Currently I use the whole-web crawl script shown below.

Indexing plain-text websites into Solr is not a problem, but when I want to crawl a website that includes PDF files, there is (maybe) a problem I cannot resolve. My example configuration for crawling websites (with PDFs and so on) is also shown below. After the website is crawled, stored and indexed, I send a query to Solr for *ecv.pdf (http://127.0.0.1:8983/solr/select/?q=*ecv.pdf&qt=standard), but nothing is found :( When I query for *ctan* (http://127.0.0.1:8983/solr/select/?q=*ctan*&qt=standard) instead, it matches and I get, among other things, the following (the highlighted output contains the URL and a fragsize of 600 characters):

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/
<em>http://www.ctan.org</em>/. Installation ------------ After unzipping the package call: $ latex ecv.ins This will be extract some files (ecv.cls, ecvNLS.sty...). Create a directory beneath your TeX installation preferably create tex/latex/ecv and copy all files of the package and the extracted files to that directory. Call: mktexlsr Templates --------- In template.zip document templates for a german and an english CV can be found. Just unzip the template.zip, cd to template and run make to get the pdf. /Bernd Haberstumpf, <po...@kabatrinker.de>, 2007-01-06 /Christoph Neumann, <c.p

But I would like to get regular Solr results when querying for PDFs, like these:

/Users/USER_NAME/home/studies/semester-vi/bachelor/bachelorthesis/main.pdf
Symbolverzeichnis VIII Tabellenverzeichnis IX Listingverzeichnis X 1 Einleitung 1 2 Kapitel 1 2 3 Griechische Symbole 2 4 Kapitel 2 3 5 Zusammenfassung und Schlusswort 4 Anhang XI Literaturverzeichnis XII Glossar XIII V Abbildungsverzeichnis VI Abkürzungsverzeichnis
AD . . . . . . . . . . . . . . . . . . . . Active Directory
CD . . . . . . . . . . . . . . . . . . . . Compact Disc
MS . . . . . . . . . . . . . . . . . . . . Microsoft
VII Symbolverzeichnis
. . . . . . . . . . . . . . . . . . . . . . Eine beliebige Zahl, mit der der nachfolgende Ausdruck multipliziert wird.
? . . . . . . . . . . . . . . . . . . . . . Ein beliebiger Winkel.
. . . . . . . . . . . . . . . . . . . . . . Die Kreiszahl.
VIII Tabellenverzeichnis IX

This means that only the website itself is indexed, not the PDFs :( But I explicitly told Nutch in nutch-site.xml that it has to parse PDFs. Can anyone help me?
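By the way, to see what actually ended up in the Solr index, a quick check could look something like this. It is only a sketch: it assumes the stock Nutch schema.xml for Solr, in which the document address is stored in the "url" field, and that no more than 1000 documents are indexed.

# List the URLs of all indexed documents and keep only the ones ending in .pdf.
curl -s "http://127.0.0.1:8983/solr/select/?q=*:*&fl=url&rows=1000&wt=json" \
    | grep -io "http[^\"]*\.pdf"

If nothing shows up here, the PDFs never reached Solr, and the problem is on the Nutch side rather than in the query.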
Best regards, Marcel :)

urls/seed.txt:

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/

regex-urlfilter.txt:

-^(https|telnet|file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|wmf|mpg|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[...@=]
+^http://([a-z0-9]*\.)*ctan.org/tex-archive/macros/latex/contrib/ecv/
+.*

nutch-site.xml:

<configuration>
  <property>
    <name>http.robots.agents</name>
    <value>robots</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>name</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>desc</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>url</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>email</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>30</value>
    <description>The maximum number of urls per host in a single
    fetchlist. -1 if unlimited.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
    <description>If true, when adding new links to a page, links from
    the same host are ignored. This is an effective way to limit the
    size of the link database, keeping only the highest quality links.
    </description>
  </property>
</configuration>
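Since plugin.includes is a regular expression matched against the plugin directory names (as its own description says), a rough way to check which plugins actually match the configured value could look like the sketch below. It only approximates Nutch's filtering with grep -E, so treat the result as a hint, not proof.

# The value configured above, copied verbatim.
includes='protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)scoring-opic|urlnormalizer-(pass|regex|basic)'

# Test a few plugin ids the crawl depends on against that expression.
for plugin in protocol-http urlfilter-regex parse-html parse-pdf index-basic index-more scoring-opic urlnormalizer-basic; do
    if echo "$plugin" | grep -Eq "$includes"; then
        echo "$plugin: included"
    else
        echo "$plugin: NOT included"
    fi
done

Any plugin id that does not match this expression is excluded, which is easy to overlook after editing the value by hand.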
*" > echo "*********************************************" > > # define functions > > function functionPrintMessage { > msgType=$1 > msgBody=$2 > echo "[$msgType]:$msgBody" > } > > function functionPlayBeep { > echo -ne "\a" > } > > # set and initiate basic variables > > export > JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home > > crawlSizeBefore="" > crawlSizeAfter="" > generateDepth=3 > directoryUrls="urls" > directoryCrawl="crawl" > directoryCrawldb="$directoryCrawl/crawldb" > directoryLinkdb="$directoryCrawl/linkdb" > directorySegments="$directoryCrawl/segments" > directoryLogs="logs" > binaryNutch="bin/nutch" > urlToSolr="http://127.0.0.1:8983/solr/" > logHodoop="$directoryLogs/hodoop.log" > tmpSegment="" > tstampBefore="" > tstampAfter="" > tstampDiff="" > > # clear something > > if [ -e $logHodoop ]; then > rm $logHodoop > fi > > if [ ! -e $directoryCrawl/ ]; then > functionPrintMessage "STDERR" "THE $directoryCrawl FOLDER > DOESN'T EXISTS!" > mkdir $directoryCrawl/ > functionPrintMessage "STDOUT" "I'VE BEEN DONE FOR YOU :P" > fi > > # begin crawling > > tstampBefore=$(date +%s) > > functionPrintMessage "NUTCH" "INJECTING" > echo > sleep 5 > $binaryNutch inject $directoryCrawldb $directoryUrls > > echo > functionPrintMessage "STDOUT" "START GENERATING UNTIL DEPTH > $generateDepth" > > for (( i=0; i<$generateDepth; i++ )); do > > echo > functionPlayBeep > functionPrintMessage "STDOUT" "CURRENT DEPTH IS `expr $i + 1`, > `expr $generateDepth - $i - 1` ITERATIONS REMAIN." > > echo > ls -al $directoryCrawl/ > crawlSizeBefore=`du -hs $directoryCrawl` > > echo > functionPrintMessage "NUTCH" "GENERATING" > echo > sleep 5 > $binaryNutch generate $directoryCrawldb $directorySegments > -stats -adddays 0 -topN 1000 > > echo > ls -al $directoryCrawl/ > crawlSizeAfter=`du -hs $directoryCrawl` > > echo > functionPrintMessage "NUTCH" "FETCHING" > echo > sleep 5 > export tmpSegment=$directorySegments/`ls -tr > $directorySegments|tail -1` > $binaryNutch fetch $tmpSegment -noParsing > > echo > functionPrintMessage "NUTCH" "PARSING" > echo > sleep 5 > $binaryNutch parse $tmpSegment > > functionPrintMessage "NUTCH" "UPDATING" > echo > sleep 5 > $binaryNutch updatedb $directoryCrawldb $tmpSegment > -filter -normalize > > done > > echo > functionPrintMessage "NUTCH" "INVERTING" > echo > sleep 5 > $binaryNutch invertlinks $directoryLinkdb -dir $directorySegments > > echo > functionPrintMessage "NUTCH" "INDEXING" > echo > sleep 5 > $binaryNutch solrindex $urlToSolr $directoryCrawldb $directoryLinkdb > $directorySegments/* > > tstampAfter=$(date +%s) > tstampDiff=$(( $tstampAfter - $tstampBefore )) > > echo > functionPrintMessage "STDOUT" "CRAWLING FINISHED AFTER $tstampDiff > SECONDS (`expr $tstampDiff / 60` MINUTES)!" > echo > exit 0