Hi list, does really no one have an idea what is wrong in my settings or in my way of thinking? Regards, Marcel
On 01.04.2010 at 14:18, toocrazym...@gmx.de wrote:

Hi list,

I am new to Nutch (and to Solr with Nutch integrated), and my English is not the best :/ In my company I have the task of replacing the MediaWiki search function with a Solr (plus Nutch) search function. In addition, I have to write a small Java program that reads ONE configuration file (XML) which configures everything necessary (Solr server address, Nutch path, regex-urlfilter, seeds, paths, nutch-site.xml and so on) by overwriting the Solr and Nutch configurations. The Solr server shall index the configured intranet paths (file, SMB shares, SVN, ...) and hold the index, and Nutch shall crawl the configured websites (HTML, PDF, DOC, ...) and index them into the Solr server. Currently I use the whole-web crawl script shown below.

Indexing plain-text websites into Solr is not a problem, but when I want to crawl a website that includes PDF files, there is (maybe) a problem I cannot resolve. My example configuration for crawling websites (with PDFs and so on) is also shown below. After the website is crawled, stored and indexed, I send a query to Solr for *ecv.pdf (http://127.0.0.1:8983/solr/select/?q=*ecv.pdf&qt=standard), but nothing is found :( When I query for *ctan* (http://127.0.0.1:8983/solr/select/?q=*ctan*&qt=standard) instead, it matches and I get, among other things, the following (the highlighted output contains the URL and a fragsize of 600 characters):

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/
<em>http://www.ctan.org</em>/. Installation ------------ After unzipping the package call: $ latex ecv.ins This will be extract some files (ecv.cls, ecvNLS.sty...). Create a directory beneath your TeX installation preferably create tex/latex/ecv and copy all files of the package and the extracted files to that directory. Call: mktexlsr Templates --------- In template.zip document templates for a german and an english CV can be found. Just unzip the template.zip, cd to template and run make to get the pdf. /Bernd Haberstumpf, <po...@kabatrinker.de>, 2007-01-06 /Christoph Neumann, <c.p

But I would like to get regular Solr results when querying for PDFs, like these:

/Users/USER_NAME/home/studies/semester-vi/bachelor/bachelorthesis/main.pdf
Symbolverzeichnis VIII Tabellenverzeichnis IX Listingverzeichnis X 1 Einleitung 1 2 Kapitel 1 2 3 Griechische Symbole 2 4 Kapitel 2 3 5 Zusammenfassung und Schlusswort 4 Anhang XI Literaturverzeichnis XII Glossar XIII V Abbildungsverzeichnis VI Abkürzungsverzeichnis
AD . . . . . . . . . . . . . . . . . . . . Active Directory
CD . . . . . . . . . . . . . . . . . . . . Compact Disc
MS . . . . . . . . . . . . . . . . . . . . Microsoft
VII Symbolverzeichnis
. . . . . . . . . . . . . . . . . . . . . . Eine beliebige Zahl, mit der der nachfolgende Ausdruck multipliziert wird.
? . . . . . . . . . . . . . . . . . . . . . Ein beliebiger Winkel.
. . . . . . . . . . . . . . . . . . . . . . Die Kreiszahl.
VIII Tabellenverzeichnis IX

This means that only the website itself is indexed, not the PDFs :( But I explicitly told Nutch in nutch-site.xml that it has to parse PDFs. Can anyone help me?
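By the way, to see what actually ended up in the Solr index, a quick check could look something like this. It is only a sketch: it assumes the stock Nutch schema.xml for Solr, in which the document address is stored in the "url" field, and that no more than 1000 documents are indexed.

# List the URLs of all indexed documents and keep only the ones ending in .pdf.
curl -s "http://127.0.0.1:8983/solr/select/?q=*:*&fl=url&rows=1000&wt=json" \
    | grep -io "http[^\"]*\.pdf"

If nothing shows up here, the PDFs never reached Solr, and the problem is on the Nutch side rather than in the query.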
Best regards, Marcel :)

urls/seed.txt:

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/

regex-urlfilter.txt:

-^(https|telnet|file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|wmf|mpg|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[...@=]
+^http://([a-z0-9]*\.)*ctan.org/tex-archive/macros/latex/contrib/ecv/
+.*

nutch-site.xml:

<configuration>
  <property>
    <name>http.robots.agents</name>
    <value>robots</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>name</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>desc</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>url</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>email</value>
    <description>see nutch-default.xml</description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>30</value>
    <description>The maximum number of urls per host in a single
    fetchlist. -1 if unlimited.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
    <description>If true, when adding new links to a page, links from
    the same host are ignored. This is an effective way to limit the
    size of the link database, keeping only the highest quality links.
    </description>
  </property>
</configuration>
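Since plugin.includes is a regular expression matched against the plugin directory names (as its own description says), a rough way to check which plugins actually match the configured value could look like the sketch below. It only approximates Nutch's filtering with grep -E, so treat the result as a hint, not proof.

# The value configured above, copied verbatim.
includes='protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)scoring-opic|urlnormalizer-(pass|regex|basic)'

# Test a few plugin ids the crawl depends on against that expression.
for plugin in protocol-http urlfilter-regex parse-html parse-pdf index-basic index-more scoring-opic urlnormalizer-basic; do
    if echo "$plugin" | grep -Eq "$includes"; then
        echo "$plugin: included"
    else
        echo "$plugin: NOT included"
    fi
done

Any plugin id that does not match this expression is excluded, which is easy to overlook after editing the value by hand.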
*" > echo "*********************************************" > > # define functions > > function functionPrintMessage { > msgType=$1 > msgBody=$2 > echo "[$msgType]:$msgBody" > } > > function functionPlayBeep { > echo -ne "\a" > } > > # set and initiate basic variables > > export > JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home > > crawlSizeBefore="" > crawlSizeAfter="" > generateDepth=3 > directoryUrls="urls" > directoryCrawl="crawl" > directoryCrawldb="$directoryCrawl/crawldb" > directoryLinkdb="$directoryCrawl/linkdb" > directorySegments="$directoryCrawl/segments" > directoryLogs="logs" > binaryNutch="bin/nutch" > urlToSolr="http://127.0.0.1:8983/solr/" > logHodoop="$directoryLogs/hodoop.log" > tmpSegment="" > tstampBefore="" > tstampAfter="" > tstampDiff="" > > # clear something > > if [ -e $logHodoop ]; then > rm $logHodoop > fi > > if [ ! -e $directoryCrawl/ ]; then > functionPrintMessage "STDERR" "THE $directoryCrawl FOLDER > DOESN'T EXISTS!" > mkdir $directoryCrawl/ > functionPrintMessage "STDOUT" "I'VE BEEN DONE FOR YOU :P" > fi > > # begin crawling > > tstampBefore=$(date +%s) > > functionPrintMessage "NUTCH" "INJECTING" > echo > sleep 5 > $binaryNutch inject $directoryCrawldb $directoryUrls > > echo > functionPrintMessage "STDOUT" "START GENERATING UNTIL DEPTH > $generateDepth" > > for (( i=0; i<$generateDepth; i++ )); do > > echo > functionPlayBeep > functionPrintMessage "STDOUT" "CURRENT DEPTH IS `expr $i + 1`, > `expr $generateDepth - $i - 1` ITERATIONS REMAIN." > > echo > ls -al $directoryCrawl/ > crawlSizeBefore=`du -hs $directoryCrawl` > > echo > functionPrintMessage "NUTCH" "GENERATING" > echo > sleep 5 > $binaryNutch generate $directoryCrawldb $directorySegments > -stats -adddays 0 -topN 1000 > > echo > ls -al $directoryCrawl/ > crawlSizeAfter=`du -hs $directoryCrawl` > > echo > functionPrintMessage "NUTCH" "FETCHING" > echo > sleep 5 > export tmpSegment=$directorySegments/`ls -tr > $directorySegments|tail -1` > $binaryNutch fetch $tmpSegment -noParsing > > echo > functionPrintMessage "NUTCH" "PARSING" > echo > sleep 5 > $binaryNutch parse $tmpSegment > > functionPrintMessage "NUTCH" "UPDATING" > echo > sleep 5 > $binaryNutch updatedb $directoryCrawldb $tmpSegment > -filter -normalize > > done > > echo > functionPrintMessage "NUTCH" "INVERTING" > echo > sleep 5 > $binaryNutch invertlinks $directoryLinkdb -dir $directorySegments > > echo > functionPrintMessage "NUTCH" "INDEXING" > echo > sleep 5 > $binaryNutch solrindex $urlToSolr $directoryCrawldb $directoryLinkdb > $directorySegments/* > > tstampAfter=$(date +%s) > tstampDiff=$(( $tstampAfter - $tstampBefore )) > > echo > functionPrintMessage "STDOUT" "CRAWLING FINISHED AFTER $tstampDiff > SECONDS (`expr $tstampDiff / 60` MINUTES)!" > echo > exit 0