Hi list,

I am new to using Nutch (and to Solr with Nutch integrated into it), and my English 
is not the best :/ At my company I have the task of replacing the MediaWiki search 
function with a Solr (plus Nutch) search function. In addition I have to write a 
small Java program that reads ONE configuration file (XML), which configures 
everything necessary (Solr server address, Nutch path, regex-urlfilter, seeds, 
paths, nutch-site.xml and so on) by overwriting the Solr and Nutch configurations; 
a rough sketch of how I imagine that file follows this paragraph. The Solr server 
shall index the configured intranet paths (file, SMB shares, SVN, ...) and hold the 
index, and Nutch shall crawl the configured websites (HTML, PDF, DOC, ...) and 
index them into the Solr server. Currently I use the whole-web crawl script shown 
at the bottom of this mail.
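
Here is the sketch I mentioned. The element names and values are made up by me and 
nothing is final; it is only meant to show what the Java program has to read:

<crawler-config>
    <!-- made-up structure, only for illustration -->
    <solr url="http://127.0.0.1:8983/solr/"/>
    <nutch home="/path/to/nutch"/>
    <seeds>
        <seed>http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/</seed>
    </seeds>
    <regex-urlfilter>
        <rule>+^http://([a-z0-9]*\.)*ctan.org/tex-archive/macros/latex/contrib/ecv/</rule>
    </regex-urlfilter>
    <nutch-site>
        <property name="http.agent.name" value="name"/>
    </nutch-site>
    <intranet-paths>
        <path>smb://fileserver/share</path>
    </intranet-paths>
</crawler-config>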
Indexing plain-text websites into Solr is not a problem, but when I want to crawl a 
website that includes PDF files, there is (maybe) an unresolved problem on my side. 
My example configurations for crawling websites (with PDFs and so on) are also shown 
below. After the website has been crawled, stored and indexed, I run a query against 
Solr to find *ecv.pdf (http://127.0.0.1:8983/solr/select/?q=*ecv.pdf&qt=standard), 
but nothing is found :( However, when I run the query for *ctan* 
(http://127.0.0.1:8983/solr/select/?q=*ctan*&qt=standard) it matches, and I get, 
among other things, the following (the highlighted output contains the URL and a 
fragment size of 600 characters):


http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/ 
<em>http://www.ctan.org</em>/. Installation ------------ After unzipping the 
package call: $ latex ecv.ins This will be extract some files (ecv.cls, 
ecvNLS.sty...). Create a directory beneath your TeX installation preferably 
create tex/latex/ecv and copy all files of the package and the extracted files 
to that directory. Call: mktexlsr Templates --------- In template.zip document 
templates for a german and an english CV can be found. Just unzip the 
template.zip, cd to template and run make to get the pdf. /Bernd Haberstumpf, 
<po...@kabatrinker.de>, 2007-01-06 /Christoph Neumann, <c.p

But I would like to be able to query the PDFs themselves and get regular Solr results like this one:

/Users/USER_NAME/home/studies/semester-vi/bachelor/bachelorthesis/main.pdf
Symbolverzeichnis VIII Tabellenverzeichnis IX Listingverzeichnis X 1 Einleitung 
1 2 Kapitel 1 2 3 Griechische Symbole 2 4 Kapitel 2 3 5 Zusammenfassung und 
Schlusswort 4 Anhang XI Literaturverzeichnis XII Glossar XIII V 
Abbildungsverzeichnis VI Abkürzungsverzeichnis AD . . . . . . . . . . . . . . . 
. . . . . Active Directory CD . . . . . . . . . . . . . . . . . . . . Compact 
Disc MS . . . . . . . . . . . . . . . . . . . . Microsoft VII Symbolverzeichnis 
. . . . . . . . . . . . . . . . . . . . . . Eine beliebige Zahl, mit der der 
nachfol- gende Ausdruck multipliziert wird. ? . . . . . . . . . . . . . . . . . 
. . . . . Ein beliebiger Winkel. . . . . . . . . . . . . . . . . . . . . . . 
Die Kreiszahl. VIII Tabellenverzeichnis IX

This means that only the website itself is indexed, and not the PDFs :( But I 
explicitly configured nutch-site.xml so that Nutch has to parse PDFs. Can anyone 
help me?
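
Side note: since index-more is in my plugin.includes, I would expect a content-type 
field in the index, so a query by type should show whether any PDF documents were 
indexed at all. I am assuming here that the 'type' field from the index-more plugin 
actually exists in my Solr schema; I have not verified that:

http://127.0.0.1:8983/solr/select/?q=type:application/pdf&qt=standard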

Best regards, Marcel :)

urls/seed.txt:

http://www.ctan.org/tex-archive/macros/latex/contrib/ecv/

regex-urlfilter.txt:
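# note: rules are checked from top to bottom; the first matching rule decides (+ = crawl, - = skip)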

-^(https|telnet|file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|wmf|mpg|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[...@=]
+^http://([a-z0-9]*\.)*ctan.org/tex-archive/macros/latex/contrib/ecv/
+.*

nutch-site.xml:

<configuration>
    <property>
        <name>http.robots.agents</name>
        <value>robots</value>
        <description>see nutch-default.xml</description>
    </property>

    <property>
        <name>http.agent.name</name>
        <value>name</value>
        <description>see nutch-default.xml</description>
    </property>
    
    <property>
        <name>http.agent.description</name>
        <value>desc</value>
        <description>see nutch-default.xml</description>
    </property>
    
    <property>
        <name>http.agent.url</name>
        <value>url</value>
        <description>see nutch-default.xml</description>
    </property>
    
    <property>
        <name>http.agent.email</name>
        <value>email</value>
        <description>see nutch-default.xml</description>
    </property>
    <property>
        <name>generate.max.per.host</name>
        <value>30</value>
        <description>The maximum number of urls per host in a single
        fetchlist.  -1 if unlimited.</description>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|zip)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description>Regular expression naming plugin directory names to
        include. Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin.
        By default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins. In order to use HTTPS please
        enable protocol-httpclient, but be aware of possible intermittent
        problems with the underlying commons-httpclient library.</description>
    </property>
    <property>
        <name>fetcher.verbose</name>
        <value>true</value>
        <description>If true, fetcher will log more verbosely.</description>
    </property>
    <property>
        <name>http.verbose</name>
        <value>true</value>
        <description>If true, HTTP will log more verbosely.</description>
    </property>
    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
        <description>If true, outlinks leading from a page to external hosts
        will be ignored. This is an effective way to limit the crawl to include
        only initially injected hosts, without creating complex URLFilters.
        </description>
    </property>
    <property>
        <name>db.ignore.internal.links</name>
        <value>false</value>
        <description>If true, when adding new links to a page, links from
        the same host are ignored. This is an effective way to limit the
        size of the link database, keeping only the highest quality links.
        </description>
    </property>
</configuration>


the crawl script:


#!/bin/bash

clear

echo    "*********************************************"
echo    "* This program crawls whole-web sites and   *"
echo    "* indexes them into Apache Solr.            *"
echo    "*********************************************"

# define functions

function functionPrintMessage {
        msgType=$1
        msgBody=$2
        echo "[$msgType]:$msgBody"
}

function functionPlayBeep {
        echo -ne "\a"
}

# set and initialize basic variables
        
        export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

        crawlSizeBefore=""
        crawlSizeAfter=""
        generateDepth=3
        directoryUrls="urls"
        directoryCrawl="crawl"
        directoryCrawldb="$directoryCrawl/crawldb"
        directoryLinkdb="$directoryCrawl/linkdb"
        directorySegments="$directoryCrawl/segments"
        directoryLogs="logs"
        binaryNutch="bin/nutch"
        urlToSolr="http://127.0.0.1:8983/solr/"
        logHadoop="$directoryLogs/hadoop.log"   # Nutch writes its log to logs/hadoop.log
        tmpSegment=""
        tstampBefore=""
        tstampAfter=""
        tstampDiff=""

# clean up leftovers from previous runs

        if [ -e $logHadoop ]; then
                rm $logHadoop
        fi

        if [ ! -e $directoryCrawl/ ]; then
                functionPrintMessage "STDERR" "THE $directoryCrawl FOLDER DOESN'T EXIST!"
                mkdir $directoryCrawl/
                functionPrintMessage "STDOUT" "I'VE CREATED IT FOR YOU :P"
        fi

# begin crawling

        tstampBefore=$(date +%s)
        
        functionPrintMessage "NUTCH" "INJECTING"
        echo
        sleep 5
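        # inject: load the seed URLs from the urls/ directory into the crawldb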
        $binaryNutch inject $directoryCrawldb $directoryUrls

        echo
        functionPrintMessage "STDOUT" "START GENERATING UNTIL DEPTH $generateDepth"

        for (( i=0; i<$generateDepth; i++ )); do        
        
                echo
                functionPlayBeep
                functionPrintMessage "STDOUT" "CURRENT DEPTH IS `expr $i + 1`, `expr $generateDepth - $i - 1` ITERATIONS REMAIN."
                
                echo
                ls -al $directoryCrawl/
                crawlSizeBefore=`du -hs $directoryCrawl`

                echo 
                functionPrintMessage "NUTCH" "GENERATING"
                echo
                sleep 5 
                # generate a fetch list (new segment) with up to 1000 URLs from the crawldb
                $binaryNutch generate $directoryCrawldb $directorySegments -stats -adddays 0 -topN 1000

                echo
                ls -al $directoryCrawl/
                crawlSizeAfter=`du -hs $directoryCrawl`

                # fetch the newest segment; parsing is done as a separate step below
                echo
                functionPrintMessage "NUTCH" "FETCHING"
                echo
                sleep 5
                export tmpSegment=$directorySegments/`ls -tr $directorySegments|tail -1`
                $binaryNutch fetch $tmpSegment -noParsing

                # parse the fetched content (this is where the PDF parsing should happen)
                echo
                functionPrintMessage "NUTCH" "PARSING"
                echo
                sleep 5
                $binaryNutch parse $tmpSegment

                # update the crawldb with the newly discovered links
                functionPrintMessage "NUTCH" "UPDATING"
                echo
                sleep 5
                $binaryNutch updatedb $directoryCrawldb $tmpSegment -filter -normalize

        done            

        echo 
        functionPrintMessage "NUTCH" "INVERTING"
        echo
        sleep 5
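        # invert the outlinks from all segments into the linkdb (used for anchor text)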
        $binaryNutch invertlinks $directoryLinkdb -dir $directorySegments

        echo 
        functionPrintMessage "NUTCH" "INDEXING"
        echo
        sleep 5
        # push the crawled and parsed data from all segments into Solr
        $binaryNutch solrindex $urlToSolr $directoryCrawldb $directoryLinkdb $directorySegments/*
        
        tstampAfter=$(date +%s)
        tstampDiff=$(( $tstampAfter - $tstampBefore ))

        echo
        functionPrintMessage "STDOUT" "CRAWLING FINISHED AFTER $tstampDiff SECONDS (`expr $tstampDiff / 60` MINUTES)!"
        echo
        exit 0
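
For completeness: I run the script from inside the Nutch installation directory, 
because bin/nutch, urls/ and crawl/ are all relative paths in it, e.g.:

cd $NUTCH_HOME && ./whole-web-crawl.sh

(the script file name is just what I happen to call it locally)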
