[ 
https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146163#comment-13146163
 ] 

behnam nikbakht commented on NUTCH-1199:
----------------------------------------

The problem is the huge number of unfetched URLs: for example, we have only 2000 
fetched URLs from a site with 40000 URLs, and the generate command will not 
re-select them and assign segments to them. So we use the freegen command, which 
creates segments for the unfetched URLs; we then fetch them and update the crawldb.
Is this a good or a bad solution?
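
For what it's worth, the batching part of our loop can be sketched with the standard split(1) utility instead of counting lines by hand. This is only a rough sketch: the temp paths, the it_size value, and the example URLs are made up, and the actual bin/nutch calls are left as comments so the splitting logic stands on its own:

```shell
#!/bin/sh
# Sketch: split the unfetched-URL list into batches of $it_size lines,
# then (in a real run) freegen/fetch/parse/updatedb each batch.
it_size=3
tmpdir=$(mktemp -d)
unfetched_urls_file="$tmpdir/unfetched_urls.txt"

# Example input: 7 URLs, so we expect 3 batch files (3 + 3 + 1 lines).
for i in 1 2 3 4 5 6 7; do
    echo "http://example.com/page$i" >> "$unfetched_urls_file"
done

# split(1) writes files batch_aa, batch_ab, ... of $it_size lines each.
split -l "$it_size" "$unfetched_urls_file" "$tmpdir/batch_"

for batch in "$tmpdir"/batch_*; do
    echo "batch $batch has $(wc -l < "$batch") urls"
    # In the real script each batch would then be processed, e.g.:
    #   bin/nutch freegen "$batch" "$crawlseg"
    #   s2=$(ls -d "$crawlseg"/2* | tail -1)
    #   bin/nutch fetch "$s2"
    #   bin/nutch parse "$s2"
    #   bin/nutch updatedb "$crawldb" "$s2"
done
```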
                
> unfetched URLs problem
> ----------------------
>
>                 Key: NUTCH-1199
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1199
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, generator
>            Reporter: behnam nikbakht
>            Priority: Critical
>              Labels: db_unfetched, fetch, freegen, generate, unfetched, updatedb
>
> we wrote a script to fetch the unfetched URLs:
> #first, dump the crawldb with readdb to a text file and extract the unfetched URLs:
>         bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
>         cat $SITE_DIR/tmp/dump_urls.txt/part-00000 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
>         unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
>         cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file
>         unfetched_count=`cat $unfetched_urls_file | wc -l`
> #next, we have the list of unfetched URLs in unfetched_urls.txt; we use the
> #freegen command to create segments for these URLs, because the generate
> #command will not re-select URLs that were generated previously
>         if [[ $unfetched_count -lt $it_size ]]
>         then
>                 echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>                 ((J++))
>                 bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
>                 s2=`ls -d $crawlseg/2* | tail -1`
>                 bin/nutch fetch $s2
>                 bin/nutch parse $s2
>                 bin/nutch updatedb $crawldb $s2
>                 echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>                 get_new_links
>                 exit
>         fi
> #if the number of URLs is greater than it_size, split them into batches
>         ij=1
>         while read line
>         do
>                 let "ind = $ij / $it_size"
>                 mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
>                 echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
>                 echo $ind
>                 ((ij++))
>                 let "completed = $ij % $it_size"
>                 if [[ $completed -eq 0 ]]
>                 then
>                         echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>                         ((J++))
>                         bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
> #finally, fetch, parse, and updatedb the new segment
>                         s2=`ls -d $crawlseg/2* | tail -1`
>                         bin/nutch fetch $s2
>                         bin/nutch parse $s2
>                         rm $crawldb/.locked
>                         bin/nutch updatedb $crawldb $s2
>                         echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>                 fi
>         done < $unfetched_urls_file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
