No, that's not bad, but Jira is for issue tracking, not for general problems and questions. Please use the [email protected] mailing list. It would also help if you described your problem instead of just pasting a shell script.
Cheers

> [ https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146163#comment-13146163 ]
>
> behnam nikbakht commented on NUTCH-1199:
> ----------------------------------------
>
> the problem is huge number of unfetched urls, for example we have only 2000
> fetched urls from a site with 40000 urls and by command generate, we can
> not regenerate them and assign segments to them, so we use freegen command
> that create segments for unfetched urls and fetch them and update crawldb.
> is it a good or bad solution?
>
> > unfetched URLs problem
> > ----------------------
> >
> > Key: NUTCH-1199
> > URL: https://issues.apache.org/jira/browse/NUTCH-1199
> > Project: Nutch
> > Issue Type: Improvement
> > Components: fetcher, generator
> > Reporter: behnam nikbakht
> > Priority: Critical
> > Labels: db_unfetched, fetch, freegen, generate, unfetched, updatedb
> >
> > we write a script to fetch unfetched urls:
> >
> > #first dump from readdb to a text file, and extract unfetched urls to a text file:
> > bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
> > cat $SITE_DIR/tmp/dump_urls.txt/part-00000 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
> > unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
> > cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file
> > unfetched_count=`cat $unfetched_urls_file|wc -l`
> >
> > #next, we have a list of unfetched urls in unfetched_urls.txt , then, we use command freegen to create segments for
> > #these urls, we can not use command generate because these url's were generated previously
> >
> > if [[ $unfetched_count -lt $it_size ]]
> > then
> >     echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
> >     ((J++))
> >     bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
> >     s2=`ls -d $crawlseg/2* | tail -1`
> >     bin/nutch fetch $s2
> >     bin/nutch parse $s2
> >     bin/nutch updatedb $crawldb $s2
> >     echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
> >     get_new_links
> >     exit
> > fi
> >
> > # if number of urls are greater than it_size, then package them
> >
> > ij=1
> > while read line
> > do
> >     let "ind = $ij / $it_size"
> >     mkdir $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
> >     echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
> >     echo $ind
> >     ((ij++))
> >     let "completed=$ij % $it_size"
> >
> >     if [[ $completed -eq 0 ]]
> >     then
> >         echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
> >         ((J++))
> >         bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
> >
> >         #finally fetch,parse and update new segment
> >
> >         s2=`ls -d $crawlseg/2* | tail -1`
> >         bin/nutch fetch $s2
> >         bin/nutch parse $s2
> >         rm $crawldb/.locked
> >         bin/nutch updatedb $crawldb $s2
> >         echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
> >     fi
> > done <$unfetched_urls_file
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
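
For readers trying to follow the quoted script, here is a condensed sketch of the same workflow: dump the crawldb, pull out the db_unfetched URLs, split them into batches, and run freegen, fetch, parse and updatedb on each batch. It is not the reporter's exact script and not a recommended Nutch procedure. The names NUTCH, BATCH_SIZE, extract_unfetched, split_batches, process_batch and the work-directory layout are hypothetical, and the sketch assumes (as the quoted awk command implies) that the URL is the first double-quoted field of each CSV row and that the dump is written to part-00000.

```shell
#!/bin/sh
# Sketch only. NUTCH, BATCH_SIZE and the directory layout are assumptions.
set -eu

NUTCH="${NUTCH:-bin/nutch}"
BATCH_SIZE="${BATCH_SIZE:-1000}"

# Print the first double-quoted field (assumed to be the URL) of every
# db_unfetched row in a CSV dump file.
extract_unfetched() (
  # grep exits 1 when nothing matches; treat that as an empty result.
  { grep 'db_unfetched' "$1" || true; } | awk -F '"' '{print $2}'
)

# Split a URL list ($1) into one directory per batch of at most $3 lines:
# $2/batch_aa/urls.txt, $2/batch_ab/urls.txt, ...
split_batches() (
  mkdir -p "$2"
  split -l "$3" "$1" "$2/chunk_"
  for chunk in "$2"/chunk_*; do
    [ -e "$chunk" ] || continue          # no chunks if the list was empty
    batch_dir="$2/batch_${chunk##*chunk_}"
    mkdir -p "$batch_dir"
    mv "$chunk" "$batch_dir/urls.txt"
  done
)

# Create a segment for one batch directory, then fetch, parse and merge it.
process_batch() (
  "$NUTCH" freegen "$1" "$3"
  segment=$(ls -d "$3"/2* | tail -1)     # newest segment, as in the quoted script
  "$NUTCH" fetch "$segment"
  "$NUTCH" parse "$segment"
  "$NUTCH" updatedb "$2" "$segment"
)

# usage: script <crawldb> <segments_dir> <work_dir>
main() (
  "$NUTCH" readdb "$1" -dump "$3/dump" -format csv
  extract_unfetched "$3/dump/part-00000" > "$3/unfetched.txt"
  split_batches "$3/unfetched.txt" "$3/batches" "$BATCH_SIZE"
  for batch in "$3"/batches/batch_*/; do
    [ -d "$batch" ] || continue
    process_batch "$batch" "$1" "$2"
  done
)

# Only run the full pipeline when all three arguments are supplied.
if [ "$#" -eq 3 ]; then
  main "$@"
fi
```

Using split for the batching replaces the per-line counter arithmetic in the quoted loop, and it also handles the final partial batch, which the quoted loop only processes when the counter lands exactly on a multiple of it_size.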

