First I set conf/crawl-urlfilter as follows:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
+.

With this I can crawl "http://guide.kapook.com", but I can't crawl "http://www.kapook.com"; some web pages can't be crawled at all. I want to know why.
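One quick way to see which filter line catches a given URL is to run the same patterns through plain java.util.regex. This is only a sketch that mirrors how the regex URL filter evaluates its rules (top-down, first match decides accept or reject, no match means reject); it does not use the Nutch plugin itself, and the class name UrlFilterCheck is made up for illustration:

import java.util.regex.Pattern;

/** Standalone check of which crawl-urlfilter rule matches a URL.
 *  Mirrors (but does not use) Nutch's regex URL filter: rules are
 *  tried top-down and the first match decides accept or reject. */
public class UrlFilterCheck {
    // first element: "+" = accept, "-" = reject; second: the pattern from crawl-urlfilter
    static final String[][] RULES = {
        {"-", "^(file|ftp|mailto):"},
        {"-", "\\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$"},
        {"-", "[?*!@=]"},
        {"-", ".*(/.+?)/.*?\\1/.*?\\1/"},
        {"+", "."},   // "skip everything else" line actually accepts everything else
    };

    public static void main(String[] args) {
        for (String url : args) {
            String verdict = "no rule matched (rejected)";
            for (String[] rule : RULES) {
                if (Pattern.compile(rule[1]).matcher(url).find()) {
                    verdict = (rule[0].equals("+") ? "accepted" : "rejected")
                              + " by rule " + rule[0] + rule[1];
                    break;
                }
            }
            System.out.println(url + " -> " + verdict);
        }
    }
}

Running it as "java UrlFilterCheck http://www.kapook.com http://guide.kapook.com" shows both bare URLs falling through to the final +. rule, which suggests the URL filter itself is probably not what rejects www.kapook.com.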
After the crawl the index is not complete. It does not have the segments file; it only has:

/user/nutch/crawld/indexes/part-00000/_0.fdt          <r 1>   365
/user/nutch/crawld/indexes/part-00000/_0.fdx          <r 1>   8
/user/nutch/crawld/indexes/part-00000/_0.fnm          <r 1>   66
/user/nutch/crawld/indexes/part-00000/_0.frq          <r 1>   370
/user/nutch/crawld/indexes/part-00000/_0.nrm          <r 1>   9
/user/nutch/crawld/indexes/part-00000/_0.prx          <r 1>   611
/user/nutch/crawld/indexes/part-00000/_0.tii          <r 1>   135
/user/nutch/crawld/indexes/part-00000/_0.tis          <r 1>   10553
/user/nutch/crawld/indexes/part-00000/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00000/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00000/segments_2      <r 1>   41
/user/nutch/crawld/indexes/part-00001/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00001/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00001/segments_1      <r 1>   20

How do I solve this?
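One sanity check, sketched below under the assumption that the part indexes are copied out of HDFS and opened with the Lucene 2.x-era API that Nutch of this vintage ships with (the class name IndexDocCount is just illustrative): open each part directory with IndexReader and print its document count. In the listing above, part-00001 has no _0.* segment files at all, which normally means it is an empty index.

import org.apache.lucene.index.IndexReader;

// Minimal sketch (Lucene 2.x-era API): print the document count of each
// part index after copying it to the local filesystem, e.g.
//   bin/hadoop dfs -copyToLocal /user/nutch/crawld/indexes/part-00000 /tmp/part-00000
public class IndexDocCount {
    public static void main(String[] args) throws Exception {
        for (String dir : args) {
            IndexReader reader = IndexReader.open(dir);   // open the local index directory
            System.out.println(dir + ": " + reader.numDocs() + " documents");
            reader.close();
        }
    }
}

Run it with the Lucene core jar from Nutch's lib directory on the classpath, passing the local copies of part-00000 and part-00001 as arguments.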
