I don't know if they're really waiting, but I'm almost sure you'll just get
5 threads fetching. If it's a problem for you, I think you can turn this
restriction off. Have a look at the comments in nutch-default. It includes a
couple of restrictions for the same host, like max_number_of request or
And what about indexing in Solr during crawling? Does the data need some
after-crawl processing?
On Mon, Mar 30, 2009 at 2:30 AM, ram_sj rpachaiyap...@gmail.com wrote:
Hi,
I'm trying to provide search functionality for our website using Apache
Solr. We have an in-house developed crawler which
Is your crawl-urlfilter OK? Are you sure it's fetching them properly? Maybe
it's not getting the content of the pages and so it cannot extract links to
fetch in the next level (I suppose you have set the crawl depth just for the
seeds level).
So either your filters are skipping the seeds (I suppose
/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Is your crawl-urlfilter OK? Are you sure it's fetching them properly? Maybe
it's not getting the content of the pages and so it cannot extract links to
fetch in the next level (I suppose you have set the crawl depth just for the
seeds
And if you could post the log, I think it'll be easier.
2009/4/1 陈琛 kylin.chc...@gmail.com
Thanks. I have a collection of URLs; only for these four can I not search
their sub-pages.
The URLs and crawl-urlfilter are in the attachment.
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Send me the log of the crawl if possible. There are surely some clues
in it.
2009/4/1 陈琛 kylin.chc...@gmail.com
Yes, the depth is 10 and topN is 2000...
So strange: the other URLs are normal, but these 4 URLs..
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
seems
very much ;)
The log from Cygwin (out.txt)
and the Nutch log (hadoop.log).
I cannot find any clues.
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Send me the log of the crawl if possible. There are surely some clues
in it.
2009/4/1 陈琛 kylin.chc...@gmail.com
yes
website, perhaps the links come from other
sources,
like some JavaScript?
So I do not know which URLs can actually be fetched by Nutch..
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Strange, strange :). Maybe you got a timeout error? Have you changed this
property in the nutch-site
Try using this as the filter in crawl-urlfilter.txt and comment out the other
"+" lines:
+^http://([a-z0-9]*\.)*
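To see why commenting out the other "+" lines matters, here is a small Python sketch of how I understand the urlfilter rules to be applied: rules are tried top to bottom, the first one that matches decides, and a URL matching no rule is skipped. This is not Nutch code. The host `example.com` is only an assumption to complete the pattern, since the one quoted above is cut off after the last `*`, and `parse_rules` and `accepts` are made-up helper names.

```python
import re


def parse_rules(text):
    """Parse crawl-urlfilter style lines into (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Assumption: example.com stands in for the real host, which is cut off
# in the pattern quoted in the thread.
RULES = parse_rules(r"""
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept any host under the (hypothetical) domain
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
""")

if __name__ == "__main__":
    for url in ("http://www.example.com/page.html",
                "http://other.org/",
                "mailto:someone@example.com"):
        print(url, accepts(RULES, url))
```

If a seed URL comes back as rejected here, the filter is the first thing to look at before the fetcher logs.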
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Yeah, I thought of that first, but I've been having a look at those websites
and they have some normal links. I'm going to deploy a nutch
...@gmail.com
fetch other URLs, not the sub-pages..
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
Try using this as the filter in crawl-urlfilter.txt and comment out the other
"+" lines:
+^http://([a-z0-9]*\.)*
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
yeah
is a sub-page of http://www.corninc.com.la
2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com
I've tried with these 3 sites and this is what my Nutch got:
crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Are you commenting out or adapting this line in crawl-urlfilter?
-^(file|ftp|mailto):
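A quick way to check the effect of that line is the Python sketch below. It assumes the first matching rule decides and that a URL matching no rule is skipped. The file path and the `+^file:` accept rule are only illustrations, not taken from the original setup, and `first_match_decision` is a made-up helper name. It only covers the filter line, not any other setting a local crawl might need.

```python
import re


def first_match_decision(rule_lines, url):
    """Return True if the first matching +/- rule accepts the URL."""
    for line in rule_lines:
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is skipped


# Hypothetical local URL, for illustration only.
url = "file:///data/docs/index.html"

# With the line left as it is, file: URLs are rejected before any "+" rule.
default_rules = ["-^(file|ftp|mailto):", "+."]

# Adapted: file removed from the skip list, plus an assumed accept rule.
adapted_rules = ["-^(ftp|mailto):", "+^file:"]

if __name__ == "__main__":
    print("default:", first_match_decision(default_rules, url))
    print("adapted:", first_match_decision(adapted_rules, url))
```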
On Thu, Apr 2, 2009 at 5:23 PM, Wolf Fischer
wolf.fisc...@informatik.uni-augsburg.de wrote:
Hi there,
I am currently trying to use Nutch for a local file directory. I have the URL to
the directory, which looks
You can also keep an eye on the topN and the number of fetching threads.
On Thu, Apr 9, 2009 at 4:48 PM, yanky young yanky.yo...@gmail.com wrote:
Why not just add the -Xms and -Xmx JVM parameters to see if it still happens?
2009/4/9 srinivas jaini srinivasja...@gmail.com
I've checked out code and am
I think Nutch is using the crawl dir as the urlDir:
Injector: urlDir: crawl
Try this: rooturl -dir crawl -threads 5 -depth 3 -topN 3
On Thu, Apr 23, 2009 at 11:48 AM, askNutch hehehah...@126.com wrote:
Thank you, but I run in root!
Raymond Balmès wrote:
not sure if it helps, but I
I had some problems fetching gzipped content when setting content-limit=-1
in the 0.8 version... Maybe it's part of your problem? Hope this helps.
On Fri, May 1, 2009 at 3:43 PM, tsmori tim_m...@ncsu.edu wrote:
I'm having an interesting problem that I think revolves around the
interplay