Re: fetcher questions

2009-03-26 Thread Alejandro Gonzalez
i don't know if they're really waiting but i'm almost sure you'll just get 5 threads fetching. if it's a problem for you i think you can turn this restriction off. have a look at the comments in nutch-default. it includes a pair of restrictions for the same host like max_number_of request or

Re: Crawler Output Flat file or Database?

2009-03-30 Thread Alejandro Gonzalez
and what about indexing in solr during crawling? does the data need some after-crawl processing? On Mon, Mar 30, 2009 at 2:30 AM, ram_sj rpachaiyap...@gmail.com wrote: Hi, I'm trying to provide search functionality for our website using Apache Solr. We have an in-house developed crawler which

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
is your crawl-urlfilter ok? are you sure it's fetching them properly? maybe it's not getting the content of the pages and so it cannot extract links to fetch in the next level (i suppose you have set the crawl depth just for the seeds level). So either your filters are skipping the seeds (i suppose

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com is your crawl-urlfilter ok? are you sure it's fetching them properly? maybe it's not getting the content of the pages and so it cannot extract links to fetch in the next level (i suppose you have set the crawl depth just for the seeds

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
and if you could post the log i think it'll be easier. 2009/4/1 陈琛 kylin.chc...@gmail.com thanks, i have a collection of urls. Only for these four it cannot fetch any of their sub-pages. the urls and crawl-urlfilter are in the attachment 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
send me the log of the crawling if possible. for sure there are some clues in it 2009/4/1 陈琛 kylin.chc...@gmail.com yes, the depth is 10 and topN is 2000... So strange, the other urls are normal.. but the 4 urls.. 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com seems

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
very much ;) the log in the cygwin (out.txt) and the nutch log (hadoop.log): i cannot find any clues 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com send me the log of the crawling if possible. for sure there are some clues in it 2009/4/1 陈琛 kylin.chc...@gmail.com yes

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
website. Perhaps the links come from other sources like some javascripts? so i do not know which urls can be fetched by nutch.. 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com strange strange :). maybe you got a timeout error? have you changed this property in the nutch-site

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
try using this as the filter in crawl-urlfilter.txt and comment out the other + lines: +^http://([a-z0-9]*\.)* 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com yeah i thought that at first, but i've been having a look at those websites and they have some normal links. i'm gonna deploy a nutch
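To see what that rule does, here is a minimal Python sketch of a first-match-wins filter in the style of crawl-urlfilter.txt. It is an illustration only, not Nutch's own code: the helper name `accepts` and the example URLs are hypothetical, and it assumes rules are tried top to bottom with an unanchored regex search, and that a URL matching no rule is skipped.

```python
import re

# Rules in the style of crawl-urlfilter.txt: a '+' or '-' sign followed by a regex.
RULES = [r"+^http://([a-z0-9]*\.)*"]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule (hypothetical helper)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is skipped.
    return False


if __name__ == "__main__":
    for url in (
        "http://www.corninc.com.la/",
        "http://www.corninc.com.la/sub/page.html",  # made-up sub-page path
        "ftp://example.org/file",                   # made-up non-http URL
    ):
        print(url, accepts(url))
```

Because the host group in the pattern is optional, the rule accepts every http:// URL, so sub-pages are not dropped on account of their host name.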

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
...@gmail.com fetch other urls, not sub-pages.. 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com try using this as the filter in crawl-urlfilter.txt and comment out the other + lines: +^http://([a-z0-9]*\.)* 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com yeah

Re: only fetch home page

2009-04-01 Thread Alejandro Gonzalez
is the sub-page of the http://www.corninc.com.la 2009/4/1 Alejandro Gonzalez alejandrogonzalezd...@gmail.com i've tried with these 3 sites and this is what my nutch got: crawl started in: crawl-20090401170024 rootUrlDir = urls threads = 5 depth = 3 topN = 30 Injector: starting

Re: Problem with Crawler and Parent Directories

2009-04-02 Thread Alejandro Gonzalez
are you commenting out or adapting this line in crawl-urlfilter? -^(file|ftp|mailto): On Thu, Apr 2, 2009 at 5:23 PM, Wolf Fischer wolf.fisc...@informatik.uni-augsburg.de wrote: Hi there, I am currently trying to use Nutch for a local file directory. I have the url to the directory, which looks
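A small Python sketch of why that line matters for a local file crawl. It is an illustration under assumptions, not Nutch code: `first_match` is a hypothetical helper, the `file:///data/docs/` URL is made up, and the catch-all `+.` rule stands in for whatever accept rules follow in the real file.

```python
import re


def first_match(url, rules):
    """Apply '+'/'-' regex rules top to bottom; the first match decides (hypothetical helper)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is skipped.
    return False


# The exclusion line quoted above, followed by an assumed catch-all accept rule.
DEFAULT_RULES = [r"-^(file|ftp|mailto):", r"+."]
# The same line adapted so that file: URLs are no longer excluded.
ADAPTED_RULES = [r"-^(ftp|mailto):", r"+."]

if __name__ == "__main__":
    url = "file:///data/docs/"  # made-up local directory URL
    print("default rules accept it:", first_match(url, DEFAULT_RULES))
    print("adapted rules accept it:", first_match(url, ADAPTED_RULES))
```

With the unmodified line, the file: URL hits the exclusion rule first and never reaches the accept rule; removing `file` from the alternation lets it through.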

Re: java heap space error

2009-04-09 Thread Alejandro Gonzalez
you can also keep an eye on the topN and the number of fetching threads On Thu, Apr 9, 2009 at 4:48 PM, yanky young yanky.yo...@gmail.com wrote: why not just add -Xms -Xmx jvm parameters to see if it still happens 2009/4/9 srinivas jaini srinivasja...@gmail.com I've checked out code and am

Re: run nutch on eclipse problem?

2009-04-23 Thread Alejandro Gonzalez
i think nutch is using the crawl dir as the urldir (Injector: urlDir: crawl). try this: rooturl -dir crawl -threads 5 -depth 3 -topN 3 On Thu, Apr 23, 2009 at 11:48 AM, askNutch hehehah...@126.com wrote: thank you, but i run in root! Raymond Balmès wrote: not sure if it helps, but I

Re: NullPointerExceptions in Fetch

2009-05-04 Thread Alejandro Gonzalez
i had some problems fetching gzipped contents when setting content-limit=-1 in the 0.8 version... Maybe it's part of your problem? Hope this will help you On Fri, May 1, 2009 at 3:43 PM, tsmori tim_m...@ncsu.edu wrote: I'm having an interesting problem that I think revolves around the interplay