Re: how to crawl multiple websites in each run?

2009-03-02 Thread yanky young
Hi: I am not a Nutch expert, but I think your problem is easy. 1. Make a list of seed URLs in a file under the urls folder. 2. Add all of the domains that you want to crawl to crawl-urlfilter.txt, like this: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*aaa.edu/
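The two steps above might look like the following for multiple domains; the domain names here are placeholders, and the file name seed.txt is an arbitrary choice (any file under the urls folder works):

```
# urls/seed.txt -- one seed URL per line (hypothetical domains)
http://www.aaa.edu/
http://www.bbb.edu/
http://www.ccc.org/

# crawl-urlfilter.txt -- one accept rule per domain
+^http://([a-z0-9]*\.)*aaa.edu/
+^http://([a-z0-9]*\.)*bbb.edu/
+^http://([a-z0-9]*\.)*ccc.org/
```

With one `+^...` rule per domain, a single crawl run can cover all the listed sites while still rejecting everything off-domain.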

Re: why I cannot find this link?

2009-03-03 Thread yanky young
Hi: Why do you think Nutch can't find http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US? Actually http://app02.laopdr.gov.la/ is the same page as http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US. If you find http://app02.laopdr.gov.la in your log, the

Re: how to crawl multiple websites in each run?

2009-03-03 Thread yanky young
will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description> </property> Tony Wang wrote: that helps a lot! thanks! 2009/3/2 yanky young yanky.yo...@gmail.com Hi: I am not a Nutch expert though

Re: Keeping content fresh

2009-03-03 Thread yanky young
Hi: if you want an adaptive fetching strategy only for specific domains, you can do this: write your own AdaptiveFetchSchedule subclass, e.g. MyAdaptiveFetchSchedule extends AdaptiveFetchSchedule { void
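The class sketch above is cut off after the opening line. The domain-gating check such a subclass might perform can be sketched self-contained, without the Nutch API; the class name DomainGate, the host allow-list, and the suffix-matching rule are all illustrative assumptions, not Nutch code:

```java
import java.net.URI;
import java.util.Set;

public class DomainGate {
    // Hypothetical allow-list: apply the adaptive schedule only to these hosts.
    private static final Set<String> ADAPTIVE_HOSTS =
            Set.of("news.example.com", "blog.example.org");

    /** True if the URL's host equals an allowed host or is a subdomain of one. */
    public static boolean useAdaptiveSchedule(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) return false;
            for (String allowed : ADAPTIVE_HOSTS) {
                if (host.equals(allowed) || host.endsWith("." + allowed)) return true;
            }
            return false;
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: fall back to the default schedule
        }
    }

    public static void main(String[] args) {
        System.out.println(useAdaptiveSchedule("http://news.example.com/a.html")); // true
        System.out.println(useAdaptiveSchedule("http://other.example.net/"));      // false
    }
}
```

Inside a real MyAdaptiveFetchSchedule, a check like this would decide whether to delegate to the adaptive logic or to the parent's default behavior.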

Re: why I cannot find this link?

2009-03-03 Thread yanky young
://app02.laopdr.gov.la/ and http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US in my fetch log, but I just cannot find the page. I have doubts about dynamic pages... is that reasonable? 2009/3/3 yanky young yanky.yo...@gmail.com Hi: Why do you think Nutch

Re: why I cannot find this link?

2009-03-03 Thread yanky young
words to search, e.g. search opportunity or good opportunity, I found nothing. why? Yves 2009/3/4 yanky young yanky.yo...@gmail.com Hi: because they are actually the same page, you can only find one. Here is what I see when I use wget to fetch http://app02.laopdr.gov.la/: C

Re: what is needed to index for about 10000 domains

2009-03-03 Thread yanky young
Hi: Doug Cutting once wrote a wiki page about the hardware requirements of Nutch; you can check it out: http://wiki.apache.org/nutch/HardwareRequirements good luck yanky 2009/3/4 John Martyniak j...@beforedawn.com Regarding the machine, you could run it on anything, it all depends what kind of

Re: why I cannot find this link?

2009-03-03 Thread yanky young
in the Lucene Document look like, is there maybe a truncation or did the page not get parsed right? On Mar 3, 2009, at 6:20 PM, yanky young wrote: sorry, I have no idea about this question. I guess there must be some kind of index leakage in the Nutch indexing process. some words must be ignored

Re: Parsing/Crawler Questions..

2009-03-04 Thread yanky young
Hi: you said that you are crawling college websites and use XPath to extract class or course information. That's good. But how do you determine whether a web page is about classes or not? If you just crawled the whole web site, that must be a complete crawl and thus a complete tree. If you use some

Re: Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.

2009-03-04 Thread yanky young
You can check the Hadoop log to find a clue. good luck yanky 2009/3/5 dealmaker vin...@gmail.com I am using the nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am at the final phase of crawling, following the tutorial on the Nutch.org website. I ran the following command, and I got an exception

Re: A General suggestion: To improve effectiveness of the forums

2009-03-04 Thread yanky young
Hi: good suggestion. And I'd like to share something about asking questions in a smart way with all of you: http://www.rtfa.net/esrs-how-to-ask-questions-the-smart-way 2009/3/5 Venkateshprasanna prasanna...@yahoo.co.in A lot of queries in the forum go unanswered, either due to lack of

Re: Parsing/Crawler Questions..

2009-03-04 Thread yanky young
Check your Yahoo mailbox and find the email sent from the nutch-user mailing list; you will find any links you want there. 2009/3/5 Edward Chen czy11...@yahoo.com Hi, how to remove me from this mailing list ? please give me a link. Good Luck. From: yanky

Re: URLFilter Plugin ClassNotFoundExpections

2009-03-09 Thread yanky young
Hi: maybe you'd better paste your plugin.folders property config from nutch-site.xml. When you run nutch crawl, Nutch will load plugins as needed from plugin.folders. There are 2 places to check: 1. the plugin.folders property should be configured to $NUTCH_HOME/src/plugin or $NUTCH_HOME/build/plugins,
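For reference, the property the message is asking about might look like this in nutch-site.xml; which value is right depends on whether you run from a source checkout or a built distribution, so treat the values below as examples to adapt, not definitive settings:

```xml
<property>
  <name>plugin.folders</name>
  <!-- use build/plugins when running from a source checkout,
       or plugins (resolved against the classpath) in a built distribution -->
  <value>plugins</value>
</property>
```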

Re: Limit Nutch Crawl to Seed URLs

2009-03-14 Thread yanky young
The domain urlfilter seems to be in 1.0; maybe you can just check out this plugin's code from the 1.0 trunk and build it into your 0.9 code base. good luck yanky 2009/3/14 MyD myd.ro...@googlemail.com Where can I find the domain urlfilter? I'm using the branch 0.9... Cheers, Markus Dennis Kubes-2 wrote:

Re: The Future of Nutch

2009-03-14 Thread yanky young
Hi: I also agree that most usage scenarios of Nutch are in the vertical search area. And in some unusual cases users may not even use Nutch indexing at all; they just crawl some pages for mirroring purposes. And in some cases of vertical search, users only need a fraction of the pages, e.g. house rent

Re: synchronized File Writer

2009-03-16 Thread yanky young
Hi: it seems you are writing to an XML file from multiple threads. I guess it can be done by using a BlockingQueue from the Java 1.5 concurrency API: you just add any URL entry into the queue from multiple producer threads, and use a separate consumer thread to retrieve URL entries from the queue and write
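The producer/consumer pattern described here can be sketched with stdlib classes only. The queue capacity, the poison-pill sentinel, and the StringWriter standing in for the actual XML file writer are all illustrative choices:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UrlLogWriter {
    private static final String POISON = "__DONE__"; // sentinel that stops the consumer

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
        StringWriter sink = new StringWriter(); // stands in for a FileWriter on the XML file

        // Single consumer thread: the only thread that ever touches the writer,
        // so no synchronization on the file itself is needed.
        Thread consumer = new Thread(() -> drain(queue, sink));
        consumer.start();

        // Multiple producer threads just enqueue entries.
        Thread p1 = new Thread(() -> put(queue, "http://example.com/a"));
        Thread p2 = new Thread(() -> put(queue, "http://example.com/b"));
        p1.start(); p2.start();
        p1.join(); p2.join();

        queue.put(POISON); // tell the consumer no more entries are coming
        consumer.join();
        System.out.print(sink);
    }

    private static void put(BlockingQueue<String> q, String url) {
        try { q.put(url); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    private static void drain(BlockingQueue<String> q, Writer out) {
        try {
            for (String url = q.take(); !url.equals(POISON); url = q.take()) {
                out.write("<url>" + url + "</url>\n");
            }
        } catch (InterruptedException | IOException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because put() and take() block when the queue is full or empty, the producers and the consumer need no explicit locking of their own.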

Re: embed nutch crawl in an application

2009-03-18 Thread yanky young
Hi: you can look at the source code of the Crawl class, which can be used to start Nutch with a plain java command, without Cygwin: java -D... -classpath ... org.apache.nutch.crawl.Crawl urls -depth 10 -topN 1000 good luck yanky 2009/3/18 MyD myd.ro...@googlemail.com This is an interesting question. If you know

Re: Where to put plugin specific parameters / configurations

2009-03-18 Thread yanky young
Hi: you can put any parameters in nutch-site.xml as property settings, and read a property from your plugin class via conf.get(your property name). good luck yanky 2009/3/18 MyD myd.ro...@googlemail.com Hi @ all, where is it possible to set plugin (my own plugin) specific parameters /
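As a concrete sketch of the pattern described above: the property name below is entirely hypothetical, chosen only to show the shape; any key set here becomes readable from the plugin via conf.get with the same name:

```xml
<property>
  <!-- hypothetical key; read it in the plugin class with
       conf.get("myplugin.example.param") -->
  <name>myplugin.example.param</name>
  <value>some-value</value>
</property>
```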

Re: Incremental index update

2009-03-18 Thread yanky young
Hi: according to my understanding, in Nutch 1.0 you can configure Nutch to recrawl on a specific schedule. See this issue: http://issues.apache.org/jira/browse/NUTCH-61 and this class: AdaptiveFetchSchedule. By the way, there is no way to configure Nutch to only recrawl a changed website,
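Selecting the adaptive schedule is done through a property in nutch-site.xml; a sketch, with the caveat that you should check nutch-default.xml in your version for the exact property name and class name:

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```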

Re: index web

2009-03-19 Thread yanky young
Hi: I guess the URLs you mentioned are all directed to the same JSP or servlet; apparently they all begin with http://app02.laopdr.gov.la/ePortal/news/detail.action (e.g. detail.action?id=10110&from=ePortal_NewsDetail_FromHome). The difference is the request_locale

Re: index web

2009-03-19 Thread yanky young
/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome%0A in the urls directory 2009/3/19 yanky young yanky.yo...@gmail.com Hi: I guess the URLs you mentioned are all directed to the same JSP or servlet; apparently they all begin with http://app02.laopdr.gov.la/ePortal

Re: index web

2009-03-20 Thread yanky young
reach those two urls so I am worried. 2009/3/20 yanky young yanky.yo...@gmail.com That must work, but it seems weird. You know, from the seed URL you gave, Nutch will crawl outward, and the whole set of crawled pages is actually a tree. The root node is the seed URL. if you can

Re: index web

2009-03-20 Thread yanky young
in the url.txt? 2009/3/20 yanky young yanky.yo...@gmail.com I think my guess is right. I just looked at the code of that page. Those two URLs are generated by a JavaScript function: function jump(lan). In this case, Nutch might not be smart enough to recognize this kind of generated URL, but if you

Re: Nutch Trunk Java requirement

2009-03-25 Thread yanky young
Hi: I got the same error, and after I installed JDK 1.6 it worked. It seems a bit weird, because I see the javac requirement in build.xml is 1.5, but it still broke. I guess the Hadoop jar was compiled with Java 1.6 and its class-file compatibility level is 1.6, so you can't run it on Java 1.5. yanky

Re: How to find out the encoding and format of the content stored in the index?

2009-04-05 Thread yanky young
hi: there is an index-more plugin that indexes some information about the content type. You can have a look. 2009/4/5 dealmaker vin...@gmail.com Hi, I am trying to find out the encoding and format of the content stored in the index. I modified the code in BasicIndexFilter.java to store the

Re: How to find out the encoding and format of the content stored in the index?

2009-04-05 Thread yanky young
dealmaker vin...@gmail.com Thanks. Is there a similar thing for encoding? I don't want it to re-detect the encoding again, for performance reasons. yanky young wrote: hi: there is an index-more plugin that indexes some information about the content type. You can have a look. 2009/4/5

nutch 0.9 protocol-file plugin break with windows file name that contains space

2009-04-06 Thread yanky young
Hi: I'm using nutch 0.9 as the base for a project. When I use local files on a Windows XP SP2 system to test, I found that the protocol-file plugin just breaks. For example: String url = "file:///C:/cygwin/home/data/train/cv/Brendan%20O'Leary%20CV%20html.html"; try { ProtocolOutput
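The breakage here revolves around spaces in the file name. Independent of Nutch, one way to sidestep it is to build the file: URL from the raw path so that spaces and other illegal characters are percent-encoded up front; a minimal sketch using only the JDK (the path is illustrative):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FileUrl {
    /** Builds a file: URI from a raw path, percent-encoding spaces and other
     *  characters that are illegal in a URL. */
    public static String toFileUrl(String rawPath) throws URISyntaxException {
        // The multi-argument URI constructor quotes illegal characters,
        // so a space becomes %20 instead of breaking the URL.
        return new URI("file", null, rawPath, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(toFileUrl("/C:/cygwin/home/data/Brendan O'Leary CV.html"));
    }
}
```

Encoding before handing the URL to the protocol plugin avoids the space ever reaching the file-URL parsing code.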

Re: Problem crawling BBC Hindi Site

2009-04-06 Thread yanky young
Hi: if you just use the nutch crawl command, you should put your domain names in crawl-urlfilter.txt like this: +^http://([a-z0-9]*\.)bbc.co.uk/hindi or +^http://www.bbc.co.uk/hindi good luck 2009/4/6, Ankur Garg garg.ankur.2...@gmail.com: Hi All, I am trying to crawl the BBC Hindi site

Re: nutch-1.0 distribution config problem

2009-04-06 Thread yanky young
If Nutch crawled some pages, there should be some fetch log lines on stdout like this: fetching http://www.law.harvard.edu/library/special/visit/reading-room-rules.html Check your hadoop.log to see if there are lines like the above, or change your log4j.properties to set the debug level for the Fetcher
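Enabling debug output for the fetcher might look like the following line in conf/log4j.properties; the logger name assumes the Fetcher class lives in the org.apache.nutch.fetcher package, so verify the package against your Nutch version:

```properties
# raise the Fetcher class to DEBUG so every fetch attempt is logged
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
```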

Re: Crawler Output Flat file or Database?

2009-04-06 Thread yanky young
Hi: It is wiser to store files in DFS rather than in a database. A database is for structured data or data with a schema, and flat files are also not good for large-scale data storage. DFS provides out-of-the-box replication for fault tolerance, and what's more, the MapReduce framework can be used on DFS to

why nutch repeat fetching some pages

2009-04-07 Thread yanky young
Hi guys: I am using Nutch in a project, but I found that Nutch repeatedly fetches some pages. For example: http://www.me.washington.edu//people/faculty/wang/ this is a page that was fetched. But also, there are some URLs like this in the command-line output: http://www.me.washington.edu//people/faculty/wang/

Re: why nutch repeat fetching some pages

2009-04-08 Thread yanky young
skovacevi...@gmail.com you can disable this in the url-filter file; it is disabled by default. You ran into a loop on that site. On Wed, Apr 8, 2009 at 7:32 AM, yanky young yanky.yo...@gmail.com wrote: Hi guys: I am using Nutch in a project, but I found that Nutch repeatedly fetches some

Re: Nutch can't find all files

2009-04-08 Thread yanky young
Hi: Of course you can look into the code and add some debug lines for your case. Just look at the protocol-file plugin, which is supposed to process the file:// scheme. You can find this plugin's code in ${nutch_home}/src/plugin/protocol-file. And as for the Nutch fetch list, you can dump the crawldb with nutch readdb

Re: java heap space error

2009-04-09 Thread yanky young
Why not just add -Xms/-Xmx JVM parameters to see if it still happens? 2009/4/9 srinivas jaini srinivasja...@gmail.com I've checked out the code and am running a crawl and get this error; any thoughts? environment: java 6, eclipse 2009-04-08 01:22:41,658 INFO crawl.Injector

Re: number of fetcher threads per host?

2009-04-09 Thread yanky young
Hi: in nutch-site.xml you can define these properties:
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>

Re: app question....

2009-04-09 Thread yanky young
Hi: I am not sure I understand your question. Can you give more details about your application? Your data is a list of classes and some faculty info, so what is the structure or schema of this data? Does the data come from a database? If all of your data is web pages, here is my hint: use some kind

Re: fetcher issues

2009-04-12 Thread yanky young
Hi: I have encountered a similar problem with local Windows file system search with nutch 0.9. You can see my post here: http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html. Hope it helps. good luck yanky 2009/4/13 Fadzi

Re: fetcher issues

2009-04-12 Thread yanky young
the db.max.outlinks.per.page to 1000 from 100, and I started getting exactly 500 documents instead of the 600 or so. So I changed it to -1 and I am still getting 500 docs! Not sure what's going on here. On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote: Hi: I have encountered a similar problem with local

Re: Can't build Nutch

2009-04-20 Thread yanky young
Hi: just use JDK 1.6 instead; that will be fine. 2009/4/20 Filipe Antunes fantu...@tecnica.cc I can't build Nutch with Ant. My Ant version is 1.7.1 and I'm on Mac OS X 10.4 PowerPC. My Java version is 1.5.0. I can't figure out why I'm getting the error class file has wrong version 50.0,

Re: Topical/focus URL scoring

2009-05-13 Thread yanky young
Hi: I did focused crawling with Nutch a few months ago. What I did was override some methods of the scoring-opic plugin before and after parsing, just as Krugler said. I customized the scoring metadata, and I even managed to integrate a text classifier such as a Bayesian classifier to

Re: Topical/focus URL scoring

2009-05-14 Thread yanky young
and what they are for, really. Is there any documentation that I should read first? -Ray- 2009/5/14 yanky young yanky.yo...@gmail.com Hi: I did focused crawling with Nutch a few months ago. What I did was override some methods of the scoring-opic plugin before and after parsing

Re: Crawling blogs, feeds comments

2009-06-12 Thread yanky young
Hi: Maybe you just need to add a URL filter to your regex-urlfilter.txt configuration file. And if the feeds are in RSS or Atom format, you should activate the parse-rss plugin: just add it into the plugins part of your nutch-site.xml. good luck yanky 2009/6/11 Xalan aaven...@gmail.com Regards, I