Re: [Nutch-dev] Reviving Nutch 0.7

2007-01-22 Thread AJ Chen
as possible so that I can easily upgrade the application to keep up with new nutch release. Keeping away from the newest nutch version is somewhat backward to me. AJ -- AJ Chen, PhD Palo Alto, CA http://web2express.org

Re: [Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-22 Thread AJ Chen
-- AJ Chen, PhD Palo Alto, CA http

Re: [Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-22 Thread AJ Chen
sec ratio seems very low to me. How big was your crawldb when you started and how big was it at end? What kind of filters and normalizers are you using? -- Sami Siren AJ Chen wrote: I checked out the code from trunk after Sami committed the change. I started out a new crawl db and run

Re: [Nutch-dev] [jira] Resolved: (NUTCH-395) Increase fetching speed

2006-11-13 Thread AJ Chen
/jira -- AJ Chen, PhD http://web2express.org

[Nutch-dev] [jira] Created: (NUTCH-398) map-reduce very slow when crawling on single server

2006-11-07 Thread AJ Chen (JIRA)
Affects Versions: 0.8.1 Environment: linux and windows Reporter: AJ Chen This seems to be a bug, so I created a ticket here. I'm using nutch 0.9-dev to crawl the web on one linux server. With the default hadoop configuration (local file system, no distributed crawling), the Generator

[Nutch-dev] how to minimize reduce operations when using single machine

2006-10-27 Thread AJ Chen
I use 0.9-dev code and the local file system to crawl on a single machine. After fetching pages, nutch spends a huge amount of time doing "reduce > sort" and "reduce > reduce". This is not necessary since it uses only the local file system. I'm not familiar with the map-reduce code, but guess it may be
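
The sort/reduce passes are part of how the Generator and updatedb jobs work, so they cannot be skipped outright even on one machine, but the local setup can at least be tuned. A hedged hadoop-site.xml sketch using property names from Hadoop of this era (the io.sort.factor value is illustrative, not a recommendation):

```xml
<!-- hadoop-site.xml sketch for single-machine crawling -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value> <!-- run map/reduce in-process, no job tracker -->
  </property>
  <property>
    <name>fs.default.name</name>
    <value>local</value> <!-- local file system, no DFS -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>25</value> <!-- merge more spill files per pass during the sort -->
  </property>
</configuration>
```

With mapred.job.tracker=local the reduces run sequentially in the same JVM, which is why the "reduce > sort" status lines dominate the wall clock on large segments.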

Re: [Nutch-dev] log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen
This is solved. I accidentally put log4j.properties into ROOT\WEB-INF\classes. -aj On 9/7/06, AJ Chen [EMAIL PROTECTED] wrote: I'm customizing 0.9-dev code for my vertical search engine. After rebuilding the nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error when starting

[Nutch-dev] log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen
I'm customizing 0.9-dev code for my vertical search engine. After rebuilding the nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error when starting Tomcat: log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: \ (The system cannot find the path specified)
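
The setFile(null,true) error typically means a FileAppender whose File option resolved to null. Nutch's stock conf/log4j.properties builds the path from system properties that bin/nutch sets but Tomcat does not; a sketch of the kind of appender definition that triggers it:

```properties
# Sketch of a Nutch-style log4j.properties; the File option below
# resolves to null when hadoop.log.dir / hadoop.log.file are not
# defined as system properties (e.g. inside a stock Tomcat start).
log4j.rootLogger=INFO,DRFA
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```

Either define the properties at Tomcat startup (e.g. -Dhadoop.log.dir=/var/log/nutch -Dhadoop.log.file=hadoop.log, paths hypothetical) or give the webapp its own log4j.properties with a console appender, as the follow-up in this thread found.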

[Nutch-dev] fetcher status missing in log file

2006-08-30 Thread AJ Chen
I'm using nutch-0.9-dev from svn. hadoop.log has records from fetching except the status line. Is there a setting required to print the fetch status line? The status is set in Fetcher.java: report.setStatus(string), but where does the report object print the status? thanks, -- AJ Chen http
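
As far as I can tell, the string passed to setStatus feeds the Hadoop task reporter (visible in the task UI or the LocalJobRunner's own status logging), not the Nutch logger directly, so it does not necessarily appear in hadoop.log. To see the fetcher's own per-page log lines, make sure its logger level is low enough, e.g. in conf/log4j.properties:

```properties
# Hedged sketch: raise the fetcher's logger so its per-page
# "fetching <url>" lines reach hadoop.log.
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
```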

Re: [Nutch-dev] Crawl error

2006-07-10 Thread AJ Chen
Groschupf [EMAIL PROTECTED] wrote: Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan Am 09.07.2006 um 21:47 schrieb AJ Chen: I checked out the 0.8 code from trunk and tried to set

[Nutch-dev] Crawl error

2006-07-09 Thread AJ Chen
I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using args urls -dir crawl -depth 3 -topN 50, I got the following error, which started from LogFactory.getLog( Crawl.class). Any idea what file was not found? There is a url file under

Re: [Nutch-dev] does nutch follow HEAD link element?

2006-06-16 Thread AJ Chen
will be searched from the same nutch search interface. Thanks, AJ On 6/16/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: AJ Chen wrote: I'm about to use nutch to crawl semantic data. Links to semantic data files (RDF, OWL, etc.) can be placed in two places: (1) HEAD link; (2) BODY a href Does

[Nutch-dev] Re: problems http-client

2006-01-06 Thread AJ Chen
I have started to see this problem recently. topN=20 per crawl, but fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000 pages are missing. this is reproducible with nutch0.7.1, both protocol-http and protocol-httpclient are included. I also see lots of Response content

[Nutch-dev] how to add additional factor at search time to ranking score

2005-12-31 Thread AJ Chen
My vertical search application will use additional factor for page ranking, which is given to each page at search time. I'm trying to figure out a good way to integrate this additional dynamic factor into the nutch score. I'll appreciate any suggestions or pointers. It would be great if I
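
One simple way to fold a per-page, query-time factor into ranking is to multiply it into the base relevance score, with a weight controlling its influence. A toy sketch under that assumption; the class and method names are mine, not part of the Nutch API:

```java
// Toy sketch of folding a query-time, per-page boost into a base
// relevance score. Names are illustrative, not Nutch API.
public class ScoreBoost {

    /**
     * Combine the index-time relevance score with an external factor.
     * weight in [0,1] controls how much the factor matters:
     * 0 ignores it entirely, 1 multiplies it in fully.
     */
    public static float adjust(float baseScore, float factor, float weight) {
        return baseScore * (float) Math.pow(factor, weight);
    }

    public static void main(String[] args) {
        // weight 0: the external factor is ignored
        System.out.println(adjust(2.0f, 10.0f, 0.0f)); // prints 2.0
        // weight 1: the factor multiplies the score fully
        System.out.println(adjust(2.0f, 10.0f, 1.0f)); // prints 20.0
    }
}
```

The power-weighting form keeps the factor's effect tunable without re-indexing; where exactly to hook this into Nutch's query path is the open question the post is asking.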

[Nutch-dev] Re: severe error in fetch

2005-12-30 Thread AJ Chen
connection pool problem in httpclient? If yes, I can filter out url containing these trouble ports before httpclient is fixed. Thanks, AJ On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: AJ Chen wrote: Stefan, Here is the trace in my log. My SSFetcher (for site-specific fetch) is the same

[Nutch-dev] severe error in fetch

2005-12-25 Thread AJ Chen
I have repeatedly seen the following severe errors during fetching 400,000 pages with 200 threads. What may cause "Host connection pool not found"? This type of error must be avoided, otherwise the fetcher will stop prematurely. 051224 075950 SEVERE Host connection pool not found,

[Nutch-dev] Re: severe error in fetch

2005-12-25 Thread AJ Chen
) at vscope.crawl.SSCrawler.main(SSCrawler.java:251) Thanks, AJ On 12/25/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, Can you provide a detailed stacktrace from the log file? Stefan Am 25.12.2005 um 23:38 schrieb AJ Chen: I have seen repeatedly the following severe errors during fetching 400,000 pages

[Nutch-dev] java open source software for Tagging ?

2005-11-07 Thread AJ Chen
Although tagging is not directly related to nutch, I think combining nutch search with the ability to tag search result pages would be quite powerful. Has anyone implemented tagging on a nutch search site? Is there a java open source package for tagging? AJ

[Nutch-dev] debug JSP with eclipse

2005-10-30 Thread AJ Chen
I'm using eclipse for nutch java code and trying to set up eclipse for debugging JSP pages. I have the WST plugin installed, created a new dynamic web project called nutch071web, and imported all the webcontent and jars. But it failed to run the index.jsp page; see the error message below. Is anyone

[Nutch-dev] merge indices from multiple webdb

2005-10-25 Thread AJ Chen
Has anyone merged indices from two separate webdb? I have two separate webdb and need to find a good way to combine them for unified search. AJ

[Nutch-dev] Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 25, 2005 4:03 PM To: nutch-dev@lucene.apache.org Subject: merge indices from multiple webdb Has anyone merged indices from two separate webdb? I have two separate webdb and need to find a good way to combine them

[Nutch-dev] Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
and then build one more segment again. Thank you, Andrey -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 25, 2005 2:02 PM To: nutch-dev@lucene.apache.org Subject: Re: merge indices from multiple webdb Thanks so much, Graham. This should do it. A related
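
Besides physically merging the indexes, the search webapp can be pointed at a single directory that holds the combined index and segments via the searcher.dir property. A hedged nutch-site.xml sketch (the path is hypothetical; the expected directory layout is from the 0.7-era docs):

```xml
<!-- nutch-site.xml sketch: serve unified search from one crawl
     directory containing the merged index/ and the segments/. -->
<property>
  <name>searcher.dir</name>
  <value>/data/crawl-merged</value>
</property>
```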

[Nutch-dev] how to make fetcher to use the full bandwidth

2005-10-13 Thread AJ Chen
I try to fetch as fast as possible by using more threads on a large fetch list. But the fetcher starts downloading at a speed much lower than the full bandwidth allows. And the initial download speed varies a lot from run to run, 200kb/s to 1200kb/s on my DSL line. This variation also happens on a T1 line
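
Throughput is bounded not only by the thread count but by the host diversity of the fetch list: politeness limits cap requests per host, so a list dominated by a few hosts cannot saturate the line no matter how many threads run. The relevant knobs, as a nutch-site.xml sketch (values are illustrative, not recommendations):

```xml
<!-- Fetcher knobs that bound download speed (conf/nutch-site.xml). -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value> <!-- total fetcher threads -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value> <!-- parallel requests to any one host -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value> <!-- seconds between requests to the same host -->
</property>
```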

[Nutch-dev] Re: how to make fetcher to use the full bandwidth

2005-10-13 Thread AJ Chen
, Rod Taylor [EMAIL PROTECTED] wrote: On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote: I try to fetch as fast as it can by using more threads on a large fetch list. But, the fetcher starts download at speed much lower than the full bandwidth allows. And the start download speed varies a lot

[Nutch-dev] Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread AJ Chen
Fuad, Several days for 120,000 pages? That's very slow. Could you show some status lines in the log file? (grep status:) What's the bandwidth you have? -AJ On 10/11/05, Fuad Efendi (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Fuad Efendi updated

[Nutch-dev] fetch speed issue

2005-10-10 Thread AJ Chen
Another observation: when the same size fetch list and same number of threads were used, the fetcher started at different speeds in different runs, ranging from 200kb/s to 1200kb/s. I'm using DSL at home, so this variation in download speed could be due to the variation in the DSL connection. If using

[Nutch-dev] Re: what contributes to fetch slowing down

2005-10-02 Thread AJ Chen
several days at current speed - just too slow. I'm planning to get more bandwidth. Could someone share their experience on what stable rate (pages/sec) can be achieved using 3 mbps or 10 mbps inbound connection? Thanks, AJ On 9/28/05, AJ Chen [EMAIL PROTECTED] wrote: I started the crawler

[Nutch-dev] what contributes to fetch slowing down

2005-09-28 Thread AJ Chen
I started the crawler with about 2000 sites. The fetcher could achieve 7 pages/sec initially, but the performance gradually dropped to about 2 pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages and I used 500 threads. What are the main causes of this slowing down? Below

[Nutch-dev] Re: saving log file

2005-09-21 Thread AJ Chen
Jerome, thanks a lot. This is helpful. -AJ Jérôme Charron wrote: Following the tutorial, I redirect the log messages to a log file. But, when crawling 1 million pages, this log file can become hugh and writing log messages to a huge file can slow down the fetching process. Is there a better

[Nutch-dev] saving log file

2005-09-20 Thread AJ Chen
Following the tutorial, I redirect the log messages to a log file. But when crawling 1 million pages, this log file can become huge, and writing log messages to a huge file can slow down the fetching process. Is there a better way to manage the log? Maybe saving it to a series of smaller
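
Nutch 0.7 logged through java.util.logging, which supports exactly this kind of rotation out of the box. A hedged logging.properties sketch (file pattern and sizes are illustrative):

```properties
# Rotate the crawl log: ~10 MB per file, keep 20 files, instead of
# one ever-growing log. %g in the pattern is the generation number.
handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.pattern = logs/fetch-%g.log
java.util.logging.FileHandler.limit = 10000000
java.util.logging.FileHandler.count = 20
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
```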

[Nutch-dev] Re: how to reuse webDB with new urls

2005-09-14 Thread AJ Chen
/13/05, Michael Ji [EMAIL PROTECTED] wrote: I think this scenario will work. Just a bit worried about the filter performance if the number of domain sites is on the scale of hundreds of thousands. Michael Ji --- AJ Chen [EMAIL PROTECTED] wrote: Once I create a webDB, can I inject new root urls

[Nutch-dev] how to reuse webDB with new urls

2005-09-13 Thread AJ Chen
Once I create a webDB, can I inject new root urls into the same webDB repeatedly? After each injection, run as many cycles of generate/fetch/updatedb as needed to fetch all web pages from the new sites. I think this will allow me to gradually build a comprehensive vertical site. Any comment or suggestion?

[Nutch-dev] Re: fetch performance

2005-09-10 Thread AJ Chen
Andrzej, Thanks. A related question: Some of the sites I crawl use https: or redirect to https:. Nutch default setting does not recognize https: as valid url. Is there a way to crawl url starting with https:? -AJ Andrzej Bialecki wrote: AJ Chen wrote: Hi Andrzej, Thanks
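
Two pieces are typically needed for https crawling: the url filter must accept https URLs (e.g. change the accept rule in conf/crawl-urlfilter.txt to `+^https?://`), and the fetcher must use protocol-httpclient, since the plain protocol-http plugin speaks only http. A hedged nutch-site.xml sketch (the plugin list is abbreviated; keep whatever other plugins the setup already uses):

```xml
<!-- Swap in protocol-httpclient, which handles https as well as http. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```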

[Nutch-dev] Re: fetch performance

2005-09-10 Thread AJ Chen
, 050910 150341 fetch of http://www.cellsciences.com/content/c2-contact.asp failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpException: Not an HTTP url:https://www.cellsciences.com/content/c2-contact.asp Any idea what happens? -AJ Andrzej Bialecki wrote: AJ Chen wrote

[Nutch-dev] Re: db.max.outlinks.per.page is misunderstood?

2005-09-07 Thread AJ Chen
My understanding is that only up to the maximum number of outlinks are processed for a page when updating the web db. I assume the same page won't get fetched and processed again in the next fetch/update cycles, then you won't get those outlinks exceeding the maximum number no matter how many

[Nutch-dev] Re: db.max.outlinks.per.page is misunderstood?

2005-09-07 Thread AJ Chen
Jack, Set the max to 100, but run 10 cycles (i.e., depth=10) with the CrawlTool. You may see all the outlinks are collected toward the end. 3 cycles is usually not enough. -AJ Jack Tang wrote: Yes, Stefan. But it missed some URLs, and I set the value to 3000, then everything is OK /Jack
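
For reference, the property under discussion lives in conf/nutch-site.xml (overriding nutch-default.xml), and 100 is the shipped default. It caps the outlinks taken from a page per db update, not across the whole crawl, which is why more cycles eventually pick up the rest:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
</property>
```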

[Nutch-dev] Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
I'm also thinking about implementing an automated workflow of fetchlist-crawl-updateDb-index. Although my project may not require NDFS because it only concerns deep crawling of 100,000 sites, an appropriate workflow is still needed to automatically take care of failed urls, newly-added

[Nutch-dev] Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it seems that a new urlfilter is a good place to extend the inclusion regex capability. The new urlfilter will be defined by urlfilter.class property, which gets loaded by the URLFilterFactory. Regex is necessary because you
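
The stock regex url filter already supports inclusion lists of this kind: its rules file is evaluated top-down and the first matching rule wins. A sketch of conf/crawl-urlfilter.txt for site-restricted crawling (the domains are hypothetical placeholders):

```text
# conf/crawl-urlfilter.txt sketch - first matching rule wins.
# skip non-page resources
-\.(gif|jpg|png|css|js|zip|gz)$
# accept only the selected sites (hypothetical domains)
+^http://([a-z0-9]*\.)*example-site-a\.com/
+^http://([a-z0-9]*\.)*example-site-b\.org/
# reject everything else
-.
```

With hundreds of thousands of domains, a flat regex list becomes the bottleneck the thread above worries about, which is the motivation for a custom urlfilter plugin backed by a hash lookup instead.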

[Nutch-dev] Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
for a public beta. I'll be sure to post here when we're finally open for business. :) --Matt On Sep 2, 2005, at 11:43 AM, AJ Chen wrote: From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it seems that a new urlfilter is a good place to extend the inclusion regex capability

[Nutch-dev] [jira] Created: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2005-09-02 Thread AJ Chen (JIRA)
-platform Reporter: AJ Chen There is a gap between whole-web crawling and single (or handful) site crawling. Many applications actually fall into this gap, which usually requires crawling a large number of selected sites, say 10 domains. The current CrawlTool is designed for a handful of sites. So

[Nutch-dev] manage crawling cycles and progress

2005-09-01 Thread AJ Chen
Seeded with a list of urls, the nutch whole-web crawler is going to take an unknown number of cycles of generate/fetch/updatedb to drive to some level of completeness, both for internal links and outlinks. It's crucial to monitor the progress. I'll appreciate some suggestions or best

[Nutch-dev] junit test failed

2005-08-28 Thread AJ Chen
FAILED nutch\trunk\build.xml:173: Could not create task or type of type: junit. Did I miss anything for junit? Appreciate your help. AJ Chen

[Nutch-dev] Re: junit test failed

2005-08-28 Thread AJ Chen
codes. Apparently, the command ant test does not work. Does anybody have an idea how to make the unit test work? AJ Michael Ji wrote: What is the junit test standing for? A particular patch? Sorry if my question is silly. Michael Ji, --- AJ Chen [EMAIL PROTECTED] wrote: I'm a newcomer, trying

[Nutch-dev] Re: junit test failed

2005-08-28 Thread AJ Chen
Regards, Fuad Efendi -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Sunday, August 28, 2005 9:01 PM To: nutch-dev Subject: junit test failed I'm a newcomer, trying to test Nutch for vertical search. I downloaded the code and compiled it in cygwin. But the unit

[Nutch-dev] Re: junit test failed

2005-08-28 Thread AJ Chen
: ANT_HOME/lib/ant-junit.jar And, copy junit-3.8.1.jar file into apache-ant-1.6.3\lib -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Monday, August 29, 2005 12:00 AM To: nutch-dev@lucene.apache.org Subject: Re: junit test failed I'm using ant1.6.5, which has junit.jar