as possible so
that I can easily upgrade the application to keep up with new nutch releases.
Staying behind the newest nutch version seems backward to me.
AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
sec ratio seems
very low to me.
How big was your crawldb when you started and how big was it at end?
What kind of filters and normalizers are you using?
--
Sami Siren
AJ Chen wrote:
I checked out the code from trunk after Sami committed the change. I
started a new crawl db and ran
Affects Versions: 0.8.1
Environment: linux and windows
Reporter: AJ Chen
This seems to be a bug, so I created a ticket here. I'm using nutch 0.9-dev to
crawl the web on one linux server. With the default hadoop
configuration (local file system, no distributed crawling), the Generator
I use the 0.9-dev code and the local file system to crawl on a single machine.
After fetching pages, nutch spends a huge amount of time doing the reduce sort
and reduce phases. This is not necessary since it uses only the
local file system. I'm not familiar with the map-reduce code, but guess it may
be
This is solved. I accidentally put log4j.properties into
ROOT\WEB-INF\classes.
-aj
I'm customizing the 0.9-dev code for my vertical search engine. After rebuilding
nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error
when starting Tomcat:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
I'm using nutch-0.9-dev from svn. hadoop.log has records from fetching
except the status line. Is there a setting required to print the fetch
status line? The status is set in Fetcher.java via report.setStatus(string),
but where does the report object print the status?
thanks,
Groschupf [EMAIL PROTECTED] wrote:
Try putting the conf folder on your classpath in eclipse and set the
environment variables that are set in bin/nutch.
Btw, please do not crosspost.
Thanks.
Stefan
Am 09.07.2006 um 21:47 schrieb AJ Chen:
I checked out the 0.8 code from trunk and tried to set it up in eclipse.
When trying to run Crawl from eclipse using args urls -dir crawl -depth 3
-topN 50, I got the following error, which started from
LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under
will be searched from the same nutch search interface.
Thanks,
AJ
On 6/16/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
AJ Chen wrote:
I'm about to use nutch to crawl semantic data. Links to semantic data files
(RDF, OWL, etc.) can be placed in two places: (1) a HEAD link element; (2) a
BODY a href. Does
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000
pages are missing. This is reproducible with nutch 0.7.1; both protocol-http
and protocol-httpclient are included.
I also see lots of Response content
My vertical search application will use an additional factor for page
ranking, which is given to each page at search time. I'm trying to
figure out a good way to integrate this additional dynamic factor into
the nutch score. I'll appreciate any suggestions or pointers.
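One simple way to blend such a dynamic factor with the index-time score is a multiplicative boost applied when results are ranked. A minimal sketch in Python (this is not Nutch code; combined_score, rerank, and the weighting scheme are illustrative assumptions):

```python
# Hypothetical sketch: blend a static index score with a per-page
# dynamic factor at search time. All names here are illustrative.

def combined_score(index_score, dynamic_factor, weight=0.5):
    """Boost the base score by a dynamic factor assumed to be in [0, 1];
    weight controls the maximum relative boost."""
    if not 0.0 <= dynamic_factor <= 1.0:
        raise ValueError("dynamic_factor must be in [0, 1]")
    return index_score * (1.0 + weight * dynamic_factor)

def rerank(hits, factors, weight=0.5):
    """Re-rank (doc, score) pairs using each doc's dynamic factor
    (missing docs default to a factor of 0)."""
    scored = [(doc, combined_score(s, factors.get(doc, 0.0), weight))
              for doc, s in hits]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With this scheme a page whose dynamic factor is high can overtake a slightly better static match, while pages without any dynamic signal keep their original ordering.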
It would be great if I
connection pool
problem in httpclient? If yes, I can filter out urls containing these
troublesome ports before httpclient is fixed.
Thanks,
AJ
On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
AJ Chen wrote:
Stefan,
Here is the trace in my log. My SSFetcher (for site-specific fetch) is
the
same
I have repeatedly seen the following severe errors while fetching
400,000 pages with 200 threads. What may cause Host connection pool
not found? This type of error must be avoided; otherwise the fetcher
will stop prematurely.
051224 075950 SEVERE Host connection pool not found,
)
at vscope.crawl.SSCrawler.main(SSCrawler.java:251)
Thanks,
AJ
On 12/25/05, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
Can you provide a detailed stacktrace from the log file?
Stefan
Am 25.12.2005 um 23:38 schrieb AJ Chen:
I have seen repeatedly the following severe errors during fetching
400,000 pages
Although tagging is not directly related to nutch, I think combining nutch
search and the ability to tag search result pages would be quite powerful.
Has anyone implemented tagging on a nutch search site? Is there a java open
source package for the tagging function?
AJ
I'm using eclipse for the nutch java code and trying to set up eclipse for
debugging JSP pages. I have the WST plugin installed, created a new dynamic
web project called nutch071web, and imported all the web content and jars.
But it failed to run the index.jsp page; see the error message below. Is anyone
Has anyone merged indices from two separate webdb? I have two separate webdb
and need to find a good way to combine them for unified search.
AJ
-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 4:03 PM
To: nutch-dev@lucene.apache.org
Subject: merge indices from multiple webdb
Has anyone merged indices from two separate webdb? I have two
separate webdb and need to find a good way to combine them
and then build one more segment again.
Thank you,
Andrey
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 2:02 PM
To: nutch-dev@lucene.apache.org
Subject: Re: merge indices from multiple webdb
Thanks so much, Graham. This should do it.
A related
I try to fetch as fast as possible by using more threads on a large fetch
list. But the fetcher starts downloading at a speed much lower than the full
bandwidth allows. And the starting download speed varies a lot from run to run,
200kb/s to 1200kb/s on my DSL line. This variation also happens on a T1 line
, Rod Taylor [EMAIL PROTECTED] wrote:
On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote:
I try to fetch as fast as it can by using more threads on a large fetch
list. But, the fetcher starts download at speed much lower than the full
bandwidth allows. And the start download speed varies a lot
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep status:) What's the bandwidth you have?
-AJ
On 10/11/05, Fuad Efendi (JIRA) [EMAIL PROTECTED] wrote:
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Fuad Efendi updated
Another observation: when the same size fetch list and same number of
threads were used, the fetcher started at different speed in different runs,
ranging from 200kb/s to 1200kb/s. I'm using DSL at home, so this variation
in download speed could be due to the variation in the DSL connection. If using
several days at current
speed - just too slow. I'm planning to get more bandwidth. Could someone
share their experience on what stable rate (pages/sec) can be achieved using
3 mbps or 10 mbps inbound connection?
Thanks,
AJ
On 9/28/05, AJ Chen [EMAIL PROTECTED] wrote:
I started the crawler with about 2000 sites. The fetcher could achieve
7 pages/sec initially, but the performance gradually dropped to about 2
pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
and I used 500 threads. What are the main causes of this slowing down?
Below
Jerome, thanks a lot. This is helpful.
-AJ
Jérôme Charron wrote:
Following the tutorial, I redirect the log messages to a log file. But
when crawling 1 million pages, this log file can become huge, and writing
log messages to a huge file can slow down the fetching process. Is
there a better way to manage the log? Maybe saving it to a series of
smaller
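One common way to keep the log from growing without bound is a size-bounded rolling appender. A hedged sketch of a log4j.properties override (the appender name, file path, and limits are illustrative, not Nutch defaults):

```properties
# Sketch: cap the log at 100MB per file, keep up to 10 rotated files.
log4j.rootLogger=INFO,ROLL
log4j.appender.ROLL=org.apache.log4j.RollingFileAppender
log4j.appender.ROLL.File=logs/hadoop.log
log4j.appender.ROLL.MaxFileSize=100MB
log4j.appender.ROLL.MaxBackupIndex=10
log4j.appender.ROLL.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLL.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```

This effectively produces the "series of smaller files" asked about: log4j rotates hadoop.log into hadoop.log.1, hadoop.log.2, and so on, dropping the oldest once the backup index is exceeded.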
/13/05, Michael Ji [EMAIL PROTECTED] wrote:
I think this scenario will work.
I'm just a bit worried about the filter performance if the number of
domain sites is on the scale of hundreds of
thousands.
Michael Ji
--- AJ Chen [EMAIL PROTECTED] wrote:
Once I create a webDB, can I inject new root urls into the same webDB
repeatedly? After each injection, run as many cycles of
generate/fetch/updatedb as needed to fetch all web pages from the new sites.
I think this will allow me to gradually build a comprehensive vertical site.
Any comment or suggestion?
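The repeated inject-then-cycle workflow described above can be sketched as a shell loop (0.8-style crawldb/segments layout; the new_urls directory, paths, and depth are illustrative and would need adjusting for other Nutch versions):

```shell
#!/bin/sh
# Sketch: inject newly added root urls into an existing crawldb, then
# run several generate/fetch/updatedb cycles to pull in the new sites.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
DEPTH=5

bin/nutch inject $CRAWLDB new_urls

i=1
while [ $i -le $DEPTH ]; do
  bin/nutch generate $CRAWLDB $SEGMENTS
  # Pick the segment just created (newest directory under $SEGMENTS).
  SEGMENT=$SEGMENTS/`ls -t $SEGMENTS | head -1`
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  i=`expr $i + 1`
done
```

Since updatedb merges each segment's new outlinks back into the same crawldb, re-running this script after every injection incrementally grows one unified db rather than creating a fresh crawl each time.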
Andrzej, Thanks.
A related question: some of the sites I crawl use https: or redirect to
https:. Nutch's default setting does not recognize https: as a valid url.
Is there a way to crawl urls starting with https:?
-AJ
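Nutch can fetch https urls if a protocol plugin that understands them is enabled. A sketch of a nutch-site.xml override, assuming the protocol-httpclient plugin (the exact plugin list shown is illustrative; it should mirror whatever else your installation already enables):

```xml
<!-- Sketch: replace protocol-http with protocol-httpclient, which
     also handles https. Adjust the rest of the list to your setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

Note that the url filter rules also have to let https urls through; an include pattern written as ^http:// would still reject them even with the plugin enabled.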
Andrzej Bialecki wrote:
AJ Chen wrote:
Hi Andrzej,
Thanks,
050910 150341 fetch of http://www.cellsciences.com/content/c2-contact.asp failed with:
java.lang.Exception: org.apache.nutch.protocol.http.HttpException: Not an HTTP
url:https://www.cellsciences.com/content/c2-contact.asp
Any idea what happens?
-AJ
Andrzej Bialecki wrote:
AJ Chen wrote
My understanding is that only up to the maximum number of outlinks is
processed for a page when updating the web db. I assume the same page
won't get fetched and processed again in the next fetch/update cycles, so
you won't get those outlinks exceeding the maximum number no matter
how many
Jack,
Set the max to 100, but run 10 cycles (i.e., depth=10) with the
CrawlTool. You may see that all the outlinks are collected toward the end.
Three cycles is usually not enough.
-AJ
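If the per-page cap itself is the limiting factor, it can also be raised directly. A sketch for nutch-site.xml, assuming the db.max.outlinks.per.page property from nutch-default.xml (the value 3000 is illustrative, matching the figure mentioned in this thread):

```xml
<!-- Sketch: raise the per-page outlink cap so link-heavy hub pages
     contribute all their outlinks to the web db in one pass. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>3000</value>
</property>
```

The trade-off is db size and update time: a higher cap means many more entries per hub page, so raising it only makes sense when the extra outlinks are actually wanted.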
Jack Tang wrote:
Yes, Stefan.
But it missed some URLs; I set the value to 3000, and then everything was OK
/Jack
I'm also thinking about implementing an automated workflow of
fetchlist-crawl-updateDb-index. Although my project may not require
NDFS because it only concerns deep crawling of 100,000 sites, an
appropriate workflow is still needed to automatically take care of
failed urls, newly-added
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it
seems that a new urlfilter is a good place to extend the inclusion regex
capability. The new urlfilter will be defined by urlfilter.class
property, which gets loaded by the URLFilterFactory.
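The include/exclude behavior such a urlfilter needs can be sketched compactly. This is a Python illustration of the first-match-wins semantics used by Nutch's regex-urlfilter rule files (a real extension would be a Java plugin loaded via urlfilter.class; the patterns and domains below are made up):

```python
import re

# Rules are tried in order: "+" accepts the url, "-" rejects it,
# the first matching rule wins, and no match means reject.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),           # skip media files
    ("+", re.compile(r"^https?://(www\.)?example\.com/")),   # whitelisted site
    ("+", re.compile(r"^https?://(www\.)?example\.org/")),
]

def filter_url(url):
    """Return the url if it passes the rule list, else None."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None
```

With hundreds of thousands of whitelisted domains (the performance worry raised earlier in this thread), a linear list of regexes becomes the bottleneck; a set or prefix lookup on the host part scales much better than one pattern per domain.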
Regex is necessary because you
for a public beta. I'll be sure to post here when we're
finally open for business. :)
--Matt
On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
it seems that a new urlfilter is a good place to extend the
inclusion regex capability
-platform
Reporter: AJ Chen
There is a gap between whole-web crawling and single (or handful of) site
crawling. Many applications actually fall into this gap, as they usually
require crawling a large number of selected sites, say 10 domains. The
current CrawlTool is designed for a handful of sites. So
Seeded with a list of urls, the nutch whole-web crawler is going to take an
unknown number of generate/fetch/updatedb cycles in order to reach
some level of completeness, both for internal links and outlinks. It's
crucial to monitor the progress. I'll appreciate some suggestions or
best
FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: junit.
Did I miss anything for junit? Appreciate your help.
AJ Chen
codes.
Apparently, the command ant test does not work. Does anybody have an idea
how to make the unit tests work?
AJ
Michael Ji wrote:
What is the junit test for? A particular patch?
Sorry if my question is silly.
Michael Ji,
--- AJ Chen [EMAIL PROTECTED] wrote:
I'm a newcomer, trying
Regards,
Fuad Efendi
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 28, 2005 9:01 PM
To: nutch-dev
Subject: junit test failed
I'm a newcomer, trying to test Nutch for vertical search. I downloaded
the code and compiled it in cygwin. But the unit
:
ANT_HOME/lib/ant-junit.jar
And copy the junit-3.8.1.jar file into apache-ant-1.6.3\lib
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 12:00 AM
To: nutch-dev@lucene.apache.org
Subject: Re: junit test failed
I'm using ant 1.6.5, which has junit.jar