Hi:
I am not a Nutch expert, but I think your problem is easy to solve:
1. make a list of seed URLs in a file under the urls folder (see the sketch after this list)
2. add all of the domains that you want to crawl to crawl-urlfilter.txt, just
like this:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aaa.edu/
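For step 1, a minimal seed file might look like this (aaa.edu is just a
placeholder matching the filter above):

# urls/seed.txt -- one seed URL per line
http://www.aaa.edu/
http://www.aaa.edu/academics/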
Hi:
Why do you think Nutch can't find
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
Actually http://app02.laopdr.gov.la/ is the same page as
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
if you find http://app02.laopdr.gov.la in your log, that is the same page
being fetched. You can also set db.ignore.external.links in nutch-site.xml,
so that outlinks leading to external hosts will be ignored:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to
  include only initially injected hosts, without creating complex
  URLFilters.</description>
</property>
Tony Wang wrote:
that helps a lot! thanks!
2009/3/2 yanky young yanky.yo...@gmail.com
Hi:
I am not a Nutch expert, but I think your problem is easy to solve ...
Hi:
if you want an adaptive fetching strategy only for specific domains, you can do
this:
write your own subclass of AdaptiveFetchSchedule, say MyAdaptiveFetchSchedule,
and override its methods so the adaptive schedule is applied only to your
target domains; see the sketch below.
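A minimal sketch, assuming the Nutch 1.0 FetchSchedule API; the domain test
and the class body are hypothetical:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AdaptiveFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class MyAdaptiveFetchSchedule extends AdaptiveFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // adapt the refetch interval only for the domains we care about
    if (url.toString().contains("aaa.edu")) { // hypothetical domain test
      return super.setFetchSchedule(url, datum, prevFetchTime,
          prevModifiedTime, fetchTime, modifiedTime, state);
    }
    // any other host keeps the schedule it already has
    return datum;
  }
}

You would then point db.fetch.schedule.class at this class in nutch-site.xml.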
I see both http://app02.laopdr.gov.la/ and
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
in my fetch log, but I just cannot find the page.
I suspect it is something about dynamic pages... is that reasonable?
2009/3/3 yanky young yanky.yo...@gmail.com
Hi:
Why do you think Nutch ...
When I use words to search, e.g. searching for "opportunity" or "good
opportunity", I found nothing.
Why?
Yves
2009/3/4 yanky young yanky.yo...@gmail.com
Hi:
because they are actually the same page, you can only find one. Here is
what I see when I use wget to fetch http://app02.laopdr.gov.la/:
[wget output truncated in the archive]
Hi:
Doug Cutting once wrote a wiki page about the hardware requirements of Nutch;
you can check it out:
http://wiki.apache.org/nutch/HardwareRequirements
good luck
yanky
2009/3/4 John Martyniak j...@beforedawn.com
Regarding the machine, you could run it on anything; it all depends what
kind of ...
What does the content in the Lucene Document look like: is there maybe a
truncation, or did the page not get parsed right?
On Mar 3, 2009, at 6:20 PM, yanky young wrote:
sorry, I have no idea about this question. I guess there must be some kind
of leakage in the Nutch indexing process; some words must be getting ignored.
Hi:
you said that you are crawling college websites and using XPath to extract
class or course information. That's good. But how do you determine whether a
web page is about classes or not?
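(One possible approach, as a minimal sketch: run an XPath query against the
parsed DOM and treat a match as a relevance signal. The selector and the
CoursePageDetector class are hypothetical, and the DOM is assumed to come
from whatever HTML parser you already use.)

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CoursePageDetector {
  // crude relevance test: does the page contain a course table at all?
  public static boolean looksLikeCoursePage(Document doc) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList rows = (NodeList) xpath.evaluate(
        "//table[contains(@class,'course')]//tr", // hypothetical selector
        doc, XPathConstants.NODESET);
    return rows.getLength() > 0;
  }
}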
If you just crawled the whole web site, that must be a complete crawl and
thus a complete tree.
If you use some ...
you can look at the hadoop log to find a clue
good luck
yanky
2009/3/5 dealmaker vin...@gmail.com
I am using the nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am at
the final phase of crawling, following the tutorial on the Nutch.org website. I
ran the following command, and I got an exception:
Hi:
good suggestion. And I'd like to share something about asking questions in a
smart way with all of you:
http://www.rtfa.net/esrs-how-to-ask-questions-the-smart-way
2009/3/5 Venkateshprasanna prasanna...@yahoo.co.in
A lot of queries in the forum go unanswered, either due to lack of ...
Check your Yahoo mailbox for the email sent from the nutch-user mailing list;
you will find the links you need (including unsubscribe) there.
2009/3/5 Edward Chen czy11...@yahoo.com
Hi, how do I remove myself from this mailing list? Please give me a link.
Good Luck.
From: yanky
Hi:
maybe you'd better paste your plugin.folders property config from
nutch-site.xml. When you run nutch crawl, Nutch will load plugins as needed
from plugin.folders. There are two places to check:
1. the plugin.folders property should be configured to $NUTCH_HOME/src/plugin
or $NUTCH_HOME/build/plugins (a config sketch follows below), ...
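A minimal sketch of that property, assuming you run Nutch from a compiled
source tree (adjust the value to your own layout):

<property>
  <name>plugin.folders</name>
  <value>build/plugins</value>
  <description>Directories where Nutch plugins are located.</description>
</property>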
The domain urlfilter seems to be in 1.0; maybe you can just check out that
plugin's code from the 1.0 trunk and build it into your 0.9 code base.
good luck
yanky
2009/3/14 MyD myd.ro...@googlemail.com
Where can I find the domain urlfilter? I'm using the branch 0.9...
Cheers,
Markus
Dennis Kubes-2 wrote:
Hi:
I also agree that most usage scenarios of Nutch are in the vertical search
area, and in some unusual cases users don't even use Nutch indexing at all;
they just crawl some pages for mirroring purposes. And in some cases of
vertical search, users only need a fraction of the pages, e.g. house rental ...
Hi:
it seems you are writing to an XML file from multiple threads. I guess it can
be done by using a BlockingQueue from the Java 1.5 concurrency API: you just
add any URL entry onto the queue from multiple producer threads, and use a
separate consumer thread to retrieve URL entries from the queue and write them
to the file; see the sketch below.
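A minimal, self-contained sketch (class name, file name, and the sample URLs
are hypothetical; the sentinel value just tells the consumer to stop):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class XmlUrlWriter {
  private static final String STOP = "__STOP__"; // sentinel ending the consumer

  public static void main(String[] args) throws Exception {
    final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

    // producer: any number of threads may call queue.put(url) safely
    Thread producer = new Thread(new Runnable() {
      public void run() {
        try {
          queue.put("http://example.com/a");
          queue.put("http://example.com/b");
          queue.put(STOP);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });

    // single consumer: the only thread that ever touches the file
    Thread consumer = new Thread(new Runnable() {
      public void run() {
        try {
          PrintWriter out = new PrintWriter(new FileWriter("urls.xml"));
          out.println("<urls>");
          String url;
          while (!(url = queue.take()).equals(STOP)) {
            out.println("  <url>" + url + "</url>");
          }
          out.println("</urls>");
          out.close();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    });

    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
  }
}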
Hi:
you can look at the source code of the Crawl class, which can be used to start
Nutch with a plain java command, without Cygwin:
java -D... -classpath ... org.apache.nutch.crawl.Crawl urls -depth 10 -topN
1000
good luck
yanky
2009/3/18 MyD myd.ro...@googlemail.com
This is an interesting question. If you know
Hi:
you can put any parameters in nutch-site.xml as property settings, and read a
property from your plugin class with conf.get(your property name); see the
sketch below.
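A minimal sketch; the property name, value, and default are all hypothetical:

<property>
  <name>myplugin.max.items</name>
  <value>50</value>
</property>

and in the plugin, assuming conf is the Configuration object Nutch hands you:

int maxItems = conf.getInt("myplugin.max.items", 10); // 10 is the fallback
String raw = conf.get("myplugin.max.items");          // null if unset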
good luck
yanky
2009/3/18 MyD myd.ro...@googlemail.com
Hi all,
where is it possible to set plugin-specific parameters (for my own plugin) ...
Hi:
according to my understanding, in Nutch 1.0 you can configure Nutch to
recrawl on a specific schedule:
see this issue: http://issues.apache.org/jira/browse/NUTCH-61
and this class: AdaptiveFetchSchedule
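To select that schedule, a sketch for nutch-site.xml (the class name is the
one shipped in 1.0):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>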
By the way, there is no way to configure Nutch to recrawl only the parts of a
website that have changed.
Hi:
i guess the urls you mentioned are all directed to the same jsp or servlet;
apparently they all begin with
http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome.
The difference is the request_locale parameter, e.g.
/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
... in the urls directory
2009/3/19 yanky young yanky.yo...@gmail.com
Hi:
i guess the urls you mentioned are all directed to the same jsp or
servlet;
apparently they all begin with
http://app02.laopdr.gov.la/ePortal ...
... reach those
two urls,
so i am worried.
2009/3/20 yanky young yanky.yo...@gmail.com
that must work, but it seems weird. You know, from the seed url you have
given, Nutch will crawl outward, and the whole set of crawled pages is
actually
a tree: the root node is the seed url. If you can ...
in the url.txt?
2009/3/20 yanky young yanky.yo...@gmail.com
I think my guess is right. I just looked at the code of that page.
Those two urls are generated by a javascript function:
function jump(lan)
In this case, Nutch might not be smart enough to recognize this kind of
generated url,
but if you ...
Hi:
I got the same error, and after I installed JDK 1.6 it worked.
It seems a bit weird, because the javac requirement in build.xml is
1.5, but the build broke anyway.
I guess the hadoop jar was compiled with Java 1.6 and its class compatibility
is 1.6, so you can't run it on Java 1.5.
yanky
Hi:
there is an index-more plugin that indexes some information about the content
type; you can have a look.
2009/4/5 dealmaker vin...@gmail.com
Hi,
I am trying to find out the encoding and format of the content stored in
the index. I modified the code in BasicIndexFilter.java to store the ...
dealmaker vin...@gmail.com
Thanks. Is there a similar thing for the encoding? I don't want it to
re-detect the encoding again, for performance reasons.
yanky young wrote:
Hi:
there is an index-more plugin that indexes some information about the content
type; you can have a look.
2009/4/5
Hi:
I am using nutch 0.9 as a base for a project. When I use local files on a
Windows XP SP2 system to test, I find that the protocol-file plugin just
breaks. For example:

String url =
    "file:///C:/cygwin/home/data/train/cv/Brendan%20O'Leary%20CV%20html.html";
try {
    ProtocolOutput ...
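For context, a self-contained sketch of the call that breaks, assuming the
Nutch 0.9 protocol API (the probe class itself is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class FileProtocolProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String url =
        "file:///C:/cygwin/home/data/train/cv/Brendan%20O'Leary%20CV%20html.html";
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);
    ProtocolOutput output =
        protocol.getProtocolOutput(new Text(url), new CrawlDatum());
    System.out.println(output.getStatus()); // shows how the fetch failed
  }
}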
Hi:
if you just use the nutch crawl command, you should put your domain names
in crawl-urlfilter.txt,
like this:
+^http://([a-z0-9]*\.)*bbc.co.uk/hindi
or
+^http://www.bbc.co.uk/hindi
good luck
2009/4/6, Ankur Garg garg.ankur.2...@gmail.com:
Hi All,
I am trying to crawl BBC Hindi site
if Nutch crawled some pages, there should be some fetching log lines in
stdout, like this:
fetching
http://www.law.harvard.edu/library/special/visit/reading-room-rules.html
Check your hadoop.log to see if there are lines like the above, or
change your log4j.properties and set debug mode for the Fetcher; see the
sketch below.
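A sketch of the log4j.properties change, assuming the stock Fetcher class
location:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG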
Hi:
It is wiser to store files in DFS rather than in a database. A
database is for structured data or data with a schema, and flat files are also
not good for large data storage. DFS provides out-of-the-box replication
for fault tolerance and, what's more, the MapReduce framework
can be used on DFS to ...
Hi guys:
I am using nutch in a project, but I found that nutch repeatedly fetches some
pages. For example:
http://www.me.washington.edu//people/faculty/wang/
This is a page that was fetched, but there are also urls like this in the
command-line output:
http://www.me.washington.edu//people/faculty/wang/
skovacevi...@gmail.com
you can handle this in the url-filter file (it is disabled by default); you
ran into a crawler loop on that site. See the rule sketch below.
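For reference, the stock regex-urlfilter.txt ships a rule meant to break
exactly this kind of loop (a path segment repeating three or more times):

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/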
On Wed, Apr 8, 2009 at 7:32 AM, yanky young yanky.yo...@gmail.com wrote:
Hi guys:
I am using nutch in a project, but I found that nutch repeatedly fetches
some ...
Hi:
Of course you can look into the code and add some debug lines for your case.
Just look at the protocol-file plugin, which is supposed to process the
file:// scheme. You can find this plugin's code in
${nutch_home}/src/plugin/protocol-file.
As for the nutch fetch list, you can dump the crawldb with nutch readdb; see
the sketch below.
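A sketch of the dump command (the paths are illustrative):

bin/nutch readdb crawl/crawldb -dump crawldb-dump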
why not just add -Xms/-Xmx JVM parameters to see if it still happens?
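For example, a sketch with illustrative heap sizes, following the java command
shown earlier in the thread:

java -Xms512m -Xmx1024m -classpath ... org.apache.nutch.crawl.Crawl urls -depth 10 -topN 1000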
2009/4/9 srinivas jaini srinivasja...@gmail.com
I've checked out the code and am running a crawl, and I get this error; any
thoughts?
environment: java 6, eclipse
2009-04-08 01:22:41,658 INFO crawl.Injector
Hi:
in nutch-site.xml you can define these properties:

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  ...
Hi:
I am not sure I understand your question. Can you give more details about
your application?
Your data is a list of classes and some faculty info.
So what is the structure or schema of this data? Does the data come from a
database?
If all of your data is web pages, here is my hint: use some kind ...
Hi:
I have encountered a similar problem with local Windows file system search
with nutch 0.9. You can see my post here:
http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html
Hope it helps.
good luck
yanky
2009/4/13 Fadzi
I changed db.max.outlinks.per.page to 1000 from 100,
and I started getting exactly 500 documents instead of the 600 or so. So
I changed it to -1 and I am still getting 500 docs! Not sure what's going
on here.
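For reference, a sketch of that setting in nutch-site.xml (a negative value
means no limit on outlinks per page):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>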
On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote:
Hi:
I have encountered a similar problem with local
Hi:
just use JDK 1.6 instead; class file version 50.0 is the Java 6 class format,
which a 1.5 JVM cannot load. That will be fine.
2009/4/20 Filipe Antunes fantu...@tecnica.cc
I can't build Nutch with Ant.
My Ant version is 1.7.1 and I'm on Mac OS X 10.4 (PowerPC).
My Java version is 1.5.0.
I can't figure out why I'm getting the error "class file has wrong version
50.0" ...
Hi:
I did some focused crawling with Nutch a few months ago. What I did was
override some methods of the scoring-opic plugin, the ones that pass scores
before and after parsing, just as Krugler said. I customized the scoring
metadata, and I even managed to integrate a text classifier, such as a
Bayesian classifier, to ...
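A minimal sketch of the idea, assuming the Nutch 1.0 ScoringFilter API; the
FocusedScoringFilter class, the classify() helper, and the metadata key are
all hypothetical placeholders:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

public class FocusedScoringFilter extends OPICScoringFilter {

  @Override
  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
      throws ScoringFilterException {
    super.passScoreBeforeParsing(url, datum, content); // keep OPIC behavior
    // room to stash extra metadata on the raw content if needed
  }

  @Override
  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException {
    super.passScoreAfterParsing(url, content, parse); // keep OPIC behavior
    // score the parsed text and keep the result as parse metadata, so a
    // later scoring step can bias outlink scores toward the topic
    float relevance = classify(parse.getText()); // hypothetical classifier
    parse.getData().getContentMeta().set("topic.relevance",
        Float.toString(relevance));
  }

  // placeholder for a real text classifier (e.g. Bayesian)
  private float classify(String text) {
    return text.toLowerCase().contains("course") ? 1.0f : 0.1f;
  }
}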
and what they are really for.
Is there any documentation that I should read first?
-Ray-
2009/5/14 yanky young yanky.yo...@gmail.com
Hi:
I did some focused crawling with Nutch a few months ago. What I did was
to
override some methods of the scoring-opic plugin before and after parsing ...
Hi:
Maybe you just need to add a url filter entry to your regex-urlfilter.txt
configuration file.
And if the feeds are in RSS or Atom format, you should activate the parse-rss
plugin: just add it to the plugins part of your nutch-site.xml; see the sketch
below.
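A sketch of plugin.includes with parse-rss added (the other entries follow the
stock default and may differ in your install):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>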
good luck
yanky
2009/6/11 Xalan aaven...@gmail.com
Regards,