RE: A few nutch questions

Aled Rhys Jones Sun, 30 Dec 2007 03:19:38 -0800

Cool, thanks for the help.
I've added the nutch-0.9.jar and hadoop-core-0.12.2 jar to my project.  
I wish to make use of features that I don't think are included in the above
jars, but might be in some of the plugin jars; namely spelling suggestions
and subcollections.


If I'm using the nutch api, do I get at these extra functionalities by
simply adding the required jars to my project, or is there a better way to
add plugins?

Also, is there any documentation on using subcollections?  I'm finding it
very hard to find any.  Ideally I would like to be able to mark URLs as
belonging to multiple different subcollections, and be able to run a
query across multiple subcollections.  Is this possible?

Thanks for any reply

Cheers
Aled

-----Original Message-----
From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED] 
Sent: 28 December 2007 19:18
To: [email protected]
Subject: RE: A few nutch questions

Good, looks like you have found your way.  I also copied the conf
directory into my webapp and pointed the configuration object at that
for the proper config files.  

My application is sort of crazy, but the NutchBean is easy to use in
any context.

Jeff


-----Original Message-----
From: Aled Rhys Jones [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 28, 2007 8:24 AM
To: [email protected]
Subject: RE: A few nutch questions

Doh! Found the jar ;-)
Out of interest are there any nutch maven2 repositories out there?

Cheers
Aled 

-----Original Message-----
From: Aled Rhys Jones [mailto:[EMAIL PROTECTED] 
Sent: 28 December 2007 13:19
To: '[email protected]'
Subject: RE: A few nutch questions

Thanks Jeff!
You say you created a NutchBean?  I assume you mean you created a bean in
your application that makes use of the NutchBean?  Which jars do I need in
my application to be able to execute a search on NutchBean?  It looks like
I need to pass in a Hadoop Configuration object that is populated with
site resource data.

Thanks again

Aled

-----Original Message-----
From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED] 
Sent: 27 December 2007 19:16
To: [email protected]
Subject: RE: A few nutch questions

Aled,
I've integrated Nutch into my current application by simply creating a
NutchBean and pointing it at the configuration files and indexes.  It
works great for me.

http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/NutchBean.html

That is the JavaDoc API page for the NutchBean.  It is pretty
self-explanatory, and it took me about 2 hours to get things working the
way I wanted them to.  Good luck.

Jeff
 

-----Original Message-----
From: Aled Rhys Jones [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 27, 2007 11:12 AM
To: [email protected]
Subject: A few nutch questions

Hi everyone

I've got a few questions on Nutch.  We're using the latest release
version, 0.9.

We're looking to crawl, index and search roughly 20,000 websites.  Based
on the example of roughly 10KB per page, and a conservative estimate of
100 pages per site (since the sites will be commercial in nature, my guess
is that most will have more than that), then it equates to about 20GB of
storage required.
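
The estimate above can be checked with a few lines of Java.  This is only
a sketch of the arithmetic, using the figures from this message (20,000
sites, 100 pages per site, 10KB per page) and assuming decimal units
(1KB = 1,000 bytes, 1GB = 1,000,000,000 bytes).  The class and method
names are made up for illustration and are not part of Nutch.

```java
public class CrawlEstimate {

    /** Raw page volume in bytes: sites x pages per site x bytes per page. */
    static long totalBytes(long sites, long pagesPerSite, long bytesPerPage) {
        return sites * pagesPerSite * bytesPerPage;
    }

    /** Converts bytes to decimal gigabytes (assumption: 1GB = 10^9 bytes). */
    static double toGigabytes(long bytes) {
        return bytes / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        // Figures taken from the message: 20,000 sites, 100 pages each, 10KB per page.
        long bytes = totalBytes(20_000, 100, 10_000);
        System.out.println(toGigabytes(bytes) + " GB"); // prints "20.0 GB"
    }
}
```

This counts only the fetched page content.  Index size, segment overhead
and HTTP headers are not included, so real storage and transfer figures
would probably differ.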

Does anyone have some stats on roughly how much bandwidth is required to
crawl this number of sites once?

We've tried to start out small, but already our dedicated server host is
complaining about outbound bandwidth and breaking terms of use.  Are there
any recommended hosts for crawling using Nutch, or can anyone recommend
hosting particulars to look out for?

Lastly, we'd like to connect to our Nutch install via an API, so we can
add more content to our results.  Our main application is a standard Java
web application with Spring, Hibernate, MySQL etc. sitting pretty on
Tomcat.  We also (currently) run Nutch on the same Tomcat install.  What's
the best way to communicate with the Nutch install from our current
application to provide it with search engine capabilities?  Does Nutch
include web services, or can we use RMI?

Thank you for any feedback, it would be greatly appreciated.

Cheers
Aled
