Re: Multiple indexes on a single server instance.

2006-05-30 Thread sudhendra seshachala

Yes, you nailed it. I am not sure if it is doable; I am still trying to figure
that out.
  My problem is that I capture the same or similar data from all sites. I should
be able to apply those extra points.
   
  Stefan Neufeind [EMAIL PROTECTED] wrote:
  sudhendra seshachala wrote:
 I am experiencing a similar problem.
 What I have done is as follows.
 I have a different parse-plugin for each site (I have 3 sites to crawl and
 fetch data from), but I capture the data into the same format, which I call the
 data repository.
 I have one index-plugin which indexes the data repository and one query-plugin
 on the data repository.
 I don't have to run multiple instances; I just run one instance of the search
 engine.
 However, the parse configuration is different for each site, so I run a different
 crawler for each site.
 Then I index and merge all of them. So far the results are good if not WOW.
 I still have to figure out a way of ranking the pages. For example, I would like
 to be able to apply ranking on the data repository. Let me know if I was clear...

Hi,

not sure if I got you right with your last point, but it just came to my
mind:
It would be nice to be able to have something like:
if it's from indexA, give it 100 extra points; if from indexB, give it
50 extra points. Or something like: if indexA, give it 20% extra weight.
But I don't believe this is easily doable. Or is it?

I have a similar problem with languages: give priority to documents in
German and English, but somewhere after those results also list
documents in other languages. So I'd need to be able to give
extra points on a per-language basis, based on the indexed
language field, right?


Regards,
Stefan

 Stefan Groschupf wrote:
 I'm not sure what you are planning to do, but you can just switch a
 symbolic link on your hdd, driven by a cronjob, to switch between indexes
 at a given time.
 Maybe you need to touch the web.xml to restart the searcher.
 If you try to search different kinds of indexes at the same time, I
 suggest merging the indexes and having a kind of key field for each of the
 indexes.
 For example, add a field named indexName to each of your indexes and
 put A, B or C as the value in it.
 Then you can merge your indexes. At runtime you just need a
 query filter that appends indexName:A or indexName:B to the query
 string.
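
 In plain Lucene terms, that expansion is roughly this (a sketch only; the
 indexName field follows the example above, and the Lucene 1.9-era
 BooleanQuery API is assumed):

 // Sketch: restrict a search over the merged index to sub-index "A"
 // by adding a required clause on the indexName key field.
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;

 public class IndexNameFilter {
   public static Query restrictTo(Query userQuery, String indexName) {
     BooleanQuery combined = new BooleanQuery();
     // the user's original query must still match
     combined.add(userQuery, BooleanClause.Occur.MUST);
     // and the hit must come from the wanted sub-index
     combined.add(new TermQuery(new Term("indexName", indexName)),
                  BooleanClause.Occur.MUST);
     return combined;
   }
 }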
 
 Does this somehow help to solve your problem?
 Stefan
 
 On 23.05.2006 at 15:26, TJ Roberts wrote:
 
 I have five different indexes each with their own special 
 configuration. I would like to be able to switch between the 
 different indexes dynamically on a single instance of nutch running 
 on jakarta-tomcat. Is this possible, or do I have to run five 
 instances of nutch, one for each index?



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   



Re: Multiple indexes on a single server instance.

2006-05-30 Thread sudhendra seshachala
Yes, that is what I am trying, but for some reason it is not working.
  Do these fields have to be lower case only?
   
  

Andrzej Bialecki [EMAIL PROTECTED] wrote:
  Stefan Neufeind wrote:
 sudhendra seshachala wrote:
 
 I am experiencing a similar problem.
 What I have done is as follows.
 I have a different parse-plugin for each site (I have 3 sites to crawl and
 fetch data from), but I capture the data into the same format, which I call the
 data repository.
 I have one index-plugin which indexes the data repository and one
 query-plugin on the data repository.
 I don't have to run multiple instances; I just run one instance of the search
 engine.
 However, the parse configuration is different for each site, so I run a
 different crawler for each site.
 Then I index and merge all of them. So far the results are good if not WOW.
 I still have to figure out a way of ranking the pages. For example, I would like
 to be able to apply ranking on the data repository. Let me know if I was
 clear...
 

 Hi,

 not sure if I got you right with your last point, but it just came to my
 mind:
 It would be nice to be able to have something like:
 if it's from indexA, give it 100 extra points; if from indexB, give it
 50 extra points. Or something like: if indexA, give it 20% extra weight.
 But I don't believe this is easily doable. Or is it?

 I have a similar problem with languages: give priority to documents in
 German and English, but somewhere after those results also list
 documents in other languages. So I'd need to be able to give
 extra points on a per-language basis, based on the indexed
 language field, right?
 


This is not only doable, but fairly easy - just add these fields to the 
index through a custom IndexingFilter plugin, and then implement a 
corresponding QueryPlugin that will expand your query appropriately - 
this prioritization that you describe is equivalent to adding a 
non-required and non-prohibited clause to a Lucene query. Please see how 
it's done in the existing index-more/query-more and 
index-basic/query-basic plugins.
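
Concretely, the expanded query looks roughly like this in plain Lucene (a
sketch; the lang field name and boost values are made up for the example):

// Sketch: prefer German and English documents without excluding others.
// A SHOULD clause (non-required, non-prohibited) only boosts matching docs.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LanguageBoost {
  public static Query preferLanguages(Query userQuery) {
    BooleanQuery q = new BooleanQuery();
    q.add(userQuery, BooleanClause.Occur.MUST);  // results must match the query
    TermQuery de = new TermQuery(new Term("lang", "de"));
    de.setBoost(2.0f);                           // extra weight, not a filter
    q.add(de, BooleanClause.Occur.SHOULD);
    TermQuery en = new TermQuery(new Term("lang", "en"));
    en.setBoost(2.0f);
    q.add(en, BooleanClause.Occur.SHOULD);
    return q;
  }
}

Documents in other languages still match; they just sort below the boosted ones.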

-- 
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   



Re: changing ranking

2006-05-20 Thread sudhendra seshachala


If someone has to adopt the plugin, it has to go with a new crawl. Will there be
a way to apply these scoring mechanisms to pages that are already
fetched, indexed and merged too?
Can you please shed some light?

Thanks


Andrzej Bialecki [EMAIL PROTECTED] wrote: Ken Krugler wrote:
 Eugen Kochuev wrote:
 Hello Andrzej,

 Please see the scoring API - you can write a plugin that manipulates
 page scores according to your own idea.

 Thanks a lot for your answer, but could you please shed some more
 light onto scoring technique used in the Nutch?
 As I can see from the source code Nutch uses something similar to the
 pagerank algorithm propagating page scores through outlinks, but 
 only one
 iteration is used (while pagerank requires several iterations to
 converge).

 That's a bit of a complicated subject - I could either explain this in
 very general terms, or suggest that you read the paper that underlies
 the current Nutch implementation (with a twist). Please see the
 comment in OPICScoringFilter.java for the link to the paper.

 I've started writing up a description of the changes that I think need
 to be made to Nutch to really implement the OPIC algorithm, as
 described by the Adaptive On-Line Page Importance Computation
 paper (ACM 1-58113-680-3/03/0005).

 Should I just open a JIRA issue, and dump what might be a pretty long 
 write-up into it?

Yes, please do - I'd love to implement this in that original form, even 
if it would go into another plugin ...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: [Nutch-general] Re: Extending Nutch talk, May 11th, Palo Alto, CA

2006-05-10 Thread sudhendra seshachala

Someone needs to send me the timings and how long the conference will run...
  Maybe I can just have 10 numbers. The conference has a service where
the session can be recorded and downloaded.
  Please suggest ASAP, or better still, in the US.
  Call me @ 281 516 2495 or (408 203 9960) so that we can finalize.
  I would really like to be in... and I hope I have decent support too :)
   
  Thanks
  Sudhi
   
  TDLN [EMAIL PROTECTED] wrote:
  +1

I would be interested as well.

Rgrds, Thomas Delnoij

On 5/10/06, [EMAIL PROTECTED] wrote:
 +1 to this!
 I won't be in San Francisco on the 11th, but would be interested in 
 seeing/listening either in real-time or a recorded version.

 Thanks,
 Otis

 - Original Message 
 From: sudhendra seshachala 
 To: nutch-user@lucene.apache.org
 Sent: Tuesday, May 9, 2006 8:18:36 PM
 Subject: [Nutch-general] Re: Extending Nutch talk, May 11th, Palo Alto, CA

 Is there a way to pod/video cast this... or at least a conference call (just
 listening mode)? I have a personal account; maybe I can sponsor the
 listening-mode conference... Please let me know if I can be of any assistance.
 It will really help folks who are outside the bay area.
 There is life (hi-tech) outside the bay area too... in the US.

 Thanks

 Stefan Groschupf wrote: Hi Nutch Users,
 Doug already mentioned it in the developers list (thanks!), but for
 those of you that do not subscribe to the developer list...
 The next CommerceNet Thursday Tech Talk will be about Extending
 Nutch. I'll present a few slides about the plugin system and meta
 data 'flow' in nutch.
 http://events.commerce.net/?p=58
 I would be glad to hear about experience, needs and thoughts about
 this topic from nutch users around the bay area. :)

 Cheers,
 Stefan




 Sudhi Seshachala
 http://sudhilogs.blogspot.com/










  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Extending Nutch talk, May 11th, Palo Alto, CA

2006-05-09 Thread sudhendra seshachala
Is there a way to pod/video cast this... or at least a conference call (just
listening mode)? I have a personal account; maybe I can sponsor the
listening-mode conference... Please let me know if I can be of any assistance.
It will really help folks who are outside the bay area.
There is life (hi-tech) outside the bay area too... in the US.

Thanks

Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Nutch Users,
Doug already mentioned it in the developers list (thanks!), but for
those of you that do not subscribe to the developer list...
The next CommerceNet Thursday Tech Talk will be about Extending  
Nutch. I'll present a few slides about the plugin system and meta  
data 'flow' in nutch.
http://events.commerce.net/?p=58
I would be glad to hear about experience, needs and thoughts about  
this topic from nutch users around the bay area. :)

Cheers,
Stefan

  


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Nutch ADMIN -GUI Mirror

2006-05-04 Thread sudhendra seshachala
I have hosted the bundle at the following URL.
   
  http://68.178.249.66/nutch-admin/nutch-0.8-dev_guiBundle_05_02_06.tar.gz
   
  I hope it helps.
   
  Thanks
  Sudhi

   


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: GUI

2006-05-04 Thread sudhendra seshachala
It just got completed a few days back.
  You could beta test it by downloading from
  http://68.178.249.66/nutch-admin/nutch-0.8-dev_guiBundle_05_02_06.tar.gz
   
  It is still in early stages... so I would not rank it as stable.
   
  Thanks
   
   
  Markus Franz [EMAIL PROTECTED] wrote:
  Hello!

Are there any powerful and stable (or almost stable) administration GUIs
for Nutch? Did you test them?

Regards,
Markus

-- 
Danziger Weg 2
97350 Mainbernheim
Germany
--
+491626077635
[EMAIL PROTECTED]
--




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Admin Gui beta test (was Re: ATB: Heritrix)

2006-04-28 Thread sudhendra seshachala
Hi Stefan
  I would be willing to host the app.
  I have a virtual dedicated server from GoDaddy with Fedora Core 2, and an
Apache webserver and Tomcat running.
  The IP address is http://68.178.249.66. Right now, on the webserver side, I
have a default page (hosted by GoDaddy) running,
  but I can make sure the Admin GUI is running. I might need some help, but it
should not be a problem at all.
   
   
  Thanks
  Sudhi
  

Stefan Groschupf [EMAIL PROTECTED] wrote:
  Hi there,

since building the GUI is somehow complicated, I was thinking about
providing a ready-to-use binary.
This maybe would help to get some more beta testers, which we are currently
looking for.
Any thoughts?

However, I am afraid that this would hit my server too hard and I would have to
pay for traffic. :-/
Does anyone have an idea where we can mirror this file for free?
Any volunteer is very welcome.

Thanks.
Stefan




On 28.04.2006 at 15:14, Aled Jones wrote:

 Thanks for your replies guys. I hadn't realised that the admin gui 
 was
 already in development.
 We should be able to cope till it gets released ;-)

 Thanks again
 Aled

 -Neges Wreiddiol-/-Original Message-
 Oddi wrth/From: Dan Morrill [mailto:[EMAIL PROTECTED]
 Anfonwyd/Sent: 28 April 2006 14:07
 At/To: nutch-user@lucene.apache.org
 Pwnc/Subject: RE: Heritrix

 Aled,

 I used heritrix before going over to nutch; while it is an
 excellent program, with lots of good things to offer, it
 didn't quite meet my need, and when designing the
 architecture it had too many dependencies for me to be comfortable with.

 If you want to run an internet archive though, heritrix cannot
 be beat; if you want to run a search engine, nutch is a
 good choice.

 My personal opinion.
 r/d

 -Original Message-
 From: Aled Jones [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 28, 2006 1:59 AM
 To: nutch-user@lucene.apache.org
 Subject: Heritrix

 Hi

 Anyone used Heritrix (http://crawler.archive.org/) as a
 crawler? How does it compare with the Nutch crawler? Can
 Nutch serve its crawled
 results? Main reason I'm interested is that it has a WUI interface
 that might make maintenance for the IT guys easier, although
 I know that some of you guys are working on an interface.

 Cheers
 Aled






-
blog: http://www.find23.org
company: http://www.media-style.com





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Beagle and Nutch

2006-04-27 Thread sudhendra seshachala
For searches, it still uses Lucene (dotLucene). How could it be much different,
except that Beagle is using C# rather than Java as in Nutch?
  I would be very interested in how it would perform, though,
  and how easy it is to set it all up.
   
  Thanks
  Sudhi

Andrew Libby [EMAIL PROTECTED] wrote:
  
Has anyone attempted to accomplish the same things with Nutch
that are being accomplished by the Beagle project
(http://beaglewiki.org/Main_Page)?

I'm very interested in working with something like Beagle, however
I'm using Nutch for other things I'm doing, and am looking for any
excuse to
get deeper into it for learning purposes.

Thanks.

Andy

-- 
Andrew Libby 
[EMAIL PROTECTED]
http://philadelphiariders.com/





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Where to put the nutch-site.xml ?

2006-04-19 Thread sudhendra seshachala
There are two ways:
bundle it with the jar file itself, which is in the WEB-INF/lib folder,
or add it to the conf folder under MODULE_NAME/WEB-INF/conf.
A Tomcat restart is required if you modify the conf folder or the jar in
the lib folder.

Hope this helps.

ahmed ghouzia [EMAIL PROTECTED] wrote: Dear nutchers

I completed a successful crawling process with nutch
0.7.1, and I am trying to make the search work.

1. I have my {segments, db and index} resulting from the
   last crawl located at:
   /home/ahmed/Desktop/Downloads/nutch/bin/crawl.testagain
   and tomcat is located at:
   /home/ahmed/Desktop/Downloads/tomcat/

2. I have edited the nutch-site.xml so that the
   searcher.dir refers to
   /home/ahmed/Desktop/Downloads/nutch/bin/crawl.testagain

3. Then I put a copy of nutch-site.xml at
   ~/tomcat/webapps/nutch-0.7.1/WEB-INF/classes

4. Then I restarted tomcat, then tried to browse
   http://localhost:8080/ but it gave the following error:

   HTTP Status 500 - No Context configured to process
   this request

5. I think that the problem is with the location to
   put the nutch-site.xml.

Where exactly can I put it?













  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Adding Level to Website Parse Data

2006-04-13 Thread sudhendra seshachala
Dennis,
  I am in the same dilemma as you are.
  Here are my thoughts.
   
  1. I am planning to write a plugin to do it, where the plugin can be
modified based on the site map and levels.
  2. The Fetcher itself can be modified. But again, code merging with the latest
contributions, fixes and enhancements from the community will be very hard.
  3. The other way is to write a prefetcher which will fetch all the urls from a
site and populate the file. Then the Nutch crawler can be triggered to crawl the
prefetched urls. Within the prefetched url pages, any unnecessary URLs not to
be crawled will have to be ignored. I am still trying to find a way to do this.
   
  Please share your thoughts.
  Thanks
   
  

Dennis Kubes [EMAIL PROTECTED] wrote:
I am trying to modify Nutch to add a level to the website parse data.
What I mean by this is: suppose you start parsing a website at its
homepage; that would be level one. Any links in the same site from the
homepage would be level two, links from those pages would be level three,
and so on. I am only counting links in the same site.

How would I go about modifying Nutch to handle this? I was thinking
that I would have to modify Fetcher to do this, adding the level to the
parse metadata. What I am not getting is: how would I get the link
level initially? I was thinking I would have to modify something in the
generator but didn't know what.

Dennis
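
The bookkeeping described here boils down to a breadth-first walk where a
page's level is its parent's level plus one, counting only same-host links.
A self-contained sketch of just that logic (not actual Nutch API; in Nutch
the level would be carried in the parse metadata, as suggested):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class LevelTracker {
  private final Map<String, Integer> levels = new HashMap<String, Integer>();

  public void start(String homepage) {
    levels.put(homepage, 1);  // the homepage is level one
  }

  // Called for each outlink found while parsing fromUrl.
  public void recordOutlink(String fromUrl, String outlink) throws Exception {
    Integer parent = levels.get(fromUrl);
    if (parent == null) return;
    // only count links in the same site
    String fromHost = new URL(fromUrl).getHost();
    String toHost = new URL(outlink).getHost();
    if (!fromHost.equals(toHost)) return;
    if (!levels.containsKey(outlink)) {
      levels.put(outlink, parent + 1);  // a child is one level deeper
    }
  }

  public Integer levelOf(String url) {
    return levels.get(url);
  }
}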



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Saving Metadata to Mysql

2006-04-11 Thread sudhendra seshachala
Sorry to just jump in.
We have a doc id associated when we index. We could store the doc id in a MySQL
table, and we could use the doc id to query the Nutch database.
When parsing, capture the things needed as part of the metadata;
index the metadata, and the associated doc id is stored in MySQL.

Does that give any idea?
Please do share your concerns. I am working on similar stuff where eventually
we have to adopt a database.

Thanks
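
A rough sketch of that idea as it might sit behind an indexing plugin (the
table, columns and connection URL are assumptions, not working Nutch code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MetadataStore {
  private static Connection conn;  // one shared, lazily opened connection

  private static synchronized Connection conn() throws SQLException {
    if (conn == null) {
      conn = DriverManager.getConnection(
          "jdbc:mysql://localhost/nutchmeta", "user", "password");  // assumed DSN
    }
    return conn;
  }

  // Store the doc id plus whatever was captured as parse metadata.
  public static void save(String docId, String url, String metaValue)
      throws SQLException {
    PreparedStatement ps = conn().prepareStatement(
        "INSERT INTO doc_meta (doc_id, url, meta) VALUES (?, ?, ?)");  // assumed table
    try {
      ps.setString(1, docId);
      ps.setString(2, url);
      ps.setString(3, metaValue);
      ps.executeUpdate();
    } finally {
      ps.close();
    }
  }
}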



John Reidy [EMAIL PROTECTED] wrote: I am looking at something similar.

I would guess the place to put it is the indexer. As I understand it, the
parser runs for just about everything fetched; however, the indexer is
only run for pages you want to index.
I am also looking at having static objects (e.g. a connection) that are
initialised when the plugin is loaded, ideally through the startup method.

Regards

John

Hey all,
I have written a custom HTML parser and indexer.  I would like to save some
information that I have gathered during the parse in a MySQL DB.  I imagine
there could be some performance hit here (e.g. connecting to the db).  What's
the best place to add code to save this information - the parser or the
indexer?

-Mike
--
View this message in context: 
http://www.nabble.com/Saving-Metadata-to-Mysql-t1389216.html#a3732992
Sent from the Nutch - User forum at Nabble.com.

  





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




RE: Nutch 500 Error

2006-04-11 Thread sudhendra seshachala
Check nutch-default.xml; there should be a property searcher.dir.
Provide the path to the index folder there.
Better still, copy the property node, paste it into nutch-site.xml, and
provide the path to the index folder there.
For example, if the crawl output is stored as:

home/nutch/crawl
- crawldb
- segments
- index
- indexes

point searcher.dir to home/nutch/crawl.
Hope this helps.

Thanks
Sudhi

Paul Stewart [EMAIL PROTECTED] wrote: Thanks - I was doing the java command
wrong...

Back to my original problem - I re-ran through the entire tutorial to
ensure I was doing it right, and it seems proper. How do I tell Nutch
where to look specifically in the code for the segments and indexes, in
case it is in the wrong place?

All the best,
Paul
 

-Original Message-
From: sudhendra seshachala [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 06, 2006 12:02 PM
To: nutch-user@lucene.apache.org
Subject: RE: Nutch 500 Error

It should be java -version, I think.

Paul Stewart wrote: Thanks for the reply...
I apologize as I'm very new to the Java world... :)

I am running the following:

Fedora Core 4
Apache Tomcat 5.5.16 (binary download from Tomcat site installed to
/usr/local/tomcat5)
jre1.5.0_06 (binary download from Sun site to /usr/java/jre1.5.0_06)

Weird though - when I try to do a java -v I get this now:

[EMAIL PROTECTED] jre1.5.0_06]# export JAVA_HOME=/usr/java/jre1.5.0_06/
[EMAIL PROTECTED] jre1.5.0_06]# /usr/java/jre1.5.0_06/bin/java -v
Unrecognized option: -v
Could not create the Java virtual machine.

Is this my actual problem possibly? Or is this the wrong Java version to
be running? When I downloaded 1.4.x, tomcat told me it didn't support
anything but 1.5.x.

Thanks again for your patience...
Paul


-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 06, 2006 7:16 AM
To: nutch-user@lucene.apache.org
Subject: Re: Nutch 500 Error

What version are you on? If you trace the NullPointerException back to
the code, the NutchBean.init method is where it expects to find the
index and segments, so either they're missing (did you follow the
tutorial and merge your segment indexes?) or it is looking in the wrong
place. That's what I think.

Rgrds, Thomas



On 4/6/06, Paul Stewart
wrote:
 Thanks.. Tried that ... Same error

 HTTP Status 500 -

 type Exception report

 message

 description The server encountered an internal error () that prevented
 it from fulfilling this request.

 exception

 org.apache.jasper.JasperException
 org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:510)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

 root cause

 java.lang.NullPointerException
 org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
 org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
 org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
 org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
 org.apache.jsp.search_jsp._jspService(search_jsp.java:112)
 org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

 -Original Message-
 From: TDLN [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 06, 2006 3:30 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Nutch 500 Error

 My guess is you have to override the searcher.dir property in 
 nutch-site.xml and have it point to your crawl dir.

 Rgrds, Thomas







  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


   




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




RE: Nutch 500 Error

2006-04-06 Thread sudhendra seshachala
It should be java -version, I think.

Paul Stewart [EMAIL PROTECTED] wrote: Thanks for the reply... I apologize as
I'm very new to the Java world... :)

I am running the following:

Fedora Core 4
Apache Tomcat 5.5.16 (binary download from Tomcat site installed to
/usr/local/tomcat5)
jre1.5.0_06 (binary download from Sun site to /usr/java/jre1.5.0_06)

Weird though - when I try to do a java -v I get this now:

[EMAIL PROTECTED] jre1.5.0_06]# export JAVA_HOME=/usr/java/jre1.5.0_06/
[EMAIL PROTECTED] jre1.5.0_06]# /usr/java/jre1.5.0_06/bin/java -v
Unrecognized option: -v
Could not create the Java virtual machine.

Is this my actual problem possibly? Or is this the wrong Java version
to be running? When I downloaded 1.4.x tomcat told me it didn't support
anything but 1.5.x 

Thanks again for your patience...
Paul


-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 06, 2006 7:16 AM
To: nutch-user@lucene.apache.org
Subject: Re: Nutch 500 Error

What version are you on? If you trace the NullPointerException back to
the code, the NutchBean.init method is where it expects to find the
index and segments, so either they're missing (did you follow the
tutorial and merge your segment indexes?) or it is looking in the wrong
place. That's what I think.

Rgrds, Thomas



On 4/6/06, Paul Stewart 
wrote:
 Thanks.. Tried that ... Same error

 HTTP Status 500 -

 type Exception report

 message

 description The server encountered an internal error () that prevented
 it from fulfilling this request.

 exception

 org.apache.jasper.JasperException
 org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:510)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

 root cause

 java.lang.NullPointerException
 org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
 org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
 org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
 org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
 org.apache.jsp.search_jsp._jspService(search_jsp.java:112)
 org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

 -Original Message-
 From: TDLN [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 06, 2006 3:30 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Nutch 500 Error

 My guess is you have to override the searcher.dir property in 
 nutch-site.xml and have it point to your crawl dir.

 Rgrds, Thomas







  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Crawling the local file system with Nutch - Document-

2006-04-04 Thread sudhendra seshachala
I just modified search.jsp; basically I set the content type based on the
document type I was querying.
  The rest is handled by the protocol and the browser.
   
  I can send the code if you would like.
   
  Thanks
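
Roughly, the change amounts to something like this inside the cached-content
JSP/servlet (a sketch; bean, details and the metadata accessor follow the
stock cached.jsp names and may differ by Nutch version):

// Serve the stored bytes with the document's original Content-Type so the
// browser opens Word or PDF natively instead of receiving text/html.
String contentType = (String) metaData.get("Content-Type");  // parse metadata
if (contentType == null) {
  contentType = "application/octet-stream";  // safe fallback
}
response.setContentType(contentType);
response.getOutputStream().write(bean.getContent(details));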

kauu [EMAIL PROTECTED] wrote:
Thanks for your idea!
But I have a question:
how do you modify the search.jsp and cached servlet to view Word and PDF as
demanded by the user, seamlessly?



On 4/1/06, Vertical Search wrote:

 Nutchians,
 I have tried to document the sequence of steps to adopt nutch to crawl and
 search a local file system on a windows machine.
 I have been able to do it successfully using nutch 0.8 Dev.
 The configuration is as follows:
 Inspiron 630m,
 Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz), Genuine
 Windows XP Professional.
 If someone can review it, it will be very helpful.

 Crawling the local filesystem with nutch
 Platform: Microsoft / nutch 0.8 Dev
 For a linux version, please refer to
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 The link did help me get it off the ground.

 I have been working on adopting nutch in a vertical domain. All of a
 sudden, I was asked to develop a proof of concept
 to adopt nutch to crawl and search the local file system.
 Initially I did face some problems, but some mail archives did help me
 proceed further.
 The intention is to provide an overview of the steps to crawl local file
 systems and search through the browser.

 I downloaded the nutch nightly build, then:
 1. Create an environment variable such as NUTCH_HOME. (Not mandatory, but it
 helps.)
 2. Extract the downloaded nightly build.
 3. Create a folder -- c:/LocalSearch -- and copy the following folders and
 libraries into it:
 1. bin/
 2. conf/
 3. *.job, *.jar and *.war files
 4. urls/
 5. plugins folder
 4. Modify the nutch-site.xml to include the plugins folder.
 5. Modify the nutch-site.xml to include the plugin includes. An example is as
 follows:

 <property>
   <name>plugin.includes</name>
   <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
 </property>

 <property>
   <name>file.content.limit</name>
   <value>-1</value>
 </property>
 6. Modify crawl-urlfilter.txt
 Remember we have to crawl the local file system. Hence we have to modify
 the
 entries as follows

 #skip http:, ftp:,  mailto: urls
 ##-^(file|ftp|mailto):

 -^(http|ftp|mailto):

 #skip image and other suffixes we can't yet parse

 -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

 #skip URLs containing certain characters as probable queries, etc.
 [EMAIL PROTECTED]

 #accept hosts in MY.DOMAIN.NAME
 #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

 #accept anything else
 +.*

 7. urls folder
 Create a file for all the urls to be crawled. The file should have the urls
 as below;
 save the file under the urls folder.

 The directories should be in file:// format. Example entries were as
 follows

 file://c:/resumes/word 
 file://c:/resumes/pdf 

 #file:///data/readings/semanticweb/

 Nutch recognises that the third line does not contain a valid file-url and
 skips it

 As suggested by the link
 8. Ignoring the parent directories. As suggested in the linux flavor of
 local fs crawl, I did modify the code in
 org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
 java.io.File f).

 I changed the following line:

 this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

 to

 this.content = list2html(f.listFiles(), path, false);
 and recompiled.

 9. Compile the changes. I just compiled the whole source code base; it did not
 take more than 2 minutes.

 10. Crawling the file system.
 On my desktop, I have a shortcut to cygdrive; double click it.
 pwd
 cd ../../cygdrive/c/$NUTCH_HOME

 Execute:
 bin/nutch crawl urls -dir c:/localfs/database

 Voila, that is it. After 20 minutes, the files were indexed, merged and all
 done.

 11. Extracted the nutch-0.8-dev.war file to the webapps/ROOT
 folder.

 Opened the nutch-site.xml and added the following snippet to reflect the
 search folder:

 <property>
   <name>searcher.dir</name>
   <value>c:/localfs/database</value>
   <description>
   Path to root of crawl. This directory is searched (in
   order) for either the file search-servers.txt, containing a list of
   distributed search servers, or the directory "index" containing
   merged indexes, or the directory "segments" containing segment
   indexes.
   </description>
 </property>


 12. Searching locally was a bit slow, so I changed the hosts.ini file to map
 the machine name to localhost. That increased search speed considerably.

 13. Modified the search.jsp and cached servlet to view word and pdf as
 demanded by the user, seamlessly.


 I hope this helps folks who are trying to adopt nutch for the local file
 system.
 Personally, I believe corporations should adopt nutch rather than buying a
 google appliance :)




--
www.babatu.com



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




RE: Problems Installing

2006-04-02 Thread sudhendra seshachala
Rename the file as ROOT.war (all upper case).
Then http://localhost:8080 should work.

Paul Stewart [EMAIL PROTECTED] wrote: Thanks for the reply...

I re-did what you mentioned below. It re-installed just fine (I'm
running Fedora Core 4 and installed with yum using rpm's).

Even when I rename it, I must access it now via
http://www.myserver..:8080/root

or else I get a 404 not found...

When I try and do a search I get the same error

Any other thoughts? :)

Paul

-Original Message-
From: Dan Morrill [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 02, 2006 2:17 PM
To: nutch-user@lucene.apache.org
Subject: RE: Problems Installing

Did you:

1. remove the root.war from tomcat?
2. rename nutch.war to root.war and dump that into webapps under tomcat?
3. did it install ok (can you see the exploded pages under webapps root)?

Just checking; this is how I fixed the same issue under windows.

r/d

-Original Message-
From: Paul Stewart [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 11:00 AM
To: nutch-user@lucene.apache.org
Subject: Problems Installing

Hi there...

I am trying to get nutch running Have done a trial indexing run
successfully etc...

Now I'm running into issues that may be more Tomcat related than Nutch:

HTTP Status 500 - 




type Exception report

message 

description The server encountered an internal error () that prevented
it from fulfilling this request.

exception 

org.apache.jasper.JasperException
 org.apache.jasper.servlet.JspServletWrapper.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, java.lang.String, java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 org.apache.jasper.servlet.JspServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
 org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.Request, org.apache.catalina.Response, org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
 org.apache.coyote.tomcat5.CoyoteAdapter.service(org.apache.coyote.Request, org.apache.coyote.Response) (/usr/lib/libcatalina-5.0.30.jar.so)
 org.apache.coyote.http11.Http11Processor.process(java.io.InputStream, java.io.OutputStream) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(org.apache.tomcat.util.net.TcpConnection, java.lang.Object[]) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[]) (/tmp/libtomcat-util-5.0.30.jar.socuf3wu.so)
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() (/tmp/libtomcat-util-5.0.30.jar.socuf3wu.so)
 java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)


root cause

java.lang.NullPointerException
 org.apache.nutch.searcher.NutchBean.init(java.io.File, java.io.File) (Unknown Source)
 org.apache.nutch.searcher.NutchBean.NutchBean(java.io.File) (Unknown Source)
 org.apache.nutch.searcher.NutchBean.NutchBean() (Unknown Source)
 org.apache.nutch.searcher.NutchBean.get(javax.servlet.ServletContext) (Unknown Source)
 org.apache.jsp.search_jsp._jspService(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (Unknown Source)
 org.apache.jasper.runtime.HttpJspBase.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-runtime-5.0.30.jar.so)
 javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
 org.apache.jasper.servlet.JspServletWrapper.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, java.lang.String, java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 org.apache.jasper.servlet.JspServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
 javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
 org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.Request, org.apache.catalina.Response, org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
 

Re: nutch config setup to crawl/query for word/pdf files

2006-03-29 Thread sudhendra seshachala
 OOPS, my bad. I was seeing 0.8 Dev. 

Michael Ji [EMAIL PROTECTED] wrote: hi Sudhendra:

I didn't see a file with such a name
(parse-plugins.xml) in the nutch/conf/ folder.

Should I create it by myself? Any tutorial I could
follow to set it up?

thanks,

Michael,

--- sudhendra seshachala  wrote:

 Have you checked parse-plugins.xml in conf/
 
 Thanks
 
 
   Sudhi Seshachala
   http://sudhilogs.blogspot.com/

 
 






  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Removing urls from webdb

2006-03-22 Thread sudhendra seshachala
I guess the problem is with the package name:
  src/java/org/apache/nutch/tools.PruneDB vs.
  org.apache.nutch.toos.PruneDB in the command you ran.
   
  Can you please verify again? It seems to be a typo; the command should be
'nutch org.apache.nutch.tools.PruneDB db -s'.
   
  Thanks

keren nutch [EMAIL PROTECTED] wrote:
  Hi Matt,

Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and
ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get the
error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/tools/PruneDB

Please let me know where I'm wrong.

Keren

Matt Kangas wrote: I'm puzzled by the claim that "it takes ~4 hours to remove a
url from the webdb". If you're removing them one at a time, yes, because you
have to rewrite the entire webdb for any change. But you want to
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to process 11M urls through the
URLFilter chain)
= 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need a
RegexURLFilter with two patterns defined (a minus for the bad site,
and a plus for all else).
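
As a sketch, the per-URL check can be as small as this (the bad-site pattern
is a placeholder; Nutch's URLFilter extension point uses the same
null-means-drop convention):

import java.util.regex.Pattern;

public class BadSiteFilter {
  // a minus for the bad site...
  private static final Pattern BAD =
      Pattern.compile("^http://([a-z0-9]*\\.)*badsite\\.example/");

  // ...and a plus (pass-through) for everything else.
  public String filter(String url) {
    return BAD.matcher(url).find() ? null : url;  // null = pruned from the webdb
  }
}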

Using my PruneDBTool, as discussed earlier, you can eliminate all of 
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

 Actually, we have 11,000,000 urls in the webdb.

 Keren

 Insurance Squared Inc. wrote: We've got a website that is causing our
 crawler to slow down (from 20mbits down to 3-5) - 400K pages that are
 basically not available; we're just getting 404's. I'd like to remove them
 from the DB to get our crawl speed back up again.

 Here's what our developer told me - I'm stumped; that seems really odd.
 Is there a better way to remove a URL so that it doesn't get crawled?

 Running nutch 0.71 on a dual xeon with 8 gigs of ram.

 -
 There are more than 400,000 urls in the webdb. It takes ~4 hours
 to remove a url from the webdb. That means that it'll take ~1,600,000
 hours (~66,666 days, or ~ months, ~185 years) to remove 400,000 CAA
 urls from the webdb. Do you really want to remove them in this way?


--
Matt Kangas / [EMAIL PROTECTED]








  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: crawling pdf and word file

2006-03-22 Thread sudhendra seshachala
In nutch-default.xml,
include the plugins for word and PDF as below:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

But the recommendation is to include the property in nutch-site.xml.

Hope this helps.

Michael Ji [EMAIL PROTECTED] wrote: 
hi there,

Is there any specific setting need to be added in
configuration file in order to crawl and index pdf and
word file?

thanks,

Michael,




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Crawling sites with Encoded URLs

2006-03-08 Thread sudhendra seshachala
Hi
  I have been trying to crawl sites with encoded URLs, and am trying to escape
characters in crawl-urlfilter.txt.
  For some reason, it does not seem to be working... One solution is to extend
the crawler... are there any other options? :) Please let me know.
   
   
  Thanks
  Sudhi


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Multi dimensional searches

2006-03-06 Thread sudhendra seshachala
I have been using nutch for learning purposes, as to how it works, so far. I
have been fairly successful in actually getting it up and running for some
sites on my local machine.
   
  I sincerely thank the vibrant group helping me and many others.
   
  I have some questions or issues, however, that you all might want to consider.
   
  The idea is to build a niche search, based on some parameters such as location
(city, state, zip code and radius).
  I believe Nutch is a fabulous way to build upon a location-based search.
  Again, the location-based search is just one dimension. There are other
dimensions as well.
  I noticed there is a GeoPosition plugin. Has anyone used this plugin in the US?
  Just wanted to see how I could re-use the framework.
  Furthermore, has anyone built a two-dimensional search?
  For instance, someone searching Hotels
  should get all the hotels globally.
  But someone searching hotels in San Jose, CA
  should get those hotels located in the city of San Jose only.
  Then someone searching hotels in San Jose 95129
  should get only hotels located in that area... and a 5-10 mile radius by
default.
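
For the radius part of that example, the core computation is just a
great-circle distance check against a geocoded location field; a
self-contained sketch (the coordinates and radius values are illustrative):

public class RadiusFilter {
  private static final double EARTH_RADIUS_MILES = 3958.8;

  // Haversine great-circle distance between two lat/lon points, in miles.
  public static double distanceMiles(double lat1, double lon1,
                                     double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
  }

  // e.g. keep a hotel only if it is within the default 5-10 mile radius
  public static boolean withinRadius(double hotelLat, double hotelLon,
                                     double queryLat, double queryLon,
                                     double radiusMiles) {
    return distanceMiles(hotelLat, hotelLon, queryLat, queryLon) <= radiusMiles;
  }
}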
   
  I can always write a unidimensional search like just hotels.
   
  I crawl the hotels database and get it indexed (with some filtering based on
what I do not want to be filtered).
   
  If I also have to build the other search dimensions, is there a rule book to
follow? Has anyone done it before?
   
  Any kind of insights or thoughts would be very helpful.
   
  Thanks
  Sudhi
   
   
   


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: project vitality?

2006-03-03 Thread sudhendra seshachala
I could not agree with Doug more. This is one of the best. I am trying UIMA
too... though UIMA also uses Lucene. As of today, it is still a framework and
community in early stages.
   
  In fact, the nightly builds have better improvements than 0.7.1.
  Any serious user or adopter should be trying a snapshot of the nightly
build.
   
  Doug,
  it would be better if there were an official 0.8 release, or at least an RC,
before the major 1.0 release. I am a newbie, so let me know about ideas on
releasing 0.8.
   
  Thanks
  Sudhi
  

Doug Cutting [EMAIL PROTECTED] wrote:
  Richard Braman wrote:
 I think it is still very much at proof of concept stage. I think it is
 close, but as you have mentioned, the website Is severely out of date
 and the information and documentation on it lacks luster.

It stands to reason that if the documentation lacks luster the project 
must be dead! Seriously, this is an active project. It is not yet 1.0, 
so don't expect polish. If it doesn't look easily usable to you then 
perhaps it is not. It's still for early adopters.

The commit list shows a fair amount of activity:

http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html

Lots of public sites are using Nutch. Some are listed at 
http://wiki.apache.org/nutch/PublicServers, but many are not, like 
http://search.bittorrent.com/.

 I have tried
 to get the tutorial and faqs updated, but I haven't heard back.

This is an all-volunteer project. If you find a bug, please file a bug 
report, so that other folks are aware of it. Better yet, if you have a 
solution or improvement, please construct a patch file (even for 
documentation) and attach it to a bug report. On the wiki, anyone can 
make themselves an account and update documentation. We don't boss 
folks around here, or complain. We pitch in and help.

Doug



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Exception from crawl command

2006-03-02 Thread sudhendra seshachala
Okay.
  Have you tried the 0.8 version? It seems more stable than the 0.7.x
(the one you are using).
  It is a bit different too, with Hadoop and nutch being separate.
  I had a few issues using 0.7.x, but with the nightly build (0.8) I was up to
speed comparatively sooner.
   
  I hope this helps. I am not trying to dodge the problem, just that the
next release is more stable, and moreover, there is no backward compatibility
for 0.8.x (that is what I read in one of the mail archives). You are better off
using 0.8.
   
  Thanks
  Sudhi
   
  

[EMAIL PROTECTED] wrote:
  Hi,

sorry for the fumbled reply, I've tried deleting the
directory and starting the crawl from scratch a number
of times, with very similar results.

The system seems to be generating the exception after
the fetch block of the output after an apparently
arbitrary depth. It leaves the directory with a db
folder containing:

Mar 2 09:30 dbreadlock
Mar 2 09:31 dbwritelock
Mar 2 09:30 webdb
Mar 2 09:31 webdb.new

The webdb.new folder contains:

Mar 2 09:30 pagesByURL
Mar 2 09:30 stats
Mar 2 09:31 tmp

I have the following set in my nutch-site.xml file:



<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the
  RegexUrlNormalizer class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.</description>
</property>




I don't think any of this should cause the problem. 
I'm going to try reinstalling and setting everything
up again, but if anyone has any idea what the problem
might be then please let me know.

cheers,


Julian.


--- sudhendra seshachala wrote:

 Delete the folder/database and then re-issue the
 crawl command.
 The database/folder gets created when crawl is used.
 I am a recent user too... but I did get the same
 message, and I corrected it by deleting the folder. If
 anyone has better ideas, please share.
 
 Thanks
 
 [EMAIL PROTECTED] wrote:
 Hi,
 
 I've been experimenting with nutch and lucene,
 everything was working fine, but now I'm getting an
 exception thrown from the crawl command.
 
 The command manages a few fetch cycles but then I
 get
 the following message:
 
 060301 161128 status: segment 20060301161046, 38 pages, 0 errors, 856591 bytes, 41199 ms
 060301 161128 status: 0.92235243 pages/s, 162.43396 kb/s, 22541.87 bytes/page
 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
 060301 161129 Updating for C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
 060301 161129 Processing document 0
 060301 161130 Finishing update
 060301 161130 Processing pagesByURL: Sorted 952 instructions in 0.02 seconds.
 060301 161130 Processing pagesByURL: Sorted 47600.0 instructions/second
 java.io.IOException: already exists: C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
 at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
 at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
 at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
 at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
 at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
 Exception in thread "main"
 
 Does anyone have any ideas what the problem is
 likely
 to be. I am running nutch 0.7.1
 
 thanks,
 
 
 Julian.
 
 
 
 Sudhi Seshachala
 http://sudhilogs.blogspot.com/
 
 
 
 




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Running the crawl.. can any one point me to step by step guide ?

2006-02-28 Thread sudhendra seshachala
I built the nightly build after creating the folders.
  But when I run a crawl, I get the following errors. I am using cygwin. I
am not able to figure out what input is missing... can anyone help?
  $ bin/nutch crawl urls.txt -dir c:/SearchEngine/Database
060228 100707 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100707 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 crawl started in: c:\SearchEngine\Database
060228 100708 rootUrlDir = urls.txt
060228 100708 threads = 10
060228 100708 depth = 5
060228 100708 Injector: starting
060228 100708 Injector: crawlDb: c:\SearchEngine\Database\crawldb
060228 100708 Injector: urlDir: urls.txt
060228 100708 Injector: Converting injected urls to crawl db entries.
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 Running job: job_ofko1u
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing c:\SearchEngine\Database\local\localRunner\job_ofko1u.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , c:\SearchEngine\Database\local\localRunner\job_ofko1u.xml final: hadoop-site.xml
        at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84)
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70)
060228 100709  map 0%  reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Nutch 0.8 -building WAR file

2006-02-27 Thread sudhendra seshachala
Hi there,
I got the nightly build, and if I try to run 'ant war' I get the following error:

BUILD FAILED
C:\kool\nutch-nightly\build.xml:94: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build.xml:9: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\clustering-carrot2\build.xml:26: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build-plugin.xml:97: srcdir C:\kool\nutch-nightly\src\plugin\nutch-extensionpoints\src\java does not exist!

I guess I am missing something. Can someone point me in the exact direction where
I can get the missing things?


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Whole Web Indexing

2006-02-23 Thread sudhendra seshachala
Is invertlinks supported or not? I am using nutch 0.7.1 and am getting a
NoClassDefFoundError. Or should I use a compiled version? Can someone help me
here?  From the tutorial's Whole-web: Indexing section: Before indexing we first
invert all of the links, so that we may index incoming anchor text with the pages.

bin/nutch invertlinks crawl/linkdb crawl/segments

  To index the segments we use the index command, as follows:

bin/nutch index indexes crawl/linkdb crawl/segments/*


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Nutch 0.8 version required..

2006-02-23 Thread sudhendra seshachala
The latest version I could see in the SVN is 0.7.1.
  Where can I get 0.8? Source code is even better.
  Could I just grab it from the nightly builds?
   
  Please let me know.
   
  Thanks





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Nutch 0.8 version required..

2006-02-23 Thread sudhendra seshachala
Thanks Stefan.
But when I compiled, the jar size was just 318 kB for 0.8-dev, whereas the 0.7.1
release was 718 kB.
Am I missing something?

Sudhi
Stefan Groschupf [EMAIL PROTECTED] wrote: 
http://cvs.apache.org/dist/lucene/nutch/nightly/

On 24.02.2006 at 01:44, sudhendra seshachala wrote:

 The latest version I could see in the SVN is 0.7.1,
   Where can I get 0.8., source code is even better.
   Could I just grab from nightly builds ?

   Please let me know..

   Thanks





   Sudhi Seshachala
   http://sudhilogs.blogspot.com/



   

-
blog: http://www.find23.org
company: http://www.media-style.com





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Nutch and HTTrack Crawler

2006-02-22 Thread sudhendra seshachala
Is there a way I could use HTTrack for crawling and nutch for just searching?
 Has anybody done this before, and is there a comparison between the crawlers?
 
 How easy or tough is it to customize the nutch crawler for a specific vertical?
 I know a crude way of writing a crawler, but was wondering if anyone has
actually done a custom crawler.

Thanks
Sudhi




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


