RE: Merging indexes -- please help....

2006-04-04 Thread Olive g

Hi,

I encountered the same problem on 0.8. See my post:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html
Does anyone have any idea? Is it a bug or a configuration issue? Please let me
know.

Thanks.

Olive


From: Dan Morrill [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: RE: Merging indexes -- please help
Date: Mon, 3 Apr 2006 05:18:34 -0700

Hi,

I noticed that it didn't like the drive designation (Windows/Cygwin environment).
If you drop it and run

./nutch merge -local /STG1/index /STG1/indexes

that may work better; let me know.

Cheers/r/dan
H
-Original Message-
From: Vertical Search [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 7:07 PM
To: nutch-user@lucene.apache.org
Subject: Re: Merging indexes -- please help

Okay.
I had 2 sets of crawls, such as E:/STG1 and E:/STG2.
I used the dedup command to remove duplicates.
Then the command I used to merge is as follows,
based on what has been available in the mail archives and the responses I got.

First I ran:

 bin/nutch merge E:/STG1/index E:/STG1/indexes
 bin/nutch merge E:/STG1/index E:/STG2/indexes

In nutch-site.xml I have searcher.dir set to E:/STG1.

I get absolutely no results... The command console output is as follows.
Can someone shed some light on this please, ASAP?

INFO: creating new bean
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening merged index in E:\Hoodukoo\STG5\index
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening segments in E:\Hoodukoo\STG5\segments
Apr 2, 2006 8:58:36 PM org.apache.hadoop.conf.Configuration getConfResourceAsReader
INFO: found resource common-terms.utf8 at file:/C:/xampp/tomcat/webapps/hoodukoo/WEB-INF/classes/common-terms.utf8
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
INFO: query request from 127.0.0.1
Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
INFO: query: site
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
INFO: searching for 20 raw hits







Meta-Refresh Question

2006-04-04 Thread Dennis Kubes
Silly question, but Nutch won't follow meta-refreshes, will it?

Dennis




Re: Merging indexes -- please help....

2006-04-04 Thread Vertical Search
Sorry, I have faced the same problem too. I am in the process of releasing
a demo (for management) over this weekend.
I will try to work on the merging stuff after that... It is a very important
part, and I have to get it to work if I am to succeed in adopting Nutch for
a vertical domain.
Furthermore, I could not get the PruneIndexTool up and running.
It asks for a query. I wonder if someone can share the query file or the format
the tool expects.

It goes without saying that I am very thankful to the folks here for extending
their help.

Thanks
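Regarding the PruneIndexTool question above: as far as I recall, the tool reads a plain
text file of Lucene queries, one query per line, with '#' lines treated as comments. The
option name (-queries) and the class name below are from memory, so verify them against
the usage output of your build before relying on this sketch.

  # prune-queries.txt -- one Lucene query per line (assumed format)
  host:spam.example.com
  url:"http://www.example.com/private/"

  bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -dryrun -queries prune-queries.txt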



On 4/4/06, Olive g [EMAIL PROTECTED] wrote:

 Hi,

 I encountered the same problem on 0.8. See my post
 http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.
 Anyone has any idea? Is it a bug or a configuration issue? Please let me
 know.
 Thanks.

 Olive

 From: Dan Morrill [EMAIL PROTECTED]
 Reply-To: nutch-user@lucene.apache.org
 To: nutch-user@lucene.apache.org
 Subject: RE: Merging indexes -- please help
 Date: Mon, 3 Apr 2006 05:18:34 -0700
 
 Hi,
 
 I noticed that when I used the drive designation that it didn't like that
 (windows cygwin environment) if you did
 
 ./nutch merge -local /STG1/index /STG1/indexes that may work better, let
 me
 know.
 
 Cheers/r/dan
 H
 -Original Message-
 From: Vertical Search [mailto:[EMAIL PROTECTED]
 Sent: Sunday, April 02, 2006 7:07 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Merging indexes -- please help
 
 Okay.
 I had 2 sets of crawl
 such as E:/STG1 and E/STG2
 I used the dedup command to remove duplicates
 Then I the command i used to merge is as follows
 based on what have been available on mail archieves and responses I got
 
 First I can
 
   bin/nutch merge E:/STG1/index E:/STG1/indexes
bin/nutch merge E:/STG1/index E:/STG2/indexes
 
 In the nutch-site .xml I have searcher.dir ad E:/STG1
 
 I get the absolutely no results...The command console is as follows.
 Can some one shed some light on this please ASAP..
 
 INFO: creating new bean
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening merged index in E:\Hoodukoo\STG5\index
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening segments in E:\Hoodukoo\STG5\segments
 Apr 2, 2006 8:58:36 PM
 org.apache.hadoop.conf.ConfigurationgetConfResourceAsRea
 der
 INFO: found resource common-terms.utf8 at
 file:/C:/xampp/tomcat/webapps/hoodukoo
 /WEB-INF/classes/common-terms.utf8
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
 Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
 INFO: query request from 127.0.0.1
 Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
 INFO: query: site
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
 INFO: searching for 20 raw hits
 





Re: Merging indexes -- please help....

2006-04-04 Thread Olive g

We too have deadlines :(.

I would appreciate it very much if someone could provide more insight. Is it a bug or a
configuration issue? How can we even do incremental crawls on 0.8 with these issues?


Should I send email to the developer mailing list? Would that help?

Gurus, please help 




From: Vertical Search [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: Merging indexes -- please help
Date: Tue, 4 Apr 2006 10:11:51 -0500

Sorry. I too have faced the same problem.. I am in process of releasing for
a demo  (mangement) over this weekend.
I will try to work on merging stuff after that... IT is a very important
part and have to get it to work, if I have to succeed in adopting Nutch for
a vertical domain.
Further more. I could not get the PruneIndexTool up and running.
It asks for query. I wonder if some can share the query file or format, the
tool expects.

But goes without saying.. I am very thankful for folks here extending the
help.

Thanks



On 4/4/06, Olive g [EMAIL PROTECTED] wrote:

 Hi,

 I encountered the same problem on 0.8. See my post
 
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.

 Anyone has any idea? Is it a bug or a configuration issue? Please let me
 know.
 Thanks.

 Olive

 From: Dan Morrill [EMAIL PROTECTED]
 Reply-To: nutch-user@lucene.apache.org
 To: nutch-user@lucene.apache.org
 Subject: RE: Merging indexes -- please help
 Date: Mon, 3 Apr 2006 05:18:34 -0700
 
 Hi,
 
 I noticed that when I used the drive designation that it didn't like 
that

 (windows cygwin environment) if you did
 
 ./nutch merge -local /STG1/index /STG1/indexes that may work better, 
let

 me
 know.
 
 Cheers/r/dan
 H
 -Original Message-
 From: Vertical Search [mailto:[EMAIL PROTECTED]
 Sent: Sunday, April 02, 2006 7:07 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Merging indexes -- please help
 
 Okay.
 I had 2 sets of crawl
 such as E:/STG1 and E/STG2
 I used the dedup command to remove duplicates
 Then I the command i used to merge is as follows
 based on what have been available on mail archieves and responses I 
got

 
 First I can
 
   bin/nutch merge E:/STG1/index E:/STG1/indexes
bin/nutch merge E:/STG1/index E:/STG2/indexes
 
 In the nutch-site .xml I have searcher.dir ad E:/STG1
 
 I get the absolutely no results...The command console is as follows.
 Can some one shed some light on this please ASAP..
 
 INFO: creating new bean
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening merged index in E:\Hoodukoo\STG5\index
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening segments in E:\Hoodukoo\STG5\segments
 Apr 2, 2006 8:58:36 PM
 org.apache.hadoop.conf.ConfigurationgetConfResourceAsRea
 der
 INFO: found resource common-terms.utf8 at
 file:/C:/xampp/tomcat/webapps/hoodukoo
 /WEB-INF/classes/common-terms.utf8
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
 INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
 Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
 INFO: query request from 127.0.0.1
 Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
 INFO: query: site
 Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
 INFO: searching for 20 raw hits
 









Re: Merging indexes -- please help....

2006-04-04 Thread Zaheed Haque
You might want to try this, but I am not sure if it works :-) Please
make backups first!! This is a workaround.

I assume that you have two working indexes, i.e. CrawlA and CrawlB
(ready to go and working like a charm via the browser :-). I am
taking for granted that all directories like index, indexes, segments
etc. are inside CrawlA and CrawlB.

Now make a new directory called CrawlC

mkdir CrawlC
cd CrawlC
mkdir crawldb
cd crawldb
mkdir current
cd current

Now copy the crawldb parts:

cp -r CrawlA/crawldb/current/part-0 CrawlC/crawldb/current/part-0
cp -r CrawlB/crawldb/current/part-0 CrawlC/crawldb/current/part-1

NOTE the rename to part-1 for the second copy.

Now make a directory segments under CrawlC
cd to CrawlC/segments

Now copy the segments:

cp -r CrawlA/segments/* CrawlC/segments/
cp -r CrawlB/segments/* CrawlC/segments/

etc..

Now you should have two directories under CrawlC:

crawldb
segments

Proceed with

- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes

Change your searcher.dir in nutch-site.xml and give it a go..
Cheers
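
Pulled together, the workaround above amounts to the following sketch, assuming CrawlA and
CrawlB sit on the local filesystem, commands are run from the Nutch install directory, and
allowing that the crawldb part directories may be named part-00000-style on your system:

  mkdir -p CrawlC/crawldb/current CrawlC/segments

  # copy the crawldb parts, renaming the second so the part directories stay sequential
  cp -r CrawlA/crawldb/current/part-0 CrawlC/crawldb/current/part-0
  cp -r CrawlB/crawldb/current/part-0 CrawlC/crawldb/current/part-1

  # copy every segment from both crawls into the combined crawl
  cp -r CrawlA/segments/* CrawlC/segments/
  cp -r CrawlB/segments/* CrawlC/segments/

  # rebuild the linkdb and the indexes over the combined data
  bin/nutch invertlinks CrawlC/linkdb CrawlC/segments/*
  bin/nutch index CrawlC/indexes CrawlC/crawldb CrawlC/linkdb CrawlC/segments/*
  bin/nutch dedup CrawlC/indexes
  bin/nutch merge CrawlC/index CrawlC/indexes

Then point searcher.dir at CrawlC.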

On 4/4/06, Olive g [EMAIL PROTECTED] wrote:
 We too have deadlines :(.

 I would appreciate it very much if someone can provide more insight. Is it a
 bug or
 configuration issue? How can we even do incremental crawsl on 0.8 with these
 issues?

 Should I send email to the developer mailing list? Would that help?

 Gurus, please help 



 From: Vertical Search [EMAIL PROTECTED]
 Reply-To: nutch-user@lucene.apache.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Merging indexes -- please help
 Date: Tue, 4 Apr 2006 10:11:51 -0500
 
 Sorry. I too have faced the same problem.. I am in process of releasing for
 a demo  (mangement) over this weekend.
 I will try to work on merging stuff after that... IT is a very important
 part and have to get it to work, if I have to succeed in adopting Nutch for
 a vertical domain.
 Further more. I could not get the PruneIndexTool up and running.
 It asks for query. I wonder if some can share the query file or format, the
 tool expects.
 
 But goes without saying.. I am very thankful for folks here extending the
 help.
 
 Thanks
 
 
 
 On 4/4/06, Olive g [EMAIL PROTECTED] wrote:
  
   Hi,
  
   I encountered the same problem on 0.8. See my post
  
 http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.
   Anyone has any idea? Is it a bug or a configuration issue? Please let me
   know.
   Thanks.
  
   Olive
  
   From: Dan Morrill [EMAIL PROTECTED]
   Reply-To: nutch-user@lucene.apache.org
   To: nutch-user@lucene.apache.org
   Subject: RE: Merging indexes -- please help
   Date: Mon, 3 Apr 2006 05:18:34 -0700
   
   Hi,
   
   I noticed that when I used the drive designation that it didn't like
 that
   (windows cygwin environment) if you did
   
   ./nutch merge -local /STG1/index /STG1/indexes that may work better,
 let
   me
   know.
   
   Cheers/r/dan
   H
   -Original Message-
   From: Vertical Search [mailto:[EMAIL PROTECTED]
   Sent: Sunday, April 02, 2006 7:07 PM
   To: nutch-user@lucene.apache.org
   Subject: Re: Merging indexes -- please help
   
   Okay.
   I had 2 sets of crawl
   such as E:/STG1 and E/STG2
   I used the dedup command to remove duplicates
   Then I the command i used to merge is as follows
   based on what have been available on mail archieves and responses I
 got
   
   First I can
   
 bin/nutch merge E:/STG1/index E:/STG1/indexes
  bin/nutch merge E:/STG1/index E:/STG2/indexes
   
   In the nutch-site .xml I have searcher.dir ad E:/STG1
   
   I get the absolutely no results...The command console is as follows.
   Can some one shed some light on this please ASAP..
   
   INFO: creating new bean
   Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
   INFO: opening merged index in E:\Hoodukoo\STG5\index
   Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
   INFO: opening segments in E:\Hoodukoo\STG5\segments
   Apr 2, 2006 8:58:36 PM
   org.apache.hadoop.conf.ConfigurationgetConfResourceAsRea
   der
   INFO: found resource common-terms.utf8 at
   file:/C:/xampp/tomcat/webapps/hoodukoo
   /WEB-INF/classes/common-terms.utf8
   Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
   INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
   Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
   INFO: query request from 127.0.0.1
   Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
   INFO: query: site
   Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
   INFO: searching for 20 raw hits
   
  
  
  


Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki

Olive g wrote:

Hi Andrzej  other gurus who might be reading this message :-):

I ran some tests and somehow my query returned 0 hit against merged 
indexes. Here is my test case and it's a bit long, thank you in 
advance for your patience:


1. crawled the first 100 urls

  ~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 > test1.log


2. set searcher.dir to test1

3. query for movie
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

 it returned 64 hits (a web search with Tomcat returned the same result)


4. crawled the second 100 urls

~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 > test2.log


5. set searcher.dir to test2

6. query for movie
 ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
 it returned 55 hits (a web search with Tomcat returned the same result)


7.  attempted to merge using the following command:
 ../search/bin/nutch merge test3 test1 test2 > merge-test3
 returned error:
 Exception in thread "main" java.rmi.RemoteException: java.io.IOException:
 Cannot open filename /user/root/test1/crawldb/segments
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)

8.  attempted to merge again using the following command:
../search/bin/nutch merge test4 test1/indexes test2/indexes > merge-test4

  merged successfully with no errors

9. set searcher.dir to test4

10.  query for movie by:
  ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
 and it returned 0 hits (a web search with Tomcat returned the same result)


 060403 201545 10 opening segments in test4/segments
 060403 201545 10 found resource common-terms.utf8 at
 file:/root/nutch/search/conf/common-terms.utf8
 060403 201545 10 opening linkdb in test4/linkdb
 Total hits: 0

It appeared to be looking for test4/segments and test4/linkdb which 
did not exist?


Well, the short answer is that you cannot at the moment merge crawldbs 
or linkdbs. As a consequence, you cannot use multiple outputs of 'nutch 
crawl' together (because NutchBean needs to reference a single linkdb 
during searching).


This is technically possible, but simply not implemented (yet).
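
From the NutchBean log lines quoted above, the directory handed to searcher.dir is expected
to contain roughly the following (layout inferred from the logs rather than from
documentation):

  test4/
    index/      merged Lucene index (or an indexes/ directory of per-part indexes)
    segments/   segment data, needed for summaries and cached pages
    linkdb/     link database, needed at search time

The test4 directory produced in step 8 contains only the merged index, with no segments or
linkdb alongside it.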

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Olive g
Thank you! Zaheed sent out a workaround in another thread, as follows. Do you think this
would work (on Nutch 0.8 with DFS)?

Also, when do you expect to port the feature to 0.8 (I know it's not the highest priority for
you :)) - but really, merging indexes is critical for incremental crawls. Is it possible that
it can be implemented sooner? Please ... Our project depends on this ...

Thanks again for your help!

Olive



From :  Zaheed Haque [EMAIL PROTECTED]

Reply-To :  nutch-user@lucene.apache.org
Sent :  Tuesday, April 4, 2006 4:12 PM
To :  nutch-user@lucene.apache.org
Subject :  Re: Merging indexes -- please help

You might want to try this but I am not sure if it works :-) Please
make backups before!! This is a work around..

I assume that you have two working index i.e CrawlA and CrawlB
(Ready to go and works like a charm via the browser :-). Ok I am
taking for granted that all directory like index, indexes, segments
etc are in the directory CrawlA and CrawlB

Now make a new directory called CrawlC

mkdir CrawlC
cd CrawlC
mkdir crawldb
cd crawldb
mkdir current
cd current

Now copy the

cp -r CrawlA/crawldb/current/part-0 to CrawlC/crawldb/current/part-0
cp -r CrawlB/crawldb/current/part-0 to CrawlC/crawldb/current/part-1

NOTE the part-1

Now make a directory segments under CrawlC
cd to CrawlC/segments

Now copy the

cp-r CrawlA/segments/* to CrawlC/segments/*
cp-r CrawlB/segments/* to CrawlC/segments/*

etc..

Now you should have under CrawlC two directory

crawldb
segments

Proceed with

- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes

Change your searcher.dir in nutch-site.xml and give it a go..




From: Andrzej Bialecki [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: Query on merged indexes returned 0 hit - test case included 
(Nutch 0.8)

Date: Tue, 04 Apr 2006 18:29:07 +0200

Olive g wrote:

Hi Andrzej  other gurus who might be reading this message :-):

I ran some tests and somehow my query returned 0 hit against merged 
indexes. Here is my test case and it's a bit long, thank you in advance 
for your patience:


1. crawled the first 100 urls

  ~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1  
test1.log


2. set searcher.dir to test1

3. query for movie
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

 it returned 64 hits (a web research with tomcat returned the same 
result)


4. crawled the second 100 urls

~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1  
test2.log


5. set searcher.dir to test2

6. query for movie
 ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
 it returned 55 hits (a web research with tomcat returned the same 
result)


7.  attempted to merge using the following command:
 ../search/bin/nutch merge test3 test1 test2  merge-test3
 returned error:
 Exception in thread main java.rmi.RemoteException: 
java.io.IOException: Cannot

open filename /user/root/test1/crawldb/segments
   at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)

8.  attempted to merge again using the following command:
../search/bin/nutch merge test4 test1/indexes test2/indexes  
merge-test4

  merged successfully with no errors

9. set searcher.dir to test4

10.  query for movie by:
  ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
 and it returned 0 hit (a web research with tomcat returned the same 
result)


 060403 201545 10 opening segments in test4/segments
 060403 201545 10 found resource common-terms.utf8 at
 file:/root/nutch/search/conf/common-terms.utf8
 060403 201545 10 opening linkdb in test4/linkdb
 Total hits: 0

It appeared to be looking for test4/segments and test4/linkdb which did 
not exist?


Well, the short answer is that you cannot at the moment merge crawldbs or 
linkdbs. As a consequence, you cannot use multiple outputs of 'nutch crawl' 
together (because NutchBean needs to reference a single linkdb during 
searching).


This is technically possible, but simply not implemented (yet).

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








RE: nutch config setup to crawl/query for word/pdf files

2006-04-04 Thread Teruhiko Kurosaka
You don't need pdf or msword variants of the index-* and query-* plugins;
there are no such plugins.

-kuro
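
The format-specific work lives in the parse plugins; a plugin.includes along the lines of
the following is enough (adapted from the local-filesystem write-up later in this digest,
with protocol-http assumed here for a web crawl):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>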


Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki

Olive g wrote:
Thank you! Zaheed sent out a workaround in another thread as follows. 
Do you think this would

work (on Nutch 0.8 w/ DFS).



Yes, it should work. This is a cheap way to merge two DBs - thanks 
Zaheed! Just remember to rename the part-x dirs so that they are 
sequential.


Also, when do you expect to port the feature to 0.8 (I know it's not 
the highest priority for
you :)) - but really, merging index is critical for incremental 
crawls. Is it possible that it can be

implemented sooner? Please ... Our project depends on this ...



These features (incremental updates, merging indexes) are already 
supported if you use individual command-line tools and a single DB. So, 
I'm not planning to do anything about it.
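
For anyone hunting for those individual command-line tools, one round of a single-DB
incremental crawl in 0.8 looks roughly like the sketch below. It is assembled from the
commands quoted elsewhere in this digest plus the standard generate/fetch/updatedb steps,
so treat the exact argument order as something to double-check against your build:

  # one incremental round against a single crawl directory crawl/
  bin/nutch generate crawl/crawldb crawl/segments          # select URLs due for fetching
  segment=`ls -d crawl/segments/* | tail -1`               # the segment just created
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment                # fold fetch results into the crawldb
  bin/nutch invertlinks crawl/linkdb crawl/segments/*      # rebuild the link database
  # remove any old crawl/indexes and crawl/index before re-indexing, then:
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes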


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: Meta-Refresh Question

2006-04-04 Thread Dennis Kubes
I searched through the code and the problem is the URL returned for the
meta-refresh is like this:

http://www.oneforever.com/tohomepage.do;jsessionid=F3C8BBAC224990A9214A1785E5001AFD

Which matches the RegexURLFilter for this pattern:

-[?*!@=]   (because of the = sign)

So my question is: should the URL be cleaned up inside HttpBase, where it is grabbed from
the page content, or would it be better to put in a URL filter that matches it before it
gets eliminated by the filter above?

Dennis
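
A third option, if your 0.8 build still has it, is the regex URL normalizer rather than a
filter: setting urlnormalizer.class to org.apache.nutch.net.RegexUrlNormalizer and adding a
rule to conf/regex-normalize.xml can strip the session id before the RegexURLFilter ever
sees the URL. The property name, file name and rule syntax below are from memory, so treat
this as a sketch to verify against nutch-default.xml:

  <!-- conf/regex-normalize.xml: drop jsessionid path parameters -->
  <regex-normalize>
    <regex>
      <pattern>;jsessionid=[0-9A-Za-z]+</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>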

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 04, 2006 9:56 AM
To: nutch-user@lucene.apache.org
Subject: Re: Meta-Refresh Question

Dennis Kubes wrote:
 Silly question but nutch won't follow meta-refreshes will it?


It should have, parse-html has support for this
(ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see
that one of the necessary pieces (in Fetcher) didn't make it to 0.8.
Please create a JIRA issue so that it doesn't escape our attention.
Thank you!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Separate search and index servers?

2006-04-04 Thread Scott Simpson
I currently have Nutch 0.8 set up with two HDFS machines that store and
process searches and another machine that is both the HDFS index server
and the machine running Tomcat to run searches against. Is it possible to
separate the search machine from the index machine? I want to put the
index machine on a highly available HA cluster using the Linux Heartbeat
HA system since it always needs to be around. I then want to create a set of
search machines that a load balancer will feed searches to and these
machines will in turn send requests to the HDFS machines. Does this make
sense?
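
That split is essentially what Nutch's distributed search already does: the Tomcat front
ends do not need the index locally if searcher.dir points at a directory containing a
search-servers.txt instead. The command and file syntax below are from memory rather than
checked against 0.8, so confirm them with bin/nutch's usage output:

  # on each index/back-end machine: serve the local crawl data on some port
  bin/nutch server 9999 /data/crawl

  # on each Tomcat search machine: searcher.dir points at a directory whose only
  # content is search-servers.txt, one "host port" pair per line, e.g.
  #   indexhost1 9999
  #   indexhost2 9999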

 



Re: Crawling a file but not indexing it

2006-04-04 Thread Benjamin Higgins
Okay, that sounds good.  Two questions:

* If I don't want to index a document, then from BasicIndexingFilter.filter,
should I just return the document I receive?  Or should I return null?  Or
something else?

* What change(s) do I have to make to HtmlParser?  It seems like I can use
the Parser object as-is, e.g. parse.getData().get(index) to get the
meta-data value for index.  What am I missing?

Thanks for the pointers!

Ben


On 4/3/06, TDLN [EMAIL PROTECTED] wrote:

 It depends if you control the seed pages or not; if you do, you could tag
 them index=no
 and skip them during indexing. You would have to change HtmlParser and
 BasicIndexingFilter.

 Rgrds, Thomas

 On 4/4/06, Benjamin Higgins [EMAIL PROTECTED] wrote:
 
  Hello,
 
  I've gone through the documentation and tried searching the mailing list
  archives.  I bet this has come up before, but I just couldn't find
  it.  So,
  if someone could point me to a past discussion that would be great.
 
  What I want to do is be able to crawl html files for links, but not
  actually
  index that file.  I ask this because I have several seed pages that are
  not
  meant for human consumption, so I never want them to show up in search
  results.
 
  How can this be accomplished?
 
  Thanks in advance,
 
  Ben
 
 




Re: Crawling the local file system with Nutch - Document-

2006-04-04 Thread sudhendra seshachala
I just modified search.jsp. Basically, I set the content type based on the document type I
was querying. The rest is handled by the protocol and the browser.

I can send the code if you would like.

Thanks

kauu [EMAIL PROTECTED] wrote:
Thanks for your idea!
But I have a question:
how do I modify search.jsp and the cached servlet to view Word and PDF as
demanded by the user, seamlessly?



On 4/1/06, Vertical Search wrote:

 Nutchians,
 I have tried to document the sequence of steps to adopt nutch to crawl and
 search the local file system on a Windows machine.
 I have been able to do it successfully using nutch 0.8 Dev.
 The configuration is as follows:
 *Inspiron 630m
 Intel(r) Pentium(r) M Processor 760 (2GHz/2MB Cache/533MHz, Genuine
 Windows XP
 Professional)*
 *If someone can review it, it will be very helpful.*

 Crawling the local filesystem with nutch
 Platform: Microsoft / nutch 0.8 Dev
 For a linux version, please refer to
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 The link did help me get it off the ground.

 I have been working on adopting nutch in a vertical domain. All of a sudden,
 I was asked to develop a proof of concept
 to adopt nutch to crawl and search the local file system.
 Initially I did face some problems, but some mail archives did help me
 proceed further.
 The intention is to provide an overview of the steps to crawl local file systems
 and search through the browser.

 I downloaded the nutch nightly build.
 1. Create the environment variable such as NUTCH_HOME. (Not mandatory,
 but
 helps)
 2. Extract the downloaded nightly build. 
 3. Create a folder -- c:/LocalSearch -- and copy the following folders and
 libraries:
 1. bin/
 2. conf/
 3. *.job, *.jar and *.war files
 4. urls/ 
 5. Plugins folder
 4. Modify the nutch-site.xml to include the Plugin folder
 5. Modify the nutch-site.xml to set plugin.includes. An example is as follows:

 <configuration>

   <property>
     <name>plugin.includes</name>
     <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
   </property>

   <property>
     <name>file.content.limit</name>
     <value>-1</value>
   </property>

 </configuration>

 6. Modify crawl-urlfilter.txt
 Remember we have to crawl the local file system. Hence we have to modify
 the
 entries as follows

 #skip http:, ftp:,  mailto: urls
 ##-^(file|ftp|mailto):

 -^(http|ftp|mailto):

 #skip image and other suffixes we can't yet parse

 -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

 #skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]

 #accept hosts in MY.DOMAIN.NAME
 #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

 #accept anything else
 +.*

 7. urls folder
 Create a file listing all the urls to be crawled. The file should have the urls
 as below;
 save the file under the urls folder.

 The directories should be in file:// format. Example entries were as
 follows

 file://c:/resumes/word 
 file://c:/resumes/pdf 

 #file:///data/readings/semanticweb/

 Nutch recognises that the third line does not contain a valid file-url and
 skips it

 As suggested by the link
 8. Ignoring the parent directories. As suggested in the linux flavor of
 local fs crawl, I did modify the code in
 org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
 java.io.File f).

 I changed the following line:

 this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
 to

 this.content = list2html(f.listFiles(), path, false);
 and recompiled.

 9. Compile the changes. I just compiled the whole source code base; it did not
 take more than 2 minutes.

 10. Crawling the file system.
 On my desktop I have a shortcut to Cygwin; double-click it, then:
 pwd
 cd ../../cygdrive/c/$NUTCH_HOME

 Execute
 bin/nutch crawl urls -dir c:/localfs/database

 Voila, that is it. After 20 minutes, the files were indexed, merged and all
 done.

 11. Extracted the nutch-0.8-dev.war file to the Tomcat webapps/ROOT folder.

 Opened the nutch-site.xml and added the following snippet to reflect the
 search folder:

 <property>
   <name>searcher.dir</name>
   <value>c:/localfs/database</value>
   <description>
     Path to root of crawl. This directory is searched (in
     order) for either the file search-servers.txt, containing a list of
     distributed search servers, or the directory "index" containing
     merged indexes, or the directory "segments" containing segment
     indexes.
   </description>
 </property>


 12. Searching locally was a bit slow, so I changed the hosts.ini file to map
 the machine name to localhost. That improved search speed considerably.

 13. Modified the search.jsp and cached servlet to view word and pdf as
 demanded by user seamlessly.


 I hope this helps folks who are trying to adopt nutch for local file
 system.
 Personally, I believe corporations should adopt nutch rather than buying a Google
 appliance :)




--
www.babatu.com



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




stackoverflow

2006-04-04 Thread Rajesh Munavalli
Hi,
I am getting a stack overflow error when I run the CrawlTool with
a depth of 5. Is the depth too high and resulting in the stack overflow? Or am I
messing up some other parameter? The URL file contains a single URL.

Thanks,
Rajesh
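
If the error really is a java.lang.StackOverflowError, a depth of 5 is not unusually deep in
itself; one cheap thing to try while debugging is a larger thread stack for the crawl JVM.
Whether bin/nutch passes a NUTCH_OPTS variable through to java is an assumption here, so
check the script (or add -Xss to its java invocation directly):

  # assumes bin/nutch forwards $NUTCH_OPTS to the java command -- verify in the script
  NUTCH_OPTS="-Xss4m" bin/nutch crawl urls -dir crawl -depth 5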


Re: Query on merged indexes returned 0 hit - more issues

2006-04-04 Thread Olive g

Hi gurus,

I tried the workaround and I found some more issues. It appears to me that
invertlinks does not work properly with more than 5 input parts.


For example the following command (with number of map tasks set to 5 and the 
number of reduce tasks set to 5, using dfs, nutch 0.8)
../search/bin/nutch invertlinks test5/linkdb test5/segments/20060403192429 test5/segments/20060403193814 > linkdb-test5


generated basically the same error for all 5 reduce tasks:

java.rmi.RemoteException: java.io.IOException: Could not complete write to
file /user/root/test5/linkdb/362527374/part-0/.data.crc by DFSClient_441718647
        at java.lang.Throwable.<init>(Throwable.java:57)
        at java.lang.Throwable.<init>(Throwable.java:68)
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)


the contents of test5/segments/20060403192429/content/ are

/user/root/test5/segments/20060403192429/content/part-0 123617
/user/root/test5/segments/20060403192429/content/part-1 141105
/user/root/test5/segments/20060403192429/content/part-2 168565
/user/root/test5/segments/20060403192429/content/part-3 179788
/user/root/test5/segments/20060403192429/content/part-4 70356

the contents of test5/segments/20060403193814/content/ are

/user/root/test5/segments/20060403193814/content/part-0 103014
/user/root/test5/segments/20060403193814/content/part-1 159010
/user/root/test5/segments/20060403193814/content/part-2 92892
/user/root/test5/segments/20060403193814/content/part-3 103847
/user/root/test5/segments/20060403193814/content/part-4 102626

In the example above there are 10 input parts in two segments. I noticed 
that this doesn't happen when there are no more than 5 input parts and it 
consistently happens when there are more than 5, even if they are in the 
same segment.


The urgency of this problem is that it prevents incremental crawling, 
whether by merging segments or by incremental depth crawling, because after 
5 more incremental crawls we have 6 parts.


Please let me know what you think.

Thank you!

Olive




From: Andrzej Bialecki [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: Query on merged indexes returned 0 hit - test case included 
(Nutch 0.8)

Date: Tue, 04 Apr 2006 19:20:43 +0200

Olive g wrote:
Thank you! Zaheed sent out a workaround in another thread as follows. Do 
you think this would

work (on Nutch 0.8 w/ DFS).



Yes, it should work. This is a cheap way to merge two DBs - thanks Zaheed! 
Just remember to rename the part-x dirs so that they are sequential.


Also, when do you expect to port the feature to 0.8 (I know it's not the 
highest priority for
you :)) - but really, merging index is critical for incremental crawls. Is 
it possible that it can be

implemented sooner? Please ... Our project depends on this ...



These features (incremental updates, merging indexes) are already supported 
if you use individual command-line tools and a single DB. So, I'm not 
planning to do anything about it.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








Re: Adaptive fetch

2006-04-04 Thread D . Saravanaraj
Hi,

Has the patch for Adaptive Refetch been released? Considering an intranet
setting where nutch is used for indexing large numbers of static HTML pages, I hope this
feature plays a crucial role. Please update me on this.

Thanks,
D.Saravanaraj

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Raghavendra Prabhu wrote:
  I believe we had a recent mail with problem of redirection also (with
 this
  patch applied..)
 
  And as you said  more people testing the patch would be better.
 
  Considering that this has the highest votes for add-on features, it is a
  critical one i guess.
 

 Ok, I'll bring this patch up to date over the weekend.

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





nutch-svn: Reduce operations: can't open map output

2006-04-04 Thread Shawn Gervais

Hi list,

I downloaded the nightly build of nutch+hadoop, and have been trying to 
get it working on a small cluster of machines.


I have it working properly on a single machine, however when I try to 
have my map and reduce tasks run on the cluster slaves, I get the 
following exception:


060405 000219 SEVERE Can't open map 
output:/home2/nutch/filesystem/mapreduce/local/part-2.out/task_m_1v749p
java.io.FileNotFoundException: 
/home2/nutch/filesystem/mapreduce/local/part-2.out/task_m_1v749p
at 
org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:114)

(rest of stack trace snipped)

Oddly enough, this map task ran on the same machine which produced the
above error message. This is the output from the map task on the same
machine:


060405 000210 task_m_1v749p  Child starting
060405 000211 Server connection on port 50050 from 127.0.0.1: starting
060405 000211 task_m_1v749p  Client connection to 0.0.0.0:50050: starting
060405 000211 task_m_1v749p  Client connection to 10.10.0.3:9000: starting
060405 000211 task_m_1v749p  Using URL normalizer: 
org.apache.nutch.net.BasicUrlNormalizer

060405 000211 Server connection on port 50050 from 127.0.0.1: starting
060405 000211 task_m_1v749p  Client connection to 0.0.0.0:50050: starting
060405 000211 task_m_1v749p 1.0% /user/nutch/urls/urllist.txt:2+2
060405 000211 Task task_m_1v749p is done.
(parsing lines snipped for brevity)


All map tasks finish with the output above, however none of my reduce 
tasks are finishing. The problem exists when the map task, and the 
corresponding reduce task which depends on the map's output, are run on 
the same machine or different machines. In both cases I see an IPC 
timeout Exception being thrown, 1 minute (60,000 ms, as specified in the 
hadoop-default.xml file) after the above FileNotFound exception is 
generated.


Does anyone have any pointers as to where I should look to determine the 
reason the map output is not being generated, or is not able to be accessed?


Regards,
-Shawn


RE: stackoverflow

2006-04-04 Thread Babu, KameshNarayana (GE, Research, consultant)
Hi Rajesh,
Will you be able to tell me about deploying Nutch? Thanks in advance.

-Original Message-
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 05, 2006 3:48 AM
To: nutch-user@lucene.apache.org
Subject: stackoverflow


Hi,
I am getting a stick over flow error when I run the CrawlTool with
a depth of 5. Is the depth too high and resulting in stackoverflow? Or am I
messing up some other parameters?. The URL file contains a single URL.

Thanks,
Rajesh