[Nutch-general] Re: Merging indexes -- please help....

Zaheed Haque Tue, 04 Apr 2006 09:26:49 -0700

You might want to try this, but I am not sure that it works :-) Please
make backups first!! This is a workaround.


I assume that you have two working indexes, i.e. "CrawlA" and "CrawlB"
(ready to go and working like a charm via the browser :-). I am also
taking for granted that all the directories, such as index, indexes and
segments, are inside the directories "CrawlA" and "CrawlB".

Now make a new directory called "CrawlC"

mkdir CrawlC
cd CrawlC
mkdir crawldb
cd crawldb
mkdir current
cd current

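Now copy the crawldb parts. The paths below are relative to the
directory that holds CrawlA, CrawlB and CrawlC, so cd back there first:

cp -r CrawlA/crawldb/current/part-00000 CrawlC/crawldb/current/part-00000
cp -r CrawlB/crawldb/current/part-00000 CrawlC/crawldb/current/part-00001

NOTE the part-00001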

Now make a directory "segments" under CrawlC

mkdir CrawlC/segments

Now copy the segments

cp -r CrawlA/segments/* CrawlC/segments/
cp -r CrawlB/segments/* CrawlC/segments/

etc..

Now you should have two directories under CrawlC:

crawldb
segments
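
The copy steps above can be collected into one shell function. This is
only a rough sketch, not something I have run against real crawl data:
the function name assemble_crawl and its three arguments are made up for
illustration, and it assumes CrawlC does not exist yet and that the
segment directory names in CrawlA and CrawlB do not collide.

```shell
#!/bin/sh
# Sketch only: combine two crawl directories into a fresh third one.
# assemble_crawl is an illustrative name; it is not part of Nutch.
assemble_crawl() {
    a=$1    # first crawl directory, e.g. CrawlA
    b=$2    # second crawl directory, e.g. CrawlB
    c=$3    # new combined directory, e.g. CrawlC (assumed not to exist yet)

    mkdir -p "$c/crawldb/current" "$c/segments" || return 1

    # CrawlA keeps its part number; CrawlB's part is copied as part-00001.
    cp -r "$a/crawldb/current/part-00000" "$c/crawldb/current/part-00000" || return 1
    cp -r "$b/crawldb/current/part-00000" "$c/crawldb/current/part-00001" || return 1

    # Copy every segment directory from both crawls.
    cp -r "$a"/segments/* "$c/segments/" || return 1
    cp -r "$b"/segments/* "$c/segments/" || return 1
}
```

Called from the directory that holds both crawls, this would be
assemble_crawl CrawlA CrawlB CrawlC.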

Proceed with

- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes
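
The four commands above can be chained so that a failing step stops the
rest. Again just a sketch under assumptions: rebuild_index and its
arguments are illustrative names, and it assumes the commands are meant
to run from inside CrawlC with the nutch launcher given by absolute
path. The nutch subcommands and their arguments are exactly the ones
listed above.

```shell
#!/bin/sh
# Sketch only: rebuild linkdb and index inside the combined crawl directory.
# rebuild_index is an illustrative name; it is not part of Nutch.
rebuild_index() {
    crawl_dir=$1    # combined crawl directory, e.g. CrawlC
    nutch=$2        # absolute path to the bin/nutch launcher (assumption)
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$crawl_dir" || exit 1
        # Each step runs only if the previous one succeeded.
        "$nutch" invertlinks linkdb segments/* &&
        "$nutch" index indexes crawldb linkdb segments/* &&
        "$nutch" dedup indexes &&
        "$nutch" merge index indexes
    )
}
```

For example: rebuild_index CrawlC /path/to/nutch/bin/nutch, where the
second path is a placeholder for wherever your Nutch is installed.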

Change your searcher.dir in nutch-site.xml to point at CrawlC and give it a go..
Cheers

On 4/4/06, Olive g <[EMAIL PROTECTED]> wrote:
> We too have deadlines :(.
>
> I would appreciate it very much if someone can provide more insight. Is it a
> bug or a configuration issue? How can we even do incremental crawls on 0.8
> with these issues?
>
> Should I send email to the developer mailing list? Would that help?
>
> Gurus, please help !!!!
>
>
>
> >From: "Vertical Search" <[EMAIL PROTECTED]>
> >Reply-To: [email protected]
> >To: [email protected]
> >Subject: Re: Merging indexes -- please help....
> >Date: Tue, 4 Apr 2006 10:11:51 -0500
> >
> >Sorry, I too have faced the same problem. I am in the process of releasing
> >for a demo (management) over this weekend.
> >I will try to work on the merging stuff after that. It is a very important
> >part and I have to get it to work if I am to succeed in adopting Nutch for
> >a vertical domain.
> >Furthermore, I could not get the PruneIndexTool up and running.
> >It asks for a query. I wonder if someone can share the query file, or the
> >format the tool expects.
> >
> >But it goes without saying: I am very thankful to the folks here for
> >extending their help.
> >
> >Thanks
> >
> >
> >
> >On 4/4/06, Olive g <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > I encountered the same problem on 0.8. See my post
> > >
> >http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.
> > > Anyone has any idea? Is it a bug or a configuration issue? Please let me
> > > know.
> > > Thanks.
> > >
> > > Olive
> > >
> > > >From: "Dan Morrill" <[EMAIL PROTECTED]>
> > > >Reply-To: [email protected]
> > > >To: <[email protected]>
> > > >Subject: RE: Merging indexes -- please help....
> > > >Date: Mon, 3 Apr 2006 05:18:34 -0700
> > > >
> > > >Hi,
> > > >
> > > > >I noticed that it didn't like it when I used the drive designation
> > > > >(Windows Cygwin environment). If you run
> > > > >
> > > > >./nutch merge -local /STG1/index /STG1/indexes
> > > > >
> > > > >that may work better. Let me know.
> > > >
> > > >Cheers/r/dan
> > > >H
> > > >-----Original Message-----
> > > >From: Vertical Search [mailto:[EMAIL PROTECTED]
> > > >Sent: Sunday, April 02, 2006 7:07 PM
> > > >To: [email protected]
> > > >Subject: Re: Merging indexes -- please help....
> > > >
> > > > >Okay.
> > > > >I had 2 sets of crawls,
> > > > >E:/STG1 and E:/STG2.
> > > > >I used the dedup command to remove duplicates.
> > > > >Then the command I used to merge is as follows
> > > > >(based on what has been available in the mail archives and the
> > > > >responses I got).
> > > > >
> > > > >First I ran
> > > >
> > > >  bin/nutch merge E:/STG1/index E:/STG1/indexes
> > > >   bin/nutch merge E:/STG1/index E:/STG2/indexes
> > > >
> > > > >In nutch-site.xml I have searcher.dir set to E:/STG1
> > > > >
> > > > >I get absolutely no results. The console output is as follows.
> > > > >Can someone shed some light on this please, ASAP?
> > > >
> > > >INFO: creating new bean
> > > >Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
> > > >INFO: opening merged index in E:\Hoodukoo\STG5\index
> > > >Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
> > > >INFO: opening segments in E:\Hoodukoo\STG5\segments
> > > > >Apr 2, 2006 8:58:36 PM org.apache.hadoop.conf.Configuration getConfResourceAsReader
> > > > >INFO: found resource common-terms.utf8 at file:/C:/xampp/tomcat/webapps/hoodukoo/WEB-INF/classes/common-terms.utf8
> > > >Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
> > > >INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
> > > >Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
> > > >INFO: query request from 127.0.0.1
> > > >Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
> > > >INFO: query: site
> > > >Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
> > > >INFO: searching for 20 raw hits
> > > >
> > >
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
