Funnily enough, merge is not run as a job, so you end up with a single folder
that holds the merged index directly, with no part folders inside it.

Let's say you have 2 separate indexes created in 2 separate runs, one
located at crawl/index_1 and the other at crawl/index_2.

Each of those folders contains a folder named part-000 something, so your
tree looks like crawl/index_1/part-00000, right?

Now do the following:
1. bin/hadoop dfs -mkdir crawl/indexes
2. bin/hadoop dfs -cp crawl/index_1/part-00000 crawl/indexes/index_1_part_0
3. bin/hadoop dfs -cp crawl/index_2/part-00000 crawl/indexes/index_2_part_0
4. bin/nutch merge crawl/newindex crawl/indexes

When done you should have a new folder (crawl/newindex) with the merged
index in it.
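
If you want to script it, here is a rough sketch of the same four steps as a
shell function. The function name merge_indexes and the RUN variable are my
own invention for illustration (not part of Nutch or Hadoop); it assumes each
index folder has exactly one part-00000 and that you run it from the Nutch
install directory.

```shell
# Sketch only: stage part-00000 from each index run, then merge them.
# merge_indexes and RUN are hypothetical names, not part of Nutch/Hadoop.
# Set RUN=echo to print the commands instead of executing them.
merge_indexes() {
    mi_out=$1        # destination for the merged index, e.g. crawl/newindex
    mi_staging=$2    # folder that collects one subfolder per index run
    shift 2

    $RUN bin/hadoop dfs -mkdir "$mi_staging" || return 1
    for mi_idx in "$@"; do
        # crawl/index_1 -> crawl/indexes/index_1_part_0
        mi_name=$(basename "$mi_idx")
        $RUN bin/hadoop dfs -cp "$mi_idx/part-00000" \
            "$mi_staging/${mi_name}_part_0" || return 1
    done
    $RUN bin/nutch merge "$mi_out" "$mi_staging"
}

# Example (dry run):
#   RUN=echo
#   merge_indexes crawl/newindex crawl/indexes crawl/index_1 crawl/index_2
```

Doing a dry run with RUN=echo first lets you check the paths before anything
is copied.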

HTH,

Gal

-----Original Message-----
From: Brian Whitman [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 16, 2007 12:06 AM
To: [email protected]
Subject: Re: crawl indexes and part-00000

> The merge program doesn't care what the name of the folder is. It cares
> that it is in a certain structure.
>
> So if we assume you have a folder named indexes, the program wants each
> folder inside indexes (each representing a previous indexing run) to have
> a Lucene index in it (it looks for a folder named segments).


Thanks Gal for the explanation. It makes sense.

What doesn't though is that

bin/nutch merge crawl/index crawl/index_1 crawl/index_2 crawl/index

(i.e. merging three indexes including the previously merged one) will
not generate the part-00000 in crawl/index; it just dumps the merged
Lucene index directly into crawl/index. So then the next time I do a
crawl merge I have to manually move the crawl/index/* to
crawl/index/part-00000/.

But knowing this at least is helpful so I can update my scripts!

-Brian





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
