Perhaps I have my terminology wrong, so I may be looking at this the wrong way. If I want to distribute my search across multiple nodes, with only a portion of the data on each node, is it just a matter of using mergesegs to get the number and size of segments I want, rebuilding the index (housekeeping, dedup, invert, etc.) with the new set of segments, and then copying a portion of the segments to each search server, along with the whole crawldb, index, indexes, and linkdb directories?
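Sketching those steps out, I imagine something like the following. This is only my guess at the commands, assuming Nutch 0.9-era tooling; all paths, segment names, node names, and counts are placeholders, and I believe the SegmentMerger slicing option is actually spelled -slice (URLs per output segment), which may be what "-split" refers to:

```shell
# Rough sketch only -- paths, segment names, and counts are made up.
# 1. Re-slice the existing segments into smaller ones.
bin/nutch mergesegs crawl/segments_new -dir crawl/segments -slice 100000

# 2. Rebuild the linkdb and the index over the new segments, then dedup.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments_new
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments_new/*
bin/nutch dedup crawl/indexes

# 3/4. Copy half the segments to each node (globs are hypothetical --
# segment directories are named by timestamp), plus the shared directories.
scp -r crawl/segments_new/2009092300* node1:crawl/segments/
scp -r crawl/segments_new/2009092301* node2:crawl/segments/
for n in node1 node2; do
  scp -r crawl/crawldb crawl/linkdb crawl/indexes "$n":crawl/
done
```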
So:

1. mergesegs ...... -split 100 (creates 20 segments)
2. index, invert, dedupe and stuff
3. scp 10 segments to node 1
4. scp 10 segments to node 2

As long as the slave nodes are configured correctly, a search will span both nodes? Is the above somewhat correct?

The split-index questions others have asked, along with seeing that others are indexing 50M+ pages across several nodes, lead me to believe there is some sort of standard process or set of tools for distributing the index and segments across multiple nodes. So far I don't have enough grasp of the terminology to know what to search for, or else others are keeping tight-lipped about how they are doing it. If I do manage to get this working, with the help of others, I'd be willing to write up a quick tutorial/FAQ to hopefully stop newbies like me from asking this over and over again. :-)

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Wed, Sep 23, 2009 at 5:48 AM, Jesse Hires <jhi...@gmail.com> wrote:

> Exactly! Sorry for being so confusing in my original question.
>
> Jesse
>
> On Wed, Sep 23, 2009 at 4:45 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:
>
>> OK, I will paraphrase the question.
>>
>> Consider that I want to use distributed search across three servers: one primary and two secondary nodes.
>>
>> I create a single BIG index using a distributed crawl on other computers. Now I want to split this single BIG index into two parts to put on the search nodes.
>>
>> How can this be achieved?
>>
>> Best Regards
>> Alexander Aristov
>>
>> 2009/9/23 Koch Martina <k...@huberverlag.de>
>>
>> > Hi Jesse,
>> >
>> > I'm not sure what you're trying to achieve. Do you want to use the distributed search, or do you want to split an existing index?
>> > None of these tasks is a prerequisite for the other.
>> > If you want to split an index, there are several ways to do it; which to choose depends on the reason for the split.
>> > If you want to use the distributed search, you just need two or more separate indexes: start a search server for each, and configure the searcher.dir property in nutch-site.xml to point to the search-servers.txt file, where you enter the hosts and ports of your search servers (detailed description: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html).
>> >
>> > Kind regards,
>> > Martina
>> >
>> > -----Original Message-----
>> > From: Jesse Hires [mailto:jhi...@gmail.com]
>> > Sent: Wednesday, September 23, 2009 04:59
>> > To: nutch-user@lucene.apache.org
>> > Subject: splitting an index (yes, again)
>> >
>> > My apologies in advance.
>> >
>> > I've been digging through the mail archives for information on splitting the index after crawling, but I'm getting even more confused, or the information is too incomplete for a newbie like myself.
>> >
>> > I see references to using mergesegs, but not enough to make an educated guess (at least at my level, which I admit is low right now).
>> >
>> > I've worked my way through the tutorial here:
>> > http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial
>> > and have a working site on a single computer. I have four more computers to add, and would like to try distributed search.
>> >
>> > When I read that tutorial up to the Distributed Searching portion, followed by "split the index", it mentions this link:
>> > http://wiki.apache.org/nutch/%5Bhttp%3A//www.nabble.com/Lucene-index-manipulation-tools-tf2781692.html#a7760917
>> > But that may as well be saying "then some magic happens".
>> >
>> > Does anyone have step-by-step instructions for splitting the index for use in distributed search, using mergesegs or otherwise? It doesn't have to have a lot of explanation, just a list of example steps.
>> >
>> > Mostly this is experimental for me, with no bigger plans than my own education, but because I am starting completely fresh at this, some things are still quite confusing.
>> >
>> > Thanks,
>> > Jesse
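P.S. For anyone finding this thread later: my reading of Martina's description is that the distributed-search wiring amounts to the following. Hostnames, the port number, and all paths below are placeholders I made up, not anything stated in the thread:

```shell
# Sketch of the distributed-search setup described above (placeholders only).

# 1. On the front-end node, list the search servers, one "host port" per line:
mkdir -p /data/search
printf 'node1 9999\nnode2 9999\n' > /data/search/search-servers.txt

# 2. In conf/nutch-site.xml on the front-end, inside <configuration>,
#    point searcher.dir at the directory holding search-servers.txt:
#      <property>
#        <name>searcher.dir</name>
#        <value>/data/search</value>
#      </property>

# 3. On each slave node, start a search server over its local crawl data
#    (crawldb, linkdb, its share of the segments, and indexes):
bin/nutch server 9999 /data/crawl
```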