Perhaps I have my terminology wrong, so I may be looking at this the wrong way. If I want to distribute my search across multiple nodes, with only a portion of the data on each node, is it just a matter of using mergesegs to get the number and size of segments I want, rebuilding the index (housekeeping, dedup, invert, etc.) with the new set of segments, and then copying a portion of the segments to each search server, along with the whole crawldb, index, indexes, and linkdb directories?
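Sketching those steps out, I imagine something like the following. This is only my guess at the commands, assuming Nutch 0.9-era tooling; all paths, segment names, node names, and counts are placeholders, and I believe the SegmentMerger slicing option is actually spelled -slice (URLs per output segment), which may be what "-split" refers to:

```shell
# Rough sketch only -- paths, segment names, and counts are made up.
# 1. Re-slice the existing segments into smaller ones.
bin/nutch mergesegs crawl/segments_new -dir crawl/segments -slice 100000

# 2. Rebuild the linkdb and the index over the new segments, then dedup.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments_new
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments_new/*
bin/nutch dedup crawl/indexes

# 3/4. Copy half the segments to each node (globs are hypothetical --
# segment directories are named by timestamp), plus the shared directories.
scp -r crawl/segments_new/2009092300* node1:crawl/segments/
scp -r crawl/segments_new/2009092301* node2:crawl/segments/
for n in node1 node2; do
  scp -r crawl/crawldb crawl/linkdb crawl/indexes "$n":crawl/
done
```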
So:

1. mergesegs ...... -split 100 (creates 20 segments)
2. index, invert, dedupe and stuff
3. scp 10 segments to node 1
4. scp 10 segments to node 2

As long as the slave nodes are configured correctly, a search will span both nodes? Is the above somewhat correct?

The split-index questions others have asked, along with seeing that others are indexing 50M+ pages across several nodes, lead me to believe there is some sort of standard process or set of tools for distributing the index and segments across multiple nodes. So far I don't have enough grasp of the terminology to know what to search for, or else others are keeping tight-lipped about how they are doing it. If I do manage to get this working, with the help of others, I'd be willing to write up a quick tutorial/FAQ to hopefully stop newbies like me from asking this over and over again. :-)

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Wed, Sep 23, 2009 at 5:48 AM, Jesse Hires <jhi...@gmail.com> wrote:

> Exactly! Sorry for being so confusing in my original question.
>
> Jesse
>
> On Wed, Sep 23, 2009 at 4:45 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:
>
>> OK, I will paraphrase the question.
>>
>> Consider that I want to use distributed search across three servers: one primary and two secondary nodes.
>>
>> I create a single BIG index using a distributed crawl on other computers. Now I want to split this single BIG index into two parts to put on the search nodes.
>>
>> How can this be achieved?
>>
>> Best Regards
>> Alexander Aristov
>>
>> 2009/9/23 Koch Martina <k...@huberverlag.de>
>>
>> > Hi Jesse,
>> >
>> > I'm not sure what you're trying to achieve. Do you want to use the distributed search, or do you want to split an existing index?
>> > None of these tasks is a prerequisite for the other.
>> > If you want to split an index, there are several ways to do it; which to choose depends on the reason for the split.
>> > If you want to use the distributed search, you just need two or more separate indexes: start a search server for each, and configure the searcher.dir property in nutch-site.xml to point to the search-servers.txt file, where you enter the hosts and ports of your search servers (detailed description: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html).
>> >
>> > Kind regards,
>> > Martina
>> >
>> > -----Original Message-----
>> > From: Jesse Hires [mailto:jhi...@gmail.com]
>> > Sent: Wednesday, September 23, 2009 04:59
>> > To: nutch-user@lucene.apache.org
>> > Subject: splitting an index (yes, again)
>> >
>> > My apologies in advance.
>> >
>> > I've been digging through the mail archives for information on splitting the index after crawling, but I'm getting even more confused, or the information is too incomplete for a newbie like myself.
>> >
>> > I see references to using mergesegs, but not enough to make an educated guess (at least at my level, which I admit is low right now).
>> >
>> > I've worked my way through the tutorial here:
>> > http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial
>> > and have a working site on a single computer. I have four more computers to add, and would like to try distributed search.
>> >
>> > When I read that tutorial up to the Distributed Searching portion, followed by "split the index", it mentions this link:
>> > http://wiki.apache.org/nutch/%5Bhttp%3A//www.nabble.com/Lucene-index-manipulation-tools-tf2781692.html#a7760917
>> > But that may as well be saying "then some magic happens".
>> >
>> > Does anyone have step-by-step instructions for splitting the index for use in distributed search, using mergesegs or otherwise? It doesn't have to have a lot of explanation, just a list of example steps.
>> >
>> > Mostly this is experimental for me, with no bigger plans than my own education, but because I am starting completely fresh at this, some things are still quite confusing.
>> >
>> > Thanks,
>> > Jesse
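P.S. For anyone finding this thread later: my reading of Martina's description is that the distributed-search wiring amounts to the following. Hostnames, the port number, and all paths below are placeholders I made up, not anything stated in the thread:

```shell
# Sketch of the distributed-search setup described above (placeholders only).

# 1. On the front-end node, list the search servers, one "host port" per line:
mkdir -p /data/search
printf 'node1 9999\nnode2 9999\n' > /data/search/search-servers.txt

# 2. In conf/nutch-site.xml on the front-end, inside <configuration>,
#    point searcher.dir at the directory holding search-servers.txt:
#      <property>
#        <name>searcher.dir</name>
#        <value>/data/search</value>
#      </property>

# 3. On each slave node, start a search server over its local crawl data
#    (crawldb, linkdb, its share of the segments, and indexes):
bin/nutch server 9999 /data/crawl
```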