All,
We have two versions of an index splitter. The first version would run
an indexing job and then, using the completed index as input, read the
number of documents in the index and take a requested split size. From
this it used a custom index input format to create splits according to
document id. We would run a job that mapped out index urls as keys and
documents, with their ids, wrapped in SerializableWritable objects as
the values. Then, inside a second job using the index as input, we had
a MapRunner that would read the other supporting databases (linkdb,
segments) and map all objects as ObjectWritables. On the reduce side we
had a custom Output and OutputFormat that took all of the objects and
wrote out the databases and indexes into each split.
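For illustration, here is a rough sketch (not our actual code, class and
variable names made up) of that first step, turning the completed index
and a requested split size into document-id ranges using the Lucene
IndexReader:

  import org.apache.lucene.index.IndexReader;

  // hypothetical sketch, Lucene 2.x-era API: turn maxDoc() and a
  // requested split size into [start, end) doc-id ranges, one per split
  public class DocIdRanges {
    public static int[][] compute(String indexDir, int splitSize)
        throws Exception {
      IndexReader reader = IndexReader.open(indexDir);
      try {
        int maxDoc = reader.maxDoc();
        int numSplits = (maxDoc + splitSize - 1) / splitSize;
        int[][] ranges = new int[numSplits][2];
        for (int i = 0; i < numSplits; i++) {
          ranges[i][0] = i * splitSize;                          // first doc id, inclusive
          ranges[i][1] = Math.min((i + 1) * splitSize, maxDoc);  // last doc id, exclusive
        }
        return ranges;
      } finally {
        reader.close();
      }
    }
  }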
There was a problem with this first approach, though: writing out an
index from a previously serialized document loses any fields that are
not stored (which is most of them), because reading a document back out
of an index only returns its stored fields. So we went with a second
approach.
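A quick made-up illustration of what gets lost (path and field names
are just examples and depend on the indexing filters in use):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class StoredFieldsOnly {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open("crawl/index");  // example path
      Document doc = reader.document(0);
      // a stored field such as "url" comes back fine...
      System.out.println(doc.get("url"));
      // ...but an indexed-but-not-stored field comes back null, so
      // re-indexing from this Document silently drops it
      System.out.println(doc.get("content"));
      reader.close();
    }
  }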
The second approach takes a number of splits and runs through an
indexing job on the fly. It calls the indexing and scoring filters. It
uses the linkdb, crawldb, and segments as input. As it indexes is also
splits the databases and indexes into the number of reduce tasks so that
the final output is multiple splits each hold a part of the index and
its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index. These parts
can then be pushed to separate search servers. This type of splitting
works well but you can NOT define a specific number of documents or urls
per split and sometimes one split will have alot more urls than another
if you are indexing some sites that have alot of pages (i.e. wikipedia
or cnn archives). This is currently how our system works. We fetch,
invert links, run through some other processes, and then index and split
on the fly. Then we use python scripts to pull each split directly from
the DFS to each search server and then start the search servers.
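To make the uneven-split point concrete: if the records are keyed by
url and partitioned by a plain hash (my rough approximation of what
happens, not our exact class), each reduce task, and therefore each
split, simply receives whatever urls hash to it:

  import org.apache.hadoop.io.ObjectWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // sketch: every record for a url goes to the same reduce task (split),
  // but how many urls land in each split is left entirely to the hash
  public class UrlHashPartitioner implements Partitioner<Text, ObjectWritable> {
    public void configure(JobConf job) {}
    public int getPartition(Text url, ObjectWritable value, int numReduceTasks) {
      return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

Nothing in that scheme lets you cap the number of documents per split,
which is the limitation described above.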
We are still working on the splitter because the ideal approach would
let you specify a number of documents per split as well as group by
different keys, not just url. I would be happy to share the current
code, but it is highly integrated, so I would need to pull it out of
our code base first. It would be best if I could send it to someone,
say Andrzej, to take a look at first.
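Purely as an illustration of what grouping by a different key could
look like (none of this is implemented yet, names are made up), the
split key could be derived from the host rather than the full url so
that all pages of a site stay together:

  import java.net.MalformedURLException;
  import java.net.URL;

  public class HostKey {
    // illustrative only: group by host (site) instead of by page
    public static String keyFor(String url) {
      try {
        return new URL(url).getHost();
      } catch (MalformedURLException e) {
        return url;   // fall back to the raw url
      }
    }
  }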
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
[...]
Having a new index on each machine and having to create separate
indexes is not the most elegant way to accomplish this architecture.
The best way that we have found is to have a splitter job that
indexes and splits the index and
Have you implemented a Lucene index splitter, i.e. a tool that takes
an existing Lucene index and splits it into parts by document id? This
sounds very interesting - could you tell us a bit about this?