All,
We have two versions of an index splitter. The first version would run
an indexing job and then, using the completed index as input, read the
number of documents in the index and take a requested split size. From
this it used a custom index input format to create splits according to
document id. We would run a job that mapped out index urls as keys and
documents, with their ids, wrapped in SerializableWritable objects as
the values. Then, inside a second job using the index as input, we had
a MapRunner that would read the other supporting databases (linkdb,
segments) and map all objects as ObjectWritables. On the reduce side we
had a custom Output and OutputFormat that took all of the objects and
wrote out the databases and indexes into each split.
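For illustration, here is a rough sketch (not our actual code, class and
variable names made up) of that first step, turning the completed index
and a requested split size into document-id ranges using the Lucene
IndexReader:

  import org.apache.lucene.index.IndexReader;

  // hypothetical sketch, Lucene 2.x-era API: turn maxDoc() and a
  // requested split size into [start, end) doc-id ranges, one per split
  public class DocIdRanges {
    public static int[][] compute(String indexDir, int splitSize)
        throws Exception {
      IndexReader reader = IndexReader.open(indexDir);
      try {
        int maxDoc = reader.maxDoc();
        int numSplits = (maxDoc + splitSize - 1) / splitSize;
        int[][] ranges = new int[numSplits][2];
        for (int i = 0; i < numSplits; i++) {
          ranges[i][0] = i * splitSize;                          // first doc id, inclusive
          ranges[i][1] = Math.min((i + 1) * splitSize, maxDoc);  // last doc id, exclusive
        }
        return ranges;
      } finally {
        reader.close();
      }
    }
  }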
There was a problem with this first approach, though: writing out an
index from a previously serialized document loses any fields that are
not stored (which is most of them), because reading a document back out
of an index only returns its stored fields. So we went with a second
approach.
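A quick made-up illustration of what gets lost (path and field names
are just examples and depend on the indexing filters in use):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class StoredFieldsOnly {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open("crawl/index");  // example path
      Document doc = reader.document(0);
      // a stored field such as "url" comes back fine...
      System.out.println(doc.get("url"));
      // ...but an indexed-but-not-stored field comes back null, so
      // re-indexing from this Document silently drops it
      System.out.println(doc.get("content"));
      reader.close();
    }
  }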
The second approach takes a number of splits and runs through an
indexing job on the fly. It calls the indexing and scoring filters. It
uses the linkdb, crawldb, and segments as input. As it indexes is also
splits the databases and indexes into the number of reduce tasks so that
the final output is multiple splits each hold a part of the index and
its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index. These parts
can then be pushed to separate search servers. This type of splitting
works well but you can NOT define a specific number of documents or urls
per split and sometimes one split will have alot more urls than another
if you are indexing some sites that have alot of pages (i.e. wikipedia
or cnn archives). This is currently how our system works. We fetch,
invert links, run through some other processes, and then index and split
on the fly. Then we use python scripts to pull each split directly from
the DFS to each search server and then start the search servers.
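To make the uneven-split point concrete: if the records are keyed by
url and partitioned by a plain hash (my rough approximation of what
happens, not our exact class), each reduce task, and therefore each
split, simply receives whatever urls hash to it:

  import org.apache.hadoop.io.ObjectWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // sketch: every record for a url goes to the same reduce task (split),
  // but how many urls land in each split is left entirely to the hash
  public class UrlHashPartitioner implements Partitioner<Text, ObjectWritable> {
    public void configure(JobConf job) {}
    public int getPartition(Text url, ObjectWritable value, int numReduceTasks) {
      return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

Nothing in that scheme lets you cap the number of documents per split,
which is the limitation described above.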
We are still working on the splitter because the ideal approach would
let you specify a number of documents per split as well as group by
different keys, not just url. I would be happy to share the current
code, but it is highly integrated, so I would need to pull it out of
our code base first. It would be best if I could send it to someone,
say Andrzej, to take a look at first.
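Purely as an illustration of what grouping by a different key could
look like (none of this is implemented yet, names are made up), the
split key could be derived from the host rather than the full url so
that all pages of a site stay together:

  import java.net.MalformedURLException;
  import java.net.URL;

  public class HostKey {
    // illustrative only: group by host (site) instead of by page
    public static String keyFor(String url) {
      try {
        return new URL(url).getHost();
      } catch (MalformedURLException e) {
        return url;   // fall back to the raw url
      }
    }
  }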
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
[...]
Having a new index on each machine and having to create separate
indexes is not the most elegant way to accomplish this architecture.
The best way that we have found is to have a splitter job that
indexes and splits the index and
Have you implemented a Lucene index splitter, i.e. a tool that takes
an existing Lucene index and splits it into parts by document id? This
sounds very interesting - could you tell us a bit about this?