I already have an implementation of a distributed crawler farm, where crawler instances are runnign on different boxes. I want to come up with a distributed indexing scheme using lucene and take advantage of the distributed nature of my crawlers' distributed nature. Here is what I am thinking.

Crawlers will analyze and tokenize the content for every URLs(aka Documents) and create the following data for every url document: <url-id, <field1, <term-f1-t1,term-f1-t2,term-f1-t3 etc.>> <field-2, <term-f2-t1,term-f2-t2,term-f2-t3, >> ...... >

And then based on some partitioning function the carwlers can send a subset of tokens(aka terms) to the indexing server. The partitioning function can be as simple as based on the starting character of the terms. Lets say if we have 5 indexers, we will distribute the indexing data in the following manner :

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z

Does it make any sense ? Also would like to know if there are other ways to distribute lucene's indexing/searching ?

thanks,
prasen

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to