-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thank you very much for your reply. Yeah, our indexes (indices?) contain different types and amounts of data. :( The data being indexed is all the same format - RDF - but it describes different numbers and kinds of things.
What is your gut feeling on whether or not it's a good idea for us to roll our own? Katta is a contender, but we already have a fairly complex system, and adding anything Hadoop-related feels like it might push us over a tipping point into the realm of unwieldy overcomplexity. But, this is a hard problem after all, so some amount of complexity is inevitable. On 06/02/2011 07:05 PM, Erick Erickson wrote: > As you've found out, raw scores certainly aren't comparable across > different indexes > #unless# the documents are fairly distributed. You're talking large > indexes here, > so if the documents are balanced across all your indexes, the results should > be > pretty comparable. This pre-supposes that the indexes share a common schema > and that the distributions of terms are "close enough to identical" to be > truly > comparable. And it supposes that your indexes are similar in > character. It wouldn't > work if one of your indexes had, say, meta-data from videos and another had > scholarly journal articles. > > Otherwise, there's work going on in Solr that might help, although I > don't know when > that'll be available. > > Other than that, I don't know what to suggest. It's not an easy > problem or Solr/Lucene > would already have solved it.. siiiggggh. > > Best > Erick > > On Thu, Jun 2, 2011 at 3:51 PM, Clint Gilbert > <clint_gilb...@hms.harvard.edu> wrote: > Hi everyone, > > I searched the list archives, but couldn't find a question that closely > matches mine. > > The project I'm working on is designed to allow searching a distributed > collection of data repositories. Currently, we index each repository to > build a central Lucene index. This works ok, but for practical (the > central index is getting very large) and architectural (decentralization > is a design goal) reasons, we'd like to distribute the index. > > In the past, we had basic federation system in place: when a user > submitted a query, the query was broadcast to each data repository, > which had its own independent Lucene index. Results from each repo were > aggregated in reverse order. > > The problem was, of course, that since each index was constructed > independently of all the others, and documents are distributed in the > repos unevenly, it was impossible to rank the results from all the > indices in a meaningful way. We basically punted and interleaved > results, which didn't gave a bad user experience, hence the temporary > switch to a central index. > > So, what options exist for searching distributed collections of Lucene > indices and ranking results meaningfully? > > Katta seems promising, but I don't know enough about it yet. It also > seems to want to open its own ports for RPC. I'd prefer something that > could tunnel over HTTP to minimize firewall drama. (We will have 10s > and then 100s of data repos running in separate locations.) > > We're also considering a home-grown scheme involving normalizing the > denominators of all the index components in all our indices, based on > the sums of counts obtained from all the indices. This feels like > re-inventing the wheel, and it's not clear to me yet that the low-level > manipulation of indices that we'd need to do is even possible. > > Any suggestions for distributing indices while ranking results well are > very welcome! >> - --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk3oIZUACgkQ5IyIbnMUeTuemwCeMfolvNVEjve9fIEJHy3N3TV/ 0VIAn2Xf+ypB5PRS45ekmiXEDhmvDdhZ =jtE9 -----END PGP SIGNATURE----- --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org