Hi,

We're interested in using Hadoop in our application for replication and distributed execution of queries, but I have some questions about whether it's a good fit. We have essentially written a search engine using Jena (a Semantic Web framework) and its accompanying Lucene interface, LARQ (Lucene ARQ), to allow free-text search over the RDF graphs stored in Jena.
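For context, our queries look roughly like the following LARQ-style SPARQL sketch (the prefixes, graph vocabulary, and search terms are placeholders, not our actual data):

```sparql
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?doc ?title
WHERE {
  # pf:textMatch consults the external Lucene index that LARQ builds
  # over the literals in the model, rather than scanning the graph.
  ?title pf:textMatch "+semantic +search" .
  ?doc dc:title ?title .
}
```

It is this external Lucene index, built alongside the Jena store, that we expect to grow large and would like Hadoop to help manage.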
We expect the Lucene indexes to become very large, hence our interest in Hadoop. I have gone through the documentation on the site, but would like to clarify some points that we could not answer from the wiki, FAQ, etc.:

1. We're not using Nutch, but the documentation references it frequently. Is this a problem? Can Lucene indexes be used with Hadoop on their own, without Nutch?

2. Are there any best practices for using Hadoop behind such a setup, in terms of creating, querying, and managing the Lucene indexes? I found this thread ( http://www.mail-archive.com/[email protected]/msg00573.html ), but could use clarification on several of the points mentioned there.

3. How does Hadoop access, process, and replicate the Lucene indexes if we generate them on our local file system rather than in HDFS?

4. Could you outline the standard flow of execution when a Lucene query is served through Hadoop?

Thanks,
Vinaya
