On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki<a...@getopt.org> wrote:
> Doğacan Güney wrote:
>>
>> Hey list,
>>
>> On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki<a...@getopt.org> wrote:
>>>
>>> Hi all,
>>>
>>> I think we should be creating a sandbox area, where we can collaborate
>>> on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
>>> will be importing his HBase work as 'nutchbase'. Tika work is the least
>>> disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
>>> like to tackle) means significant refactoring so I'd rather put this on a
>>> branch too.
>>>
>>
>> Thanks for starting the discussion, Andrzej.
>>
>> Can you detail your OSGI plugin framework design? Maybe I missed the
>> discussion but updating the plugin system has been something that I
>> wanted to do for a long time :) so I am very much interested in your
>> design.
>
> There's no specific design yet except I can't stand the existing plugin
> framework anymore ... ;) I started reading on OSGI and it seems that it
> supports the functionality that we need, and much more - it certainly looks
> like a better alternative than maintaining our plugin system beyond 1.x ...
>

Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

> Oh, an additional comment about the scoring API: I don't think the claimed
> benefits of OPIC outweigh the widespread complications that it caused in the
> API. Besides, getting the static scoring right is very very tricky, so from
> the engineer's point of view IMHO it's better to do the computation offline,
> where you have more control over the process and can easily re-run the
> computation, rather than rely on an online unstable algorithm that modifies
> scores in place ...
>

Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring
will feel very natural in a hbase-backed nutch. Give me a couple more days
to polish the scoring API then we can change it if you are not happy with it.

>
>>
>>> Dogacan, you mentioned that you would like to work on Katta integration.
>>> Could you shed some light on how this fits with the abstract indexing &
>>> searching layer that we now have, and how distributed Solr fits into this
>>> picture?
>>>
>>
>> I haven't yet given much thought to Katta integration. But basically,
>> I am thinking of indexing newly-crawled documents as lucene shards and
>> uploading them to katta for searching. This should be very possible with
>> the new indexing system. But so far, I have neither studied katta too
>> much nor given much thought to integration. So I may be missing obvious
>> stuff.
>
> Me too..
>
>> About distributed solr: I very much like to do this and again, I
>> think, this should be possible to do within nutch. However, distributed
>> solr is ultimately uninteresting to me because (AFAIK) it doesn't have
>> the reliability and high-availability that hadoop&hbase have, i.e. if a
>> machine dies you lose that part of the index.
>
> Grant Ingersoll is doing some initial work on integrating distributed Solr
> and Zookeeper, once this is in a usable shape then I think perhaps it's more
> or less equivalent to Katta. I have a patch in my queue that adds direct
> Hadoop->Solr indexing, using Hadoop OutputFormat. So there will be many
> options to push index updates to distributed indexes. We just need to offer
> the right API to implement the integration, and the current API is IMHO
> quite close.
>
>>
>> Are there any projects going on that are live indexing systems like
>> solr, yet are backed up by hadoop HDFS like katta?
>
> There is the Bailey.sf.net project that fits this description, but it's
> dormant - either it was too early, or there were just too many design
> questions (or simply the committers moved to other things).
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>

--
Doğacan Güney