[EMAIL PROTECTED] wrote:
> Hi,
>
> I see Andrzej is getting busy with JIRA ... new release in April?

That's the idea ... I could use some help, too ;)

> I just noticed something nice in Hadoop's svn repo - a distributed
> Lucene indexer. Is anyone thinking about providing support for that
> in Nutch? Or do people think this is not needed because in the end
> people tend to create a number of relatively small indices (5-10M
> docs) as opposed to one larger index?

It caught my eye, too. I think it's nice that this tool uses low-level
knowledge of Lucene segments to minimize the index churn - however, if I
understand it correctly, the resulting indexes it maintains are still
located on HDFS. Also, this doesn't address the extended concept of
"shard" in Nutch, which consists not only of a Lucene index but also
of binary content, parse data and parse text ...
From our point of view it would be useful if it were to move two steps
further, i.e. include the management of other binary data (no longer so
trivial, eh?), and then offer functionality to do this transparently
so that the shards end up on local filesystems of search servers, and
the low-level segment management is done there ...
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com