Scott Green wrote:
Hi list,
First, I don't know whether the nutch-dev mailing list is the right place
for this topic. If I have posted in the wrong place, please tell me where
I should ask this question. Thanks.
The question is: how can resources be indexed in real time in Nutch? The
question was prompted by Gmail. I don't know exactly what is behind Gmail,
but it is presumably built on GFS. When I receive or send an email and
click "Search Mail" immediately afterwards, it always finds the message.
I'd appreciate it if someone could explain how Gmail works.
Also, is there any advice on how to hack Nutch/Hadoop to achieve this? Thanks
Hi,
Most of the projects at Google use a scalable distributed storage system
called Bigtable. Orkut, Google Earth, Google Finance, and Writely are
reported to use it, and I suppose Gmail uses Bigtable as well. Bigtable is
built on top of GFS and designed to scale to the petabyte level, and they
are working to take it to the next level.
As far as I know, in Nutch you have to either rebuild the index each time
or merge indexes, so there is no online (real-time) index building.
Consider asking this on the Lucene mailing list.
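
To make the "merge the indexes" option a bit more concrete, here is a minimal sketch in Python of the general idea: build a small index for only the new documents and merge it into the existing one, rather than rebuilding everything. This is not the Nutch or Lucene API; the function names (build_index, merge_indexes, search) and the toy documents are hypothetical and only meant to illustrate the approach.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index (term -> sorted doc ids) from {doc_id: text}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_indexes(main, delta):
    """Return a new index whose posting lists combine those of both inputs."""
    merged = {term: list(ids) for term, ids in main.items()}
    for term, ids in delta.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(ids))
    return merged


def search(index, term):
    """Look up a single term; returns the matching doc ids (possibly empty)."""
    return index.get(term.lower(), [])


# Existing index, built once from the documents already crawled (hypothetical data).
main_index = build_index({1: "nutch crawls the web", 2: "lucene builds the index"})

# A new document arrives: index only that document, then merge.
delta_index = build_index({3: "new mail about lucene"})
main_index = merge_indexes(main_index, delta_index)
```

After the merge, search(main_index, "lucene") finds both the old and the new document. The new document only becomes searchable once the merge has run, so how "real time" this feels depends on how often the small index is built and merged.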