Hi all,

I'm looking for documentation about web repository architectures and search engines' storage modules in general. I found Nutch while searching the web, and I congratulate Nutch's developers on their great work.

I have read the available documentation on how Nutch stores crawled objects, both locally and in a distributed way (NDFS). As part of a university project, though, I'm looking for more documents about storage, even ones not related to Nutch, and I'm writing here in the hope that someone has some good links to share. I have already read the WebBase paper, which studies the possible storage solutions for a web base [1], and the description of the Internet Archive ARC file format and its usage [2], and I'm interested in material along those lines.

Many thanks to anyone who can help. I hope my request doesn't sound off-topic, as understanding the current state of the art in web storage can help Nutch too.

[1] http://dbpubs.stanford.edu:8090/pub/1999-26
[2] http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionProposal

Bye.
-- Francesco
