Vladimir Olenin wrote:
> - is there a place I can get already crawled internet web pages in an
>   archive (10 - 100Gb of data)
I don't know the sizes of the corpora mentioned on the Lucene Wiki's Resources page, but it's a good place to start:

<http://wiki.apache.org/jakarta-lucene/Resources#head-2bc05beb8e8c53c9f655491ea834661026f4e522>

Also, the International Conference on Weblogs and Social Media (ICWSM) is making available a dataset of crawled blogs. It comes as a set of compressed XML files, each quite large (~2GB uncompressed), that wrap the original HTML files along with some associated metadata; the whole set amounts to about 40GB uncompressed:

<http://www.icwsm.org/data.html>

Steve
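P.S. In case it's useful, here's a rough sketch of how you might stream one of those big XML files without loading a whole ~2GB file into memory. The element names ("post", "html"), the file name, and the use of gzip are all assumptions on my part -- check them against the actual schema and compression that ship with the dataset:

import gzip
import xml.etree.ElementTree as ET

def extract_html(path):
    """Yield the wrapped HTML payloads from a gzipped XML dump."""
    with gzip.open(path, "rb") as f:
        # iterparse keeps memory usage flat even on very large files
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "post":            # hypothetical wrapper element
                html = elem.findtext("html")  # hypothetical child element
                if html:
                    yield html
                elem.clear()                  # free the parsed subtree

for doc in extract_html("blogs-part-001.xml.gz"):  # hypothetical file name
    ...  # feed each HTML document to your indexer, write it to disk, etc.

The important bit is ET.iterparse plus elem.clear(): together they let you process the file as a stream instead of building the full DOM, which matters at these sizes.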