Vladimir Olenin wrote:
> - is there a place I can get already crawled internet web pages in an
> archive (10 - 100Gb of data)

I don't know the sizes of the corpora mentioned on the Lucene Wiki's
Resources page, but it's a good place to start:

<http://wiki.apache.org/jakarta-lucene/Resources#head-2bc05beb8e8c53c9f655491ea834661026f4e522>

Also, the International Conference on Weblogs and Social Media (ICWSM)
is making a dataset of crawled blogs available.  It comes as a set of
really big compressed XML files (roughly 2GB each uncompressed) that
wrap the original HTML files with some associated metadata; the whole
dataset amounts to about 40GB uncompressed:

   <http://www.icwsm.org/data.html>
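
If it helps, here's a minimal sketch of how you might pull the wrapped
HTML out of one of those files for indexing.  Since each file is ~2GB
uncompressed, you want a streaming parser (StAX, which ships with the
JDK) rather than a DOM.  Two assumptions to check against the actual
ICWSM schema before using this: I'm guessing the files are gzipped,
and I'm guessing at a per-document element named "doc".

    import java.io.FileInputStream;
    import java.util.zip.GZIPInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class BlogDumpReader {
        public static void main(String[] args) throws Exception {
            // Stream the file instead of building a DOM: the
            // uncompressed XML is far too large to hold in memory.
            // Assumes gzip compression (unverified).
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new GZIPInputStream(new FileInputStream(args[0])));

            StringBuilder html = new StringBuilder();
            boolean inDoc = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "doc".equals(reader.getLocalName())) {
                    // "doc" is a hypothetical element name; check
                    // the real schema on the ICWSM data page.
                    inDoc = true;
                    html.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS
                        && inDoc) {
                    html.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "doc".equals(reader.getLocalName())) {
                    inDoc = false;
                    // Hand the recovered HTML to your indexer here.
                    System.out.println("got document, "
                            + html.length() + " chars");
                }
            }
            reader.close();
        }
    }

The same loop is also the natural place to read whatever metadata
attributes the wrapper elements carry, if you want to index those
alongside the page content.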

Steve
