Wikipedia XML Corpus for research
Ludovic DENOYER
LIP6 - University of Paris 6
http://www-connex.lip6.fr/~denoyer/wikipediaXML
Technical report (currently a draft): http://www-connex.lip6.fr/~denoyer/homepage/publications/TECHREP2006.pdf

=============

This is an announcement of the release of a set of large XML document collections. These collections may be of interest to the Information Retrieval community and to the Machine Learning community. They have been developed as a joint project between the DELOS and PASCAL Networks of Excellence.

=============

We propose a large set of XML collections based on Wikipedia. These collections can be used in a wide variety of XML IR and Machine Learning tasks, such as ad-hoc retrieval, categorization, clustering, or structure mapping. These corpora are used, for example, in the INEX 2006 competition (http://inex.is.informatik.uni-duisburg.de/2006) and in the XML Document Mining Challenge (http://xmlmining.lip6.fr).

Brief description of the collections:
- 8 different languages: English, German, French, Dutch, Spanish, Chinese, Arabic, Japanese
- 660,000 documents in the English collection
- All documents are organized in a hierarchy of categories
- Some collections have been built for the comparison of categorization/clustering algorithms
- Multimedia collection (more than 300,000 pictures)
- Entity collection

Other collections (Cross-Language, NLP Collection) will be provided soon.

More information is available on the web site: http://www-connex.lip6.fr/~denoyer/wikipediaXML

Best regards,

Ludovic DENOYER
Assistant Professor
http://www-connex.lip6.fr/~denoyer
