(Apologies for cross-posting)
Version 2.0 of the HPLT Datasets is now published, with web-derived
corpora in 193 languages.
These collections are available under the Creative Commons CC0 license
and bring significant improvements over the previous release
(version 1.2). As with 1.2, the release comes in two variants:
de-duplicated (21 TB in size) and cleaned (15 TB in size). The cleaned
variant contains the same documents as the de-duplicated one minus those
filtered out by our cleaning heuristics. The cleaned variant is the
recommended one, unless you want to try your own cleaning pipelines.
Download the corpora here:
https://hplt-project.org/datasets/v2.0
As with the previous releases, the version 2.0 datasets are hosted by
the Sigma2 NIRD Data Lake (https://www.sigma2.no/service/nird-data-lake),
and the text extraction pipeline was run on the LUMI supercomputer
(https://lumi-supercomputer.eu/).
*What's new*
- The size of the source web collections has increased 2.5x: 4.5
petabytes of compressed web data in total, mostly from the Internet
Archive (https://archive.org/), but also from Common Crawl
(https://commoncrawl.org/).
- The text extraction pipeline now uses Trafilatura
(https://trafilatura.readthedocs.io/), which results in more effective
boilerplate removal and thus less noise in the data.
- Language identification now uses a refined version of OpenLID
(https://aclanthology.org/2023.acl-short.75/).
- This, in turn, allowed us to publish data in 193 languages, compared
to 75 languages in version 1.2.
- We switched from two-letter ISO 639-1 language codes to three-letter
ISO 639-3 language codes, augmented with a suffix denoting the writing
system. For example, `pol_Latn` is Polish written in Latin script.
A mapping from the old to the new codes is available at
https://github.com/hplt-project/warc2text-runner/blob/main/stats/_langs/langs_HPLTv2.tsv
(see the code-mapping sketch after this list).
- The documents are now annotated with their compliance with the
robots.txt file of the original website. This metadata field can be used
to filter out documents explicitly forbidden for crawling by website
owners, making the resulting corpora somewhat less prone to copyright
violations (see the filtering sketch after this list). The cleaned
variant contains only robots.txt-compliant documents. More details at
https://github.com/hplt-project/monotextor-slurm/blob/main/README.md#robotstxt-compliance
- De-duplication is done at the collection level, not at the dataset level.
- Documents have also been annotated for personally identifiable
information (PII) with the multilingual-PII-tool
(https://github.com/mmanteli/multilingual-PII-tool). Matches are
reported as Unicode character offsets into the document text
(see the PII-masking sketch after this list).
- Segment-level language-model-based scores have been replaced by
document-level quality scores computed with web-docs-scorer (also used
in the filtering sketch after this list).
- Filtering and cleaning criteria have been simplified
(https://github.com/hplt-project/monotextor-slurm?tab=readme-ov-file#filters).
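For convenience, here is a minimal Python sketch of converting old v1.2
language codes to the new v2.0 codes using the mapping TSV linked above.
The column layout (old code first, new code second) is an assumption on
my side, so check the actual file before relying on it.

    # Sketch: map old HPLT v1.2 language codes to the new v2.0 codes.
    # Assumption: langs_HPLTv2.tsv has tab-separated columns with the old
    # code first and the new code second -- verify against the real file.
    import csv

    def load_code_mapping(path="langs_HPLTv2.tsv"):
        mapping = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) >= 2 and not row[0].startswith("#"):
                    mapping[row[0]] = row[1]
        return mapping

    codes = load_code_mapping()
    print(codes.get("pl"))  # expected: something like 'pol_Latn'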
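Here is a similar sketch of filtering a JSONL shard on robots.txt
compliance and the web-docs-scorer quality score. The field names
('robotstxt', 'doc_scores', 'text'), the value "allowed", the score
threshold, and the compression format are assumptions for illustration
only; the actual schema is described in the monotextor-slurm README
linked above.

    # Sketch: keep robots.txt-compliant documents above a quality threshold.
    # Assumptions: one JSON document per line; a 'robotstxt' field whose
    # value is "allowed" for compliant documents; a 'doc_scores' list whose
    # first entry is the overall web-docs-scorer score; a 'text' field.
    # Real shards may be zstd-compressed rather than gzip-compressed.
    import gzip
    import json

    def filter_documents(path, min_score=5.0):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                if doc.get("robotstxt") != "allowed":
                    continue
                scores = doc.get("doc_scores") or [0.0]
                if scores[0] >= min_score:
                    yield doc

    # Hypothetical shard name, used only to show the call pattern.
    for doc in filter_documents("pol_Latn_1.jsonl.gz"):
        print(len(doc["text"]))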
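And here is a minimal sketch of masking PII spans reported as Unicode
character offsets. The [start, end) offset convention and the shape of
the annotations are assumptions; check the multilingual-PII-tool
repository for the exact output format.

    # Sketch: mask PII spans reported as Unicode character offsets.
    # Assumption: spans are [start, end) pairs into the document text.
    def mask_pii(text, spans, placeholder="[PII]"):
        # Replace spans from the end so earlier offsets stay valid.
        for start, end in sorted(spans, reverse=True):
            text = text[:start] + placeholder + text[end:]
        return text

    print(mask_pii("Contact me at [email protected] today.", [[14, 30]]))
    # -> Contact me at [PII] today.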
HPLT Monolingual Datasets version 2.0 (the de-duplicated variant)
feature about 7.6 trillion whitespace-separated words and about 52
trillion characters extracted from 21 billion documents, compared to 5.6
trillion words and 42 trillion characters extracted from 5 billion
documents in version 1.2. All in all, you can expect less noise and
boilerplate, fewer duplicates, more unique documents, and generally
better-quality texts to train language models on or to use for other NLP
tasks.
*How was this dataset produced?*
You may want to read section 3 of our deliverable HPLT pipelines and
tools
(https://hplt-project.org/HPLT_D7_2___HPLT_pipelines_and_tools.pdf) for
a full description of how we produced this dataset. If you don't have
much time for reading, this chart may be enough for your purposes:
https://hplt-project.org/_next/static/media/dataset-pipeline-light.c2521ee1.svg
Each language is accompanied by an HPLT Analytics report. These
automated reports provide useful information and statistics about the
cleaned variant of the HPLT v2.0 datasets. They are the result of
running the HPLT Analytics Tool
(https://github.com/hplt-project/data-analytics-tool) on the data, and
they are helpful for inspecting the datasets even before downloading
them.
*What is HPLT?*
HPLT (High Performance Language Technologies) is an EU Horizon Europe
funded project which aims to collect large quantities of data in many
languages and to train powerful and efficient language and translation
models. An important feature of HPLT is openness and transparency: all
the artifacts of the project are publicly available under permissive
licenses.
https://hplt-project.org/
--
Andrey
Language Technology Group (LTG)
University of Oslo