[CODE4LIB] HathiTrust Research Center Extracted Features 2.0

Downie, J Stephen Thu, 04 Jun 2020 13:56:53 -0700

Hi colleagues:

Because many of us teach or lead various text analytics and data mining classes 
and projects, some might find this open data set helpful.


Please share widely. The dataset was created to be used by all and sundry in 
and out of the classroom.

Discoveries await!

Cheers,
Stephen
************************************
HTRC is excited to announce the release of the Extracted Features 2.0 dataset! 
This new version of Extracted Features offers volume- and page-level data for 
17+ million volumes in the HathiTrust Digital Library. The data include:

  *   Bibliographic metadata
  *   Computationally inferred metadata about the page, such as language and 
line counts
  *   Tokens (words), parts of speech, and their per-page counts

Overall, the dataset represents more than 6 billion pages of text from the 
digital library and includes nearly 3 trillion tokens from the corpus.
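
For anyone planning to use per-page token counts in a class exercise, here is a minimal Python sketch of how they might be summed into volume-level counts. The sample record is hand-written for illustration and is not real HathiTrust data. The field names (features, pages, body, tokenPosCount) are assumptions modeled on the description above, so please check them against the published data schema before relying on them.

```python
import json
from collections import Counter

# Hypothetical, hand-written sample shaped like one volume-level file.
# Field names are assumptions; verify them against the published schema.
SAMPLE = """
{
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001", "lineCount": 2,
       "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
      {"seq": "00000002", "lineCount": 1,
       "body": {"tokenPosCount": {"whale": {"NN": 1}, "swims": {"VBZ": 1}}}}
    ]
  }
}
"""


def page_token_counts(page):
    """Collapse part-of-speech counts into one count per token for a page."""
    counts = Counter()
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        counts[token] += sum(pos_counts.values())
    return counts


def volume_token_counts(volume):
    """Sum the per-page token counts across every page in the volume."""
    total = Counter()
    for page in volume["features"]["pages"]:
        total.update(page_token_counts(page))
    return total


volume = json.loads(SAMPLE)
totals = volume_token_counts(volume)
print(totals["whale"])  # 3: two occurrences on page 1, one on page 2
```

Because the counts are already aggregated per page, an exercise like this needs no tokenizer or tagger of its own.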

Not only does this release extend the number of volumes in HathiTrust available 
as Extracted Features, it also incorporates linked data such that names in the 
files are linked to external authorities when possible.

Learn more about the release and data schema: 
https://wiki.htrc.illinois.edu/x/kYC2B
Download Extracted Features 2.0 files: 
https://wiki.htrc.illinois.edu/x/_QGGAQ

Contact [email protected] with any questions.
