Hi David, Eric ~


I'm going to step in and offer a few clarifying comments about Constellate 
(https://constellate.org/, that JSTOR affiliated service Eric mentions!)



We are similar to, say, ProQuest's TDM Studio or Gale's Digital Scholar Lab in 
that we have a user interface for building datasets (with some visualizations) 
and a cloud based compute environment.  However, our model is a little 
different.  Constellate allows anyone in the world to build datasets of content 
and download them -- you do not need to participate in Constellate or have a 
subscription to the original content.  Content in Constellate is designated as 
either rights restricted or open.  When rights restricted data is included in a 
dataset, that dataset only includes non-consumptive data, whereas for open 
content, the dataset also includes the full-text.  Over 3 million documents in 
Constellate from JSTOR, Portico, and third party resources are open.  We also 
have a carve out for JSTOR rights restricted content, whereby after a formal 
request review, we will package that full-text up for researchers. These 
services are not part of our subscription service and are available to anyone.  
You can read more about what content is included and what its 
permissions<https://constellate.org/docs/data-sources> are, and about our dataset 
options<https://constellate.org/docs/dataset-options>.   We currently limit 
datasets to 25,000 for non-participants, but we are happy to help folks who 
need more content (most don't, however), and we are working to raise the UI 
limit to a larger number (probably around 200,000; it'll be the size above 
which the files simply become too big for most people).  We don't get a lot of 
requests for larger datasets, so it hasn't become enough of a priority to bump 
our other to-do list items.
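
To make the open versus rights restricted distinction concrete, here is a minimal Python sketch. It assumes that "non-consumptive data" means derived features such as word counts rather than readable text. The function and field names (to_record, unigram_counts, full_text) are hypothetical and are not Constellate's actual dataset schema; the dataset options page linked above is the place to check the real format.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_record(doc_id, full_text, is_open):
    """Build one dataset record (hypothetical field names).

    Every record carries word counts, which cannot be read as prose.
    Only open documents also carry the full text itself.
    """
    record = {
        "id": doc_id,
        "unigram_counts": dict(Counter(tokenize(full_text))),
    }
    if is_open:
        record["full_text"] = full_text
    return record


restricted = to_record("doc-1", "The cat saw the other cat.", is_open=False)
open_doc = to_record("doc-2", "Open access text.", is_open=True)
```

In this sketch the restricted record still supports counting and many kinds of modeling, but the original sentences cannot be read back out of it.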

Our primary focus is on teaching and learning - which is the real benefit of a 
Constellate subscription.  We believe that the ability to read, understand and 
communicate data as information is one of the most essential skills for the 
future of education and employment. Because of that, we sought to build a text 
analysis program that enables every librarian and faculty member in all 
disciplines to teach these skills. Constellate combines the content and tools 
users need to perform text analysis alongside a defined curriculum and robust 
tutorials, live classes taught by text analysis experts, and an inspiring and 
supportive community of users.  Our participants get to send their community 
members to our 
classes<https://constellate.org/events/constellate-class-intermediate-python> 
and use Constellate (including the cloud based compute environment) to teach!  
We don't actually see much research happening in the Constellate Lab.  As Eric 
said, most researchers want to do that work locally in their own environments.



If you'd like to learn more about Constellate or have questions, just let me 
know!



~ Amy


--
Amy J. Kirchhoff (she/her)
Constellate<https://constellate.org/> Business Manager / Portico, JSTOR
Twitter: @AmyPlusFour

Take your research further with text and data analysis skills!





-----Original Message-----
From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of Eric Lease 
Morgan
Sent: Tuesday, November 22, 2022 9:06 AM
To: CODE4LIB@LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] ProQuest TDM? Alternatives? Open source alternatives?





On Nov 21, 2022, at 5:12 PM, RD B <rdbea...@gmail.com> wrote:



> We (Kelvin Smith Library, Case Western Reserve University) are

> considering the ProQuest TDM Studio:

>

>   https://about.proquest.com/en/products-services/TDM-Studio/

>

> I was curious if anyone here had any direct experience with the system

> they could share, or if there were alternatives that the community

> recommends and why.

>

> --

> R. David Beales - rdbea...@gmail.com - 732-299-0390 Library, Earth,

> Sol System, Orion-Cygnus Arm of the Milky Way Galaxy, Laniakea

> Supercluster





A couple of years ago I experimented with TDM Studio, and I can report that it 
worked as advertised.



More specifically, Studio worked like the handful of similar services: one from 
LexisNexis, one from JSTOR, and the one from the HathiTrust. What does that 
mean? It means a person:



  1. searches the given collection

  2. results are subsetted to a secure location

  3. using tools and APIs provided by the vendor,

     analysis is done against the results

  4. results are exported

  5. repeat until done



Many times the tools and APIs require a working knowledge of the Python 
programming language, and then there is the curve of learning the specific 
tools. The tools usually include a number of modeling techniques: bibliography 
creation, ngram analysis, topic modeling, and full text searching. Having 
worked in this area for more than a few years now, I think these techniques 
ought to be considered rudimentary, and additional techniques such as the 
application of grammars, semantic indexing, and collocations ought to be included.
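
As a rough illustration of two of the techniques named above, the following Python sketch counts ngrams and ranks collocations using only the standard library. It is not taken from any vendor's toolkit. The function names are hypothetical, and scoring adjacent word pairs by pointwise mutual information (PMI) is just one common way to find collocations.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by an approximate PMI score.

    Pairs seen fewer than min_count times are skipped, because PMI
    tends to overrate rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(ngrams(tokens, 2))
    total = len(tokens)
    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log2(count * total / (unigrams[first] * unigrams[second]))
        scored.append(((first, second), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)


sample = ("text mining is fun and text mining is useful "
          "but mining data is hard")
tokens = tokenize(sample)
ranked = collocations(tokens)
```

Here `ranked[0][0]` is the pair that co-occurs most strongly relative to how often each word appears on its own. Libraries such as NLTK offer more complete versions of both techniques.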



All of the vendors have their hands tied by contract and copyright. Each vendor 
has made agreements with publishers not to freely share content, but it is not 
possible to do text mining, natural language processing, or data science with 
words sans the content. Consequently each vendor implements a variation on Step 
#2, above. The process would be a h3ll of a lot easier if the student, 
researcher, or scholar could:



  1. search content

  2. select items of interest

  3. download selected items sans click,

     save, click, save, click, save, etc.

  4. use a wide variety of GUI tools,

     command-line tools, or programming

     languages to do the analysis



Here licensing is probably the limiting factor, not copyright.



Do I know of open source alternatives? No, not really, but I hope my Reader 
addresses some of these problems. Given a set of files of just about any number 
and just about any ilk and saved in a local folder/directory, the Reader:



  * converts the files to plain text

  * does all sorts of feature extraction against the result

  * distills the features into a data set (a "study carrel")

  * provides the means to compute against the data set, and

    the computing could be done with GUI tools (like OpenRefine

    or AntConc), command-line tools (like grep or jq), or

    programming libraries (like Python's NLTK or spaCy)
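
The general shape of that workflow can be sketched in a few lines of Python. This is not the Reader's own code, and it skips the conversion step by assuming the folder already holds plain-text .txt files. The function names and the output columns are hypothetical, chosen only to show files going in and a small, computable data set coming out.

```python
import csv
import re
from collections import Counter
from pathlib import Path

FIELDS = ["file", "words", "unique_words", "top_word"]


def extract_features(text):
    """Compute a few simple features from one plain-text document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    top_word = counts.most_common(1)[0][0] if counts else ""
    return {
        "words": len(tokens),
        "unique_words": len(counts),
        "top_word": top_word,
    }


def build_dataset(folder, output_csv):
    """Distill every .txt file in a folder into one CSV of features."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        features = extract_features(path.read_text(encoding="utf-8"))
        rows.append({"file": path.name, **features})
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The resulting CSV file can then be opened in OpenRefine, searched with grep, or loaded into a Python program, which is the point of distilling the files into a data set first.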



In the end the Reader supports all of the modeling techniques alluded to above 
as well as a few others. Consequently, a person can search any vendor for 
content of interest, download the results (through click, click, click), and do 
analysis against the result.



Like all software, the Reader is never done and ought to be considered 
beta-ware. See:



  https://distantreader.org



HTH



P.S. David, nice signature.



--

Eric Lease Morgan

Navari Family Center for Digital Scholarship Hesburgh Libraries University of 
Notre Dame



574/631-8604

https://cds.library.nd.edu
