To stop word, or not to stop word? That is the question.

Seriously, I am working with a team of people to index and analyze a set of 
65,000 to 100,000 full-text scientific journal articles, all on the topic of 
COVID-19. [1] We have indexed the data set and created subsets of the data, 
affectionately called "study carrels". Each study carrel is characterized by a 
short name and a few bibliographic-like features. [2] Within each study carrel 
are a number of different analyses, such as ngram frequencies, 
parts-of-speech enumerations, and topic modeling.

Each article in each carrel also has a set of "keywords" extracted from it. 
These keywords are computed, and for all intents and purposes, the computation 
is pretty good. For example, see the set of keywords from a particular 
carrel. [3] Unfortunately, many of the study carrels have very, very similar 
sets of keywords. Again, if you peruse the set of all the carrels [2] you will 
see a preponderance of keywords such as "cell", "covid-19", "SARS", and 
"patient". These words occur so frequently that they become (almost) 
meaningless.

My question to y'all is, "When and where should I add something like 'cell', 
or better yet 'covid-19', to my list of stopwords?"
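
To make the question concrete, here is a minimal sketch of one way the 
decision could be framed: count how many carrels each keyword appears in, and 
treat any keyword found in most of them as a candidate stopword. The carrel 
names, keyword lists, function name, and the 0.8 threshold below are all 
made up for illustration; none of them come from the actual data set.

```python
from collections import Counter


def candidate_stopwords(carrel_keywords, threshold=0.8):
    """Return keywords that appear in at least `threshold` of the carrels.

    carrel_keywords maps a carrel name to its list of keywords.
    The threshold is an arbitrary assumption, not a recommended value.
    """
    total = len(carrel_keywords)
    counts = Counter()
    for keywords in carrel_keywords.values():
        # Count each keyword once per carrel, ignoring case.
        counts.update({k.lower() for k in keywords})
    return sorted(k for k, n in counts.items() if n / total >= threshold)


# Hypothetical carrels and keywords, for illustration only.
carrels = {
    "carrel-a": ["cell", "covid-19", "patient", "vaccine"],
    "carrel-b": ["cell", "covid-19", "SARS", "mask"],
    "carrel-c": ["covid-19", "patient", "cell", "ventilator"],
}

print(candidate_stopwords(carrels, threshold=0.8))
```

With this toy data only "cell" and "covid-19" cross the threshold, since they 
show up in every carrel. Lowering the threshold would pull in "patient" as 
well, which is really the same "when and where" judgment call in numeric form.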


[1] data set of articles - https://www.semanticscholar.org/cord19
[2] study carrels - https://cord.distantreader.org/carrels/INDEX.HTM
[3] example keywords - https://cord.distantreader.org/carrels/kaggle-risk-factors/index.htm#keywords

--
Eric Morgan
