I'm a corpus linguist who uses R for just about everything I do, but I was
recently disappointed when an R script took four and a half hours to create
three frequency lists (single words, bigrams, and trigrams), even though it
uses a function that calls compiled C code at one bottleneck. Curious to see
whether Julia could do it faster, I set out to translate the script into
Julia. So far, my Julia script correctly gets the raw frequency of each word,
its normalized frequency (per million words), and its logged frequency.
What I'm having trouble with now is getting dispersion measurements,
specifically:
- Range: the number of files in which a word occurs at least once
- Stefan Gries' Deviation of Proportions (DP): a measurement that takes into
  account the relative size of the files that the words come from
In R code, the DP measurement is calculated as follows:
# all words from all files in a vector
corpus = c("hello", "hola", "howdy", "what's up", "hello", "hey", "hello",
  "hello", "hola", "hello")
# indicates which file each word comes from; has the same length as "corpus"
corpusParts = c("file1", "file1", "file1", "file1", "file1", "file2",
  "file2", "file2", "file3", "file3")
corpusPartSizes = c(5, 3, 2) # number of words in each file
word = "hello"
f = sum(corpus == word) # 5 (there are five "hello"s in the corpus)
# if "word" doesn't occur in corpus, move on to next word
if(f == 0) { return("NA"); break() }
l <- length(corpus) # 10 (total number of words in corpus)
# 0.5, 0.3, 0.2 (proportion that each file represents of the whole corpus)
s = corpusPartSizes / l
# 2, 2, 1 ("hello" occurs twice in file1, twice in file2, and once in file3)
v = rowsum(as.integer(corpus == word), corpusParts)
DP = sum(abs((v / f) - s)) / 2 # 0.1
"hello" has a range of 3 (it occurs in three files) and a DP of 0.1 in this
corpus.
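Spelled out for this toy corpus, the arithmetic behind those two numbers is as follows. The subscripted symbols v_i and s_i are just my shorthand for the per-file entries of v and s in the R code above, and n is the number of files (3 here):

```latex
\mathrm{Range} = \left|\{\, i : v_i > 0 \,\}\right| = 3

\mathrm{DP} = \frac{1}{2}\sum_{i=1}^{n}\left|\frac{v_i}{f} - s_i\right|
            = \frac{1}{2}\left(\left|\tfrac{2}{5} - 0.5\right|
                             + \left|\tfrac{2}{5} - 0.3\right|
                             + \left|\tfrac{1}{5} - 0.2\right|\right)
            = \frac{1}{2}\,(0.1 + 0.1 + 0) = 0.1
```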
The line I can't figure out how to translate into Julia is:
v = rowsum(as.integer(corpus == word), corpusParts)
I assume there is a better way to approach this problem than simply
translating my R script (nearly) line by line, but I don't know what that
might be right now. But, for what it's worth, below is my Julia script as
it now stands.
Thanks in advance for any help.
Earl Brown
PS: Why does uppercase("más") return "MáS" (or "M\ue1S" in Julia Studio)
rather than the expected "MÁS"?
######
# Julia script to loop over files and create a frequency list
using StatsBase
using DataFrames
cd("/Users/earlbrown/Corpora/United_States/California/Salinas/Textos/Finished/")
filesNames = filter(r"\.txt$"i, readdir())
outputFile = "/Users/earlbrown/Desktop/output_julia.csv"
# puts all lines into an array
strings = String[]
for i in 1:length(filesNames)
    #i = 2
    curFile = open(filesNames[i])
    curText = uppercase(readall(curFile))
    close(curFile)
    push!(strings, curText)
end # next file
# breaks up the lines and puts them into an array
words = Array[]
for i in 1:length(strings)
    curWds = split(strings[i], r"[^a-záéíóúüñ']+"i)
    push!(words, curWds)
end # next element
# gets frequency of words
words = join(words, " ")
words = split(words, r"\n+")
words = countmap(words)
# gets normalized frequency
wds = collect(keys(words))
freqs = collect(values(words))
normFreq = (freqs / sum(freqs)) * 1000000
# sorts words by frequency
df = @DataFrame(WORD => wds, FREQ => freqs, NORM => normFreq)
df = df[sortperm(df["FREQ"], rev=true), :]
# saves headers to output file
f = open(outputFile, "w")
headings = "RANK\tWORD\tFREQ\tLOG\tNORM"
write(f, headings, "\n")
close(f)
# saves words and their frequencies to output file
sep = "\t"
f = open(outputFile, "a")
rank = 0
for i in 1:length(wds)
    if (df[i, 1] == "")
        continue
    end
    rank = rank + 1
    curWd = df[i, 1]
    curFreq = df[i, 2]
    curLog = log(curFreq)
    curNorm = df[i, 3]
    output = string(rank, sep, curWd, sep, curFreq, sep, curLog, sep,
        curNorm)
    write(f, output, "\n")
end
close(f)
println("All done!")
######