I'm a corpus linguist who uses R for just about everything I do, but was 
recently disappointed when an R script took four and a half hours to create 
three frequency lists of single words, bigrams, and trigrams, despite using 
a function which calls compiled C code in one bottleneck. Curious to see if 
Julia could do it faster, I set out to translate the script into Julia. So 
far, my Julia script correctly gets the frequency of words, normalized 
frequency (per million words), and logged frequency. 

What I'm having trouble with now is getting dispersion measurements, 
specifically:
- Range: the number of files in which a word occurs at least once
- Stefan Gries' Deviation of Proportions (DP): a measurement that takes 
  into account the relative size of the files that the words come from

In R code, the DP measurement is calculated as follows:
# all words from all files in a vector
corpus = c("hello", "hola", "howdy", "what's up", "hello", "hey", "hello",
           "hello", "hola", "hello")
# indicates from which file each word comes; has same length as "corpus"
corpusParts = c("file1", "file1", "file1", "file1", "file1", "file2",
                "file2", "file2", "file3", "file3")
corpusPartSizes = c(5, 3, 2) # number of words in each file

word = "hello"
f = sum(corpus == word) # 5 (there are five "hello"s in the corpus)
# if "word" doesn't occur in corpus, move on to next word
if(f == 0) { return("NA"); break() }
l <- length(corpus) # 10 (total number of words in corpus)
# 0.5, 0.3, 0.2 (proportion that each file represents of the whole corpus)
s = corpusPartSizes / l
# 2, 2, 1 ("hello" occurs twice in file1, twice in file2, and once in file3)
v = rowsum(as.integer(corpus == word), corpusParts)
DP = sum(abs((v / f) - s)) / 2 # 0.1

"hello" has a range of 3 (it occurs in three files) and a DP of 0.1 in this 
corpus.
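
Spelled out as a formula, with v, f, and s as in the R code above and n 
standing for the number of files (n is my own label, not part of the 
script), the arithmetic for "hello" in this toy corpus is:

```latex
DP = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{v_i}{f} - s_i \right|
   = \frac{1}{2} \left( |0.4 - 0.5| + |0.4 - 0.3| + |0.2 - 0.2| \right)
   = \frac{1}{2} (0.1 + 0.1 + 0)
   = 0.1
```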

The line I can't figure out how to translate into Julia is:
v = rowsum(as.integer(corpus == word), corpusParts)
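
For reference, here is a minimal sketch of what that line would need to do 
in Julia. It assumes a current Julia 1.x and uses only Base functions. The 
helper name rowsum_counts and the variable wordRange are hypothetical 
(neither is a Base or package function), and this is one possible 
translation rather than a tested or optimized answer:

```julia
# Toy data copied from the R example above
corpus = ["hello", "hola", "howdy", "what's up", "hello", "hey", "hello",
          "hello", "hola", "hello"]
corpusParts = ["file1", "file1", "file1", "file1", "file1", "file2",
               "file2", "file2", "file3", "file3"]
corpusPartSizes = [5, 3, 2]

# Hypothetical helper: counts how often `word` occurs in each part,
# returning the counts in sorted order of the part labels
function rowsum_counts(corpus, parts, word)
    labels = sort(unique(parts))
    index = Dict(label => i for (i, label) in enumerate(labels))
    v = zeros(Int, length(labels))
    for (w, p) in zip(corpus, parts)
        if w == word
            v[index[p]] += 1
        end
    end
    return v
end

word = "hello"
v = rowsum_counts(corpus, corpusParts, word)  # [2, 2, 1]
f = sum(v)                                    # 5
s = corpusPartSizes ./ length(corpus)         # proportion of corpus per file
DP = sum(abs.(v ./ f .- s)) / 2               # approximately 0.1
wordRange = count(x -> x > 0, v)              # 3 (files with at least one hit)
```

Because DP is computed in floating point, the result may differ from 0.1 
in the last digits.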

I assume there is a better way to approach this problem than simply 
translating my R script (nearly) line by line, but I don't know what that 
might be right now. But, for what it's worth, below is my Julia script as 
it now stands.

Thanks in advance for any help. Earl Brown

PS: Why does uppercase("más") return "MáS" (or "M\ue1S" in Julia Studio) 
rather than the expected "MÁS"?

######
# Julia script to loop over files and create a frequency list

using StatsBase
using DataFrames

cd("/Users/earlbrown/Corpora/United_States/California/Salinas/Textos/Finished/")
filesNames = filter(r"\.txt$"i, readdir())
outputFile = "/Users/earlbrown/Desktop/output_julia.csv"

# puts all lines into an array
strings = String[]
for i in 1:length(filesNames)
    #i = 2
    curFile = open(filesNames[i])
    curText = uppercase(readall(curFile))
    close(curFile)
    push!(strings, curText)
end # next file

# breaks up the lines and puts them into an array
words = Array[]
for i in 1:length(strings)
    curWds = split(strings[i], r"[^a-záéíóúüñ']+"i)
    push!(words, curWds)
end # next element

# gets frequency of words
words = join(words, " ")
words = split(words, r"\n+")
words = countmap(words)

# gets normalized frequency
wds = collect(keys(words))
freqs = collect(values(words))
normFreq = (freqs / sum(freqs)) * 1000000

# sorts words by frequency
df = @DataFrame(WORD => wds, FREQ => freqs, NORM => normFreq)
df = df[sortperm(df["FREQ"], rev=true), :]

# saves headers to output file
f = open(outputFile, "w")
headings = "RANK\tWORD\tFREQ\tLOG\tNORM"
write(f, headings, "\n")
close(f)

# saves words and their frequencies to output file
sep = "\t"
f = open(outputFile, "a")
rank = 0
for i in 1:length(wds)
    if (df[i, 1] == "")
       continue
    end
    rank = rank + 1
    curWd = df[i, 1]
    curFreq = df[i, 2]
    curLog = log(curFreq)
    curNorm = df[i, 3]
    output = string(rank, sep, curWd, sep, curFreq, sep, curLog, sep, curNorm)
    write(f, output, "\n")
end
close(f)

println("All done!")
######
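
One possible restructuring, sketched under assumptions rather than offered 
as a finished solution: instead of merging all files into one word list, 
keep a per-file count vector for every word, so that frequency, range, and 
DP all come from a single pass over the texts. The sketch assumes a current 
Julia 1.x and only Base functions. The names texts, tokenize, and 
dispersion_table are hypothetical, the in-memory strings stand in for the 
real files, and the text is folded to lowercase here (the script above uses 
uppercase):

```julia
# Hypothetical stand-in for the contents of three files
texts = ["hello hola howdy sup hello", "hey hello hello", "hola hello"]

# Splits a text on runs of characters that are not letters or apostrophes
tokenize(text) = split(lowercase(text), r"[^a-záéíóúüñ']+"; keepempty=false)

function dispersion_table(texts)
    nparts = length(texts)
    partSizes = zeros(Int, nparts)            # number of words in each file
    counts = Dict{String, Vector{Int}}()      # word => count in each file
    for (i, text) in enumerate(texts)
        for w in tokenize(text)
            partSizes[i] += 1
            v = get!(() -> zeros(Int, nparts), counts, String(w))
            v[i] += 1
        end
    end
    s = partSizes ./ sum(partSizes)           # proportion of corpus per file
    # For each word: total frequency, range, and Deviation of Proportions
    return Dict(w => (freq = sum(v),
                      range = count(x -> x > 0, v),
                      dp = sum(abs.(v ./ sum(v) .- s)) / 2)
                for (w, v) in counts)
end

table = dispersion_table(texts)
println(table["hello"])   # freq 5, range 3, dp approximately 0.1
```

Each word's count vector is created once and updated in place, so the 
corpus is never rescanned word by word as in the R version.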
