Well, that crashed overnight, so now I have to restart anyway. Might as 
well do what I can to mitigate this problem before I make another attempt.

One thing I could do is to keep only the columns I need before 
reading the data files. Each line (except the header line) consists of 14 
floats, and I really only need three of those. I can pick out the columns I 
need with awk:

$ awk -F',' -v OFS=',' '{print $1,$2,$4}' trace-0.txt
Time,R,Z
0,4.714106791975944,1.021331973151819
2.45175e-08,4.714548484465481,1.02633406895731
etc...

Running this from Julia as-is is no problem: I just wrap it in 
run(`awk ... `). However, I can't figure out how to redirect the output to 
an IO object that I can read into a DataFrame. I've read the documentation 
section on running external programs, and skimmed the source in 
base/process.jl, but I'm not much wiser than I was before... My 
understanding is that something like this should be possible:

io = createAnIOObjectSomehow()
writeall(`awk ... `, io)
df = readtable(io)

but I have no idea how to create the IO object to make this work. I tried

trace = "trace-0.txt"
awkcmd = `awk -F',' -v OFS=',' '{print $1,$2,$4}' $trace`
io = Base.Pipe()
writeall(awkcmd, io)

but didn't get further than

ERROR: could not spawn `awk -F, -v OFS=, '{print $1, $2, $4}' trace-0.txt`: 
invalid argument (EINVAL)
 in _jl_spawn at process.jl:217
 in spawn at process.jl:348
 in writesto at process.jl:410

Just doing run(awkcmd) works fine. How do I do this?
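
For reference, a minimal sketch of one way to capture the awk output, assuming a current Julia (1.x) rather than the 2014 API used above. It reads the command's stdout into a String, wraps that in an IOBuffer, and parses it with the DelimitedFiles standard library instead of readtable. The temporary file and its four-column sample data are hypothetical stand-ins for a real trace file.

```julia
using DelimitedFiles  # standard library in Julia 1.x (assumption: not the 2014 setup)

# Hypothetical sample file so the sketch is self-contained;
# the real trace files have 14 columns.
path, f = mktemp()
write(f, "Time,R,Phi,Z\n0,4.7,0.1,1.0\n2.5e-08,4.8,0.2,1.1\n")
close(f)

# Single quotes inside the backticks keep Julia from interpolating $1, $2, $4;
# $path is outside the quotes, so it is interpolated.
awkcmd = `awk -F',' -v OFS=',' '{print $1,$2,$4}' $path`

# Run the command, collect its stdout as a String, and expose it as an IO object.
io = IOBuffer(read(awkcmd, String))

# Parse the comma-separated text; the header row is returned separately.
data, header = readdlm(io, ',', Float64; header = true)

rm(path)
```

Here data is a Float64 matrix with one column per selected field, and header holds the column names. For large outputs, the do-block form open(awkcmd) do stream ... end should allow reading from the process directly instead of holding the whole output in a String first.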

// T


On Tuesday, May 27, 2014 5:31:41 PM UTC+2, Tomas Lycken wrote:
>
> I started a Julia script that processes a very large set of data, by 
> reading a large number (100k) of quite small text files, doing some 
> calculations, and aggregating the results. After running for a while I've 
> noticed that there seem to be some memory management issues, which I 
> suspect are just inefficient garbage collection. With some pseudo-elements, 
> my script does something like this:
>
> function process_all_the_stuff()
>     results1 = Float64[]
>     results2 = Float64[]
>     for i in 1:100000
>         thisdata = read_text_file_with_index(i)
>         thisresult1 = do_calculation_1(thisdata)
>         thisresult2 = do_calculation_2(thisdata)
>         push!(results1, thisresult1)
>         push!(results2, thisresult2)
>     end
>     results1, results2
> end
>
> I've come about halfway, and htop looks like this:
>
>
> <https://lh3.googleusercontent.com/-rFSwZ9UtvIg/U4SvG5EL4xI/AAAAAAAAAMY/QYYbNCv-6l0/s1600/htop.png>
>
> As you see, I'm about to run out of memory. Is there any way I can 
> "inject" a call to gc(), say, at the end of the loop body, without 
> interrupting the script and losing all the work done so far? Or will Julia 
> do so, when (if) she realizes memory is (too) scarce?
>
> If there isn't a way to do this, see this as the first step toward a 
> feature request :P
>
> // Tomas
>
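
A minimal sketch of the "call gc() at the end of the loop body" idea from the quoted message, assuming a current Julia (1.x), where the call is spelled GC.gc(). It only helps on the next run, since it has to be in the script before it starts. The names process_one and gc_every are hypothetical; process_one stands in for reading one file and doing the calculations.

```julia
# Hypothetical stand-in for reading one file and computing a result from it.
process_one(i) = sum(rand(1000))

function process_all(n; gc_every = 1000)
    results = Float64[]
    sizehint!(results, n)          # reserve space for n results up front
    for i in 1:n
        push!(results, process_one(i))
        # Force a full collection every gc_every iterations (assumed interval).
        if i % gc_every == 0
            GC.gc()
        end
    end
    return results
end
```

Collecting on every iteration would likely slow the loop down noticeably, which is why the sketch only collects every gc_every iterations.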
