Well, that crashed overnight, so now I have to restart anyway. Might as
well do what I can to mitigate this problem before I make another attempt.
One thing I could do is filter out just the columns I need before reading
the data files. Each line (except the header) consists of 14 floats, and I
only need three of them. I can pick out those columns with awk:
$ awk -F',' -v OFS=',' '{print $1,$2,$4}'
Time,R,Z
0,4.714106791975944,1.021331973151819
2.45175e-08,4.714548484465481,1.02633406895731
etc...
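To sanity-check the column numbers, here's the same awk invocation on a made-up two-line sample (only Time, R and Z match the real header; the third column name and the 4-column width are placeholders for the real 14-column files):

```shell
# Fabricated stand-in for a trace file; columns 1, 2 and 4
# carry Time, R and Z, as in the real data.
printf 'Time,R,X,Z\n0,4.714,0.5,1.021\n' |
    awk -F',' -v OFS=',' '{print $1,$2,$4}'
# prints:
# Time,R,Z
# 0,4.714,1.021
```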
Running this from Julia as-is is no problem - I just wrap it in
run(`awk ...`). However, I can't figure out how to redirect its output to
an IO object that I can read into a DataFrame. I've read the documentation
section on running external programs, and skimmed the source in
base/process.jl, but I'm not much wiser than I was before... My
understanding is that something like this should be possible:
io = createAnIOObjectSomehow()
writeall(`awk ... `, io)
df = readtable(io)
but I have no idea how to create the IO object to make this work. I tried
trace = "trace-0.txt"
awkcmd = `awk -F',' -v OFS=',' '{print $1,$2,$4}' $trace`
io = Base.Pipe()
writeall(awkcmd, io)
but got no further than
ERROR: could not spawn `awk -F, -v OFS=, '{print $1, $2, $4}' trace-0.txt`:
invalid argument (EINVAL)
in _jl_spawn at process.jl:217
in spawn at process.jl:348
in writesto at process.jl:410
Just doing run(awkcmd) works fine. How do I do this?
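For reference, the closest I've gotten to the pattern above is to skip the explicit pipe entirely and capture the command's output as a string first. This is an unverified sketch: it assumes readall accepts a Cmd (I believe it does) and that readtable accepts an IO object such as an IOBuffer (which I'm less sure about). The sample file here is a 4-column stand-in for the real 14-column traces:

```julia
using DataFrames

# Write a tiny stand-in trace file so the sketch is self-contained.
trace = "trace-sample.txt"
open(trace, "w") do f
    write(f, "Time,R,X,Z\n0,4.714,0.5,1.021\n")
end

awkcmd = `awk -F',' -v OFS=',' '{print $1,$2,$4}' $trace`

# readall runs the command and captures its stdout as a String;
# IOBuffer then presents that string as a readable IO object.
csv = readall(awkcmd)
df  = readtable(IOBuffer(csv))
```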
// T
On Tuesday, May 27, 2014 5:31:41 PM UTC+2, Tomas Lycken wrote:
>
> I started a Julia script that processes a very large set of data, by
> reading a large number (100k) of quite small text files, doing some
> calculations, and aggregating the results. After running for a while I've
> noticed that there seems to be some memory management issues, that I
> suspect are just inefficient garbage collection. With some pseudo-elements,
> my script does something like this:
>
> function process_all_the_stuff()
>     results1 = Float64[]
>     results2 = Float64[]
>     for i in 1:100000
>         thisdata = read_text_file_with_index(i)
>         thisresult1 = do_calculation_1(thisdata)
>         thisresult2 = do_calculation_2(thisdata)
>         push!(results1, thisresult1)
>         push!(results2, thisresult2)
>     end
>     results1, results2
> end
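> Concretely, the injection I have in mind would look like the loop below
> (the file reading is elided as a comment; the gc() call and the
> once-per-1000-files cadence are the hypothetical parts):

```julia
collections = 0
for i in 1:100000
    # ... read file i, run both calculations, push! the results ...
    if i % 1000 == 0
        gc()              # hypothetical injected collection
        collections += 1
    end
end
```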
>
> I've come about halfway, and htop looks like this:
>
>
> <https://lh3.googleusercontent.com/-rFSwZ9UtvIg/U4SvG5EL4xI/AAAAAAAAAMY/QYYbNCv-6l0/s1600/htop.png>
>
> As you see, I'm about to run out of memory. Is there any way I can
> "inject" a call to gc(), say, at the end of the loop body, without
> interrupting the script and losing all the work done so far? Or will
> Julia do so when (if) she realizes memory is (too) scarce?
>
> If there isn't a way to do this, see this as the first step toward a
> feature request :P
>
> // Tomas
>