In another thread I mentioned working with some largish data sets (tens of
gigabytes in the form of a .csv file). I have made some progress treating
the file as a memory-mapped Uint8 array, but that hasn't been as effective
as I had hoped. Using a shared array and multiple processes seems like an
effective way to parallelize the initial reduction of the .csv file.
The best way I have come up with to get a file's contents into a shared
array is
sm = convert(SharedArray, open(readbytes,"./kaggle/trainHistory.csv"))
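(For completeness, a sketch of the setup this assumes; the worker count here
is arbitrary. The workers need to be added before the SharedArray is created,
since the set of processes that map the array is fixed when it is
constructed.)

addprocs(CPU_CORES - 1)   # add local workers before creating the shared array
sm = convert(SharedArray, open(readbytes,"./kaggle/trainHistory.csv"))
length(sm.pids)           # number of processes that have the array mapped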
It would be convenient to process the contents on line boundaries. I can
determine suitable ranges with something like
function blocks(v::SharedVector{Uint8})
    np = length(v.pids)              # one block per participating process
    len = length(v)
    bsz = div(len, np)               # nominal block size in bytes
    blks = Array(UnitRange{Int}, np)
    low = 1
    for i in 1:np-1
        # extend block i to the next newline at or after position i*bsz
        eolpos = findnext(v, '\n', i*bsz)
        blks[i] = UnitRange(low, eolpos)
        low = eolpos + 1
    end
    blks[np] = UnitRange(low, len)   # the last block takes whatever remains
    blks
end
which in this case produces
julia> blocks(sm)
8-element Array{UnitRange{Int64},1}:
1:794390
794391:1588775
1588776:2383151
2383152:3177538
3177539:3971942
3971943:4766322
4766323:5560686
5560687:6355060
(This is a smaller file that I am using for testing. The real files are
much larger.)
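As a quick sanity check (just a sketch), every block except the last should
end on a newline byte:

all(r -> sm[last(r)] == uint8('\n'), blocks(sm)[1:end-1])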
These blocks will be different from what I would get with
sm.loc_subarray_1d. It seems to me that I should be able to use these
blocks rather than the .loc_subarray_1d blocks if I do enough juggling with
@spawnat, fetch, etc. Is there anything that would stand in the way of
doing so?
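To make the question concrete, the kind of juggling I have in mind is roughly
the following (just a sketch; count_commas is a hypothetical stand-in for
whatever per-block reduction is actually wanted, and it relies on the blocks
function above):

@everywhere function count_commas(v::SharedVector{Uint8}, r::UnitRange{Int})
    # the function must be defined on every process, hence @everywhere
    n = 0
    for i in r
        if v[i] == uint8(',')    # any per-byte work would go here
            n += 1
        end
    end
    n
end

blks = blocks(sm)
refs = Array(RemoteRef, length(blks))
for (i, p) in enumerate(sm.pids)
    # the shared data is visible on every pid in sm.pids, so any process
    # can be handed any line-aligned block, not just its local chunk
    refs[i] = @spawnat p count_commas(sm, blks[i])
end
counts = [fetch(r) for r in refs]

The per-block results could then be combined on the master process.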