On Tuesday, October 11, 2016 at 3:39:55 PM UTC+2, Tamas Papp wrote:
> I have a dataset of many (about 30 million) observations of the type 
> Tuple{Person, Array{DataA,1}, Array{DataB,1}} 
> where 
> immutable Person # simplified 
> immutable DataA 
> I would like to dump this data in the most compact format.

Do you need a single file? If not, the data could be stored very 
efficiently in 3 files (person, dataA and dataB), laid out as follows:

The person file would be a big array of persons, each with two pointers, 
one into the dataA file and one into the dataB file, giving the index of 
the first dataA (or dataB) record that belongs to this person.

The dataA file would be a big array of dataA records, starting with the N 
records of the first person, then the M records of the second person, etc. 
The dataB file is laid out the same way.

You can access such files very effectively through Mmap.mmap. During 
writing, you need to keep track of the cumulative number of dataA and dataB 
records. The person file is random access, as per your request.
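
To make the layout concrete, here is a minimal sketch of the write and read sides. It is written for Julia 1.x (where `immutable` is spelled `struct`) and only covers dataA; dataB would get a second index field and a third file in the same way. The field names, `PersonRow`, `write_dataset`, `mmap_records` and `records_of` are all made up for illustration, since the real record definitions are not shown here.

```julia
using Mmap

# Hypothetical record layouts; the real fields are not shown in the thread.
struct Person
    id::Int32
end

struct DataA
    value::Float32
end

# One row of the person file: the person plus the 1-based index of its
# first record in the dataA file.
struct PersonRow
    person::Person
    first_a::Int64
end

# Write both files, keeping a running count of DataA records written so far.
function write_dataset(person_path, a_path, observations)
    written = 0
    open(person_path, "w") do pio
        open(a_path, "w") do aio
            for (person, as) in observations
                write(pio, Ref(PersonRow(person, written + 1)))
                write(aio, as)
                written += length(as)
            end
        end
    end
    return written
end

# Memory-map a file as a vector of fixed-size (isbits) records.
function mmap_records(path, ::Type{T}) where {T}
    n = div(filesize(path), sizeof(T))
    return open(io -> Mmap.mmap(io, Vector{T}, n), path)
end

# Records of person i: from its own first index up to just before the
# next person's first index (or the end of the file for the last person).
function records_of(rows, all_a, i)
    start = rows[i].first_a
    stop = i < length(rows) ? rows[i + 1].first_a - 1 : length(all_a)
    return view(all_a, start:stop)
end

dir = mktempdir()
person_path = joinpath(dir, "person.bin")
a_path = joinpath(dir, "dataA.bin")
observations = [
    (Person(1), [DataA(1.0), DataA(2.0)]),
    (Person(2), DataA[]),
    (Person(3), [DataA(3.0)]),
]
write_dataset(person_path, a_path, observations)

rows = mmap_records(person_path, PersonRow)
all_a = mmap_records(a_path, DataA)
third = records_of(rows, all_a, 3)   # records belonging to the third person
```

Looking up person `i` touches only row `i` and row `i + 1` of the person file, which is what makes the access random. Note that the files are raw memory dumps, so they are only readable with the same struct definitions and on a machine with the same byte order.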

You can replace the dataA/B pointers with record counts, which can be Int8 
in your case, saving extra bytes. But doing so prevents random access, 
because finding a person's records then requires summing the counts of all 
preceding persons.
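
A small sketch of what the count-based variant costs on the read side, assuming no person has more than 127 records (the Int8 limit). The `counts` values are made-up sample data:

```julia
# Per-person record counts, stored in one byte each (sample values).
counts = Int8[2, 0, 1]

# The first-record index of every person has to be rebuilt with a running
# sum over all preceding counts before any single person can be located.
first_index = cumsum(Int64.(counts)) .- counts .+ 1
```

If you need random access after all, you can compute `first_index` once after loading and keep it in memory, so the files stay small and only the in-memory copy carries the full-width offsets.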

Not sure I made myself clear. Let me know if not.

