Hi,

I have a dataset of many (about 30 million) observations of the type

Tuple{Person, Array{DataA,1}, Array{DataB,1}}

where

immutable Person # simplified
  id::Int32
  female::Bool
  age::Int8
end

immutable DataA
  startdate::Int32
  enddate::Int32
  kind::Int8 # "type" is a reserved word
  extra::UInt8
end

and DataB is similar but has different fields. The vectors of DataA and
DataB have about 2-15 elements each; the length varies between
observations.
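For concreteness, here is the kind of field-by-field dump I have in
mind (a minimal sketch; writefields, write_vec and write_obs are just
illustrative names, and I am assuming all fields are plain bits types):

# write each field of a flat immutable separately, skipping any
# in-memory alignment padding; returns the number of bytes written
writefields(io, x) =
    sum(f -> write(io, getfield(x, f)), fieldnames(typeof(x)))

# length-prefixed vector (2-15 elements fits comfortably in an Int8)
function write_vec(io, v)
    n = write(io, Int8(length(v)))
    for x in v
        n += writefields(io, x)
    end
    n
end

# one observation; the byte count is useful for building an index later
write_obs(io, p::Person, as::Vector{DataA}, bs::Vector{DataB}) =
    writefields(io, p) + write_vec(io, as) + write_vec(io, bs)

If I am not mistaken, sizeof(Person) is 8 and sizeof(DataA) is 12 due
to alignment padding, but written field by field they take only 6 and
10 bytes per record.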

I would like to dump this data in the most compact format. The
intention is to read it back later and produce various summary
statistics for each observation, i.e. a mostly linear traversal of this
file many times, but...

... it would be great if I could look up an observation with a
particular id field in Person without scanning the whole file, but this
is not absolutely necessary if the overhead would be large.
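The read side would just mirror that record layout; a sketch, assuming
the default positional constructors (readstruct, readvec and read_obs
are again just illustrative names):

# read a flat immutable field by field, in declaration order
readstruct(io, T) = T([read(io, FT) for FT in T.types]...)

# length-prefixed vector, matching write_vec above
readvec(io, T) = [readstruct(io, T) for i in 1:read(io, Int8)]

read_obs(io) = (readstruct(io, Person),
                readvec(io, DataA),
                readvec(io, DataB))

With an offset index (see the sketch further below), a single lookup
would be a seek to the stored offset followed by read_obs, though as
far as I understand gzseek has to decompress from the start of the
file, so random access on the compressed stream would be slow anyway.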

It does not matter if the format is not compatible between Julia
versions; given the data size, I just want to save space (the data can
be regenerated before dumping, which takes a few hours).

The question is: what do you recommend in Julia? A "homebrew database"
would be

1. opening a stream with GZip.open,
2. using write to dump the data,
3. optionally saving the file position of each observation in a
separate index file (a sketch follows below).
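Putting the three steps together, something like this is what I am
thinking of (again a sketch; dump_all is a made-up name, and I track
uncompressed offsets by summing the byte counts that write returns,
rather than assuming position works on a GZipStream):

using GZip

# dump all observations to a gzip stream, returning an
# id => uncompressed-offset index kept on the side
function dump_all(path, observations)
    index = Dict{Int32,Int}()
    pos = 0
    io = GZip.open(path, "w")
    for (p, as, bs) in observations
        index[p.id] = pos            # offset of this record
        pos += write_obs(io, p, as, bs)
    end
    close(io)
    index                            # save separately, e.g. serialize
end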

This is not much to implement; however, if I could get the same thing
with a bit of tweaking of something more standard, I would be happier.
I could use HDF5, but I am concerned about the overhead (I have not
tried it).

Thanks,

Tamas
