"same representation" doesn't have to mean that the representation doesn't have magic internally.
It just means that if you put the same content into three different kinds of vectors, you plausibly ought to see roughly the same thing go out over the wire.

This is subject to a few caveats, such as the fact that a dense vector doesn't really know whether it has only a few non-zero elements. I would be happy if the serialized form decided that it had lots of non-zeros and thus could do away with writing all of the indexes.

It might also be that we should write the indexes using a compressed bit vector format such as run-length encoding. That gives low overhead for both very sparse and very dense vectors.

On Tue, Jan 5, 2010 at 8:38 AM, Jake Mannix <jake.man...@gmail.com> wrote:

> > I would imagine the serialized form of a vector is the same for
> > SparseVector, DenseVector, etc. There's no question of representation.
> > You write out all the non-default elements.
>
> This will be twice as large in the dense case (there's no need to write out
> indices). Ok, not twice as large but size() * (4 + 8) instead of size() * 8.
> That's a pretty significant cost in terms of disk space and IO time.

--
Ted Dunning, CTO
DeepDyve
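
For anyone who wants to put numbers on the trade-off in this thread, here is a rough sketch in Python. It assumes a 4-byte index and an 8-byte value, as in the size() * (4 + 8) figure quoted above. The function names are made up for illustration and are not part of any existing vector API, and the run-length scheme is only one possible reading of "compressed bit vector format".

```python
# Hypothetical helpers; sizes assume a 4-byte int index and an 8-byte double.

def dense_bytes(size):
    """Bytes needed when every element is written and no indexes are stored."""
    return size * 8


def indexed_bytes(nonzeros):
    """Bytes needed when each non-default element is written as (index, value)."""
    return nonzeros * (4 + 8)


def choose_encoding(size, nonzeros):
    """Pick the smaller of the two forms based on content, not on vector class."""
    return "dense" if dense_bytes(size) <= indexed_bytes(nonzeros) else "indexed"


def run_lengths(values):
    """Run-length encode the zero / non-zero pattern of a vector.

    Returns alternating run lengths, always starting with a run of zeros
    (which has length 0 if the first element is non-zero).
    """
    runs = []
    current_is_nonzero = False
    count = 0
    for v in values:
        is_nonzero = v != 0.0
        if is_nonzero == current_is_nonzero:
            count += 1
        else:
            runs.append(count)
            current_is_nonzero = is_nonzero
            count = 1
    runs.append(count)
    return runs
```

With these assumptions, a fully populated vector of 1000 elements costs 8000 bytes in dense form and 12000 bytes in indexed form, and the indexed form only wins when fewer than two thirds of the elements are non-zero. The run-length pattern stays short at both extremes: a fully dense vector yields just two runs, and a vector with a single non-zero element yields at most three.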