If splitting the data into files is useful and necessary, and I agree that keeping file sizes under a GB sounds nice, then it's got to be split somehow. Might as well split on some natural dimension (user ID or something) rather than randomly chunking. If distribution is a concern, splitting this way makes it no worse than random chunking would; if it isn't, the natural split is simply a convenience.
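To make the idea concrete, here is a rough sketch of splitting on user ID so that all of one user's records land in the same file. Everything in it is an assumption for illustration: the (user_id, item_id, value) record layout, the helper names shard_for_user and partition_by_user, and the choice of hashing the ID rather than cutting on ID ranges. It is not how any particular job does it.

```python
import zlib
from collections import defaultdict


def shard_for_user(user_id, num_shards):
    """Map a user ID to a shard index (hypothetical helper).

    zlib.crc32 is used because it is stable across runs, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_shards


def partition_by_user(records, num_shards):
    """Group (user_id, item_id, value) records by shard index.

    All records for a given user end up in the same shard, which is
    the point of splitting on a natural dimension.
    """
    shards = defaultdict(list)
    for user_id, item_id, value in records:
        shards[shard_for_user(user_id, num_shards)].append(
            (user_id, item_id, value)
        )
    return dict(shards)


# Made-up sample data, only to show the call.
records = [(1, 10, 3.0), (2, 10, 4.5), (1, 11, 2.0), (3, 12, 5.0)]
shards = partition_by_user(records, num_shards=4)
```

Each shard would then be written to its own file. Note that a very active user still lands entirely in one shard, so this does nothing by itself about skew.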
On Mon, Apr 26, 2010 at 8:14 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Exactly. I would find skewed data a pain in the butt for statistical analysis.
>