If splitting the data into files is useful and necessary, and I agree
that keeping file sizes under a GB sounds nice, then it's got to be
split somehow. Might as well split on some natural dimension (user ID
or something) rather than randomly chunking. If distribution is a
concern, splitting this way makes it no worse than random chunking;
and if it's not, the natural split is a convenience.
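
As a rough sketch of what splitting on a natural dimension could look
like, here is one way to assign records to files by user ID. The record
layout (user ID, item ID, value), the function names, and the use of a
CRC32 hash are all assumptions for illustration; nothing in this thread
specifies them.

```python
import zlib
from collections import defaultdict


def shard_for(user_id, num_shards):
    """Map a user ID to a shard index (hypothetical helper).

    crc32 is used because it is stable across runs, unlike the
    built-in hash() on strings.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_shards


def partition_by_user(records, num_shards):
    """Group (user_id, item_id, value) records by the user's shard."""
    shards = defaultdict(list)
    for user_id, item_id, value in records:
        shards[shard_for(user_id, num_shards)].append((user_id, item_id, value))
    return dict(shards)
```

Every record for a given user ends up in the same file, which is the
convenience. A very heavy user would still make that one file larger
than the others, so file sizes are only as even as the users are.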

On Mon, Apr 26, 2010 at 8:14 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Exactly.  I would find skewed data a pain in the butt for statistical analysis.
>
