> Hello all, I have a file of 3,210,008 CSV
> records, and I need to take a random sample
> of it. I tried hacking something together a
> while ago, but it only seemed to produce
> 65,536 distinct records, repeated. When I
> need a 5-million sample, that's a problem.
> 
> Here is my old code. I know the logic
> allows duplicates, but what would cause
> the limit?  I'd think that with 500,000
> samples there wouldn't be any problem getting
> more than 65,536 distinct records, but that
> number is too suspicious to be a coincidence.

Don't laugh too much, but if your data is still
in the same order it was extracted in, then this
will probably suffice:

open (FILE,"consumer.sample.sasdump.txt");
open (NEW,">consumer.new");

my $probability;

while (<FILE>) {
    print NEW if $probability > rand;
} 

close(FILE);
close(NEW);

__END__

Even if it doesn't, it at least solves the problem
of duplicates.  Then you can shuffle the elements
to get your data set.  There must be a decent
shuffle algorithm someplace, since I haven't
thought of one yet; splicing elements out of the
middle of an array just doesn't appeal.
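
For what it's worth, the usual answer seems to be
the Fisher-Yates shuffle: walk the array from the
end and swap each element with a randomly chosen
element at or before it.  A rough sketch, shuffling
an array reference in place (the sub name is just
for illustration):

sub fisher_yates_shuffle {
    my $array = shift;                       # array reference
    for (my $i = @$array - 1; $i > 0; $i--) {
        my $j = int rand($i + 1);            # random index 0 .. $i
        @$array[$i, $j] = @$array[$j, $i];   # swap the two slots
    }
}

Read the sampled lines into an array, call
fisher_yates_shuffle(\@lines), and print them back
out in the new order.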

The limit at 65,536 is probably a bug; it should be
much higher than that :)  On the other hand, maybe
they never thought someone would be using such big
files with that feature.
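
If your old code uses Perl's rand(), one thing
worth checking is how many bits of randomness a
single call actually gives on your build; 16 bits
means only 65,536 possible values.  A quick check:

# How many bits does rand() return on this build of Perl?
# A value of 15 or 16 here would explain the pattern.
use Config;
print "randbits = $Config{randbits}\n";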

Anyway, take care,

Jonathan Paton
