I don't see anything glaringly inefficient in the Perl. It would be fine if you were dealing with 1 million items, but at 80+ million you could easily be pushing up against your system's limits with the generic data structures Perl uses, especially since Perl is probably storing 64-bit floats and ints, and storing the hash keys twice (because you have two hashes, %hash and %stat).
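If you want to check that theory, Devel::Size from CPAN (assuming you can install it on that box) will tell you what a structure of that shape actually costs. A rough probe along these lines -- I haven't run it against your data, it's only meant to give an order of magnitude:

use strict;
use warnings;
use Devel::Size qw(total_size);

# Build a small sample shaped like your %hash: item => { total => ..., count => ... }
my %hash;
for my $i (1 .. 100_000) {
    my $item = sprintf "%010d", $i;          # 10-character item ids, like the CSV
    $hash{$item}{total} += 4.5;
    $hash{$item}{count} += 1;
}

# total_size() walks the whole structure, keys and inner hashes included
printf "100_000 items: %.1f MB\n", total_size(\%hash) / (1024 * 1024);
# Multiply by roughly 800 to estimate the cost of 80 million items.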
You could try the Perl Data Language, PDL, to create large typed arrays with minimal overhead. However, I think a more Perlish approach would be to store the data in a single hash, perhaps using pack/unpack so each entry holds 32-bit floats and integers instead of full Perl scalars. Then, instead of sort, run through the whole collection once and build your own top-20 list (or top 50, or whatever) by hand. That way the final step of picking out the top 20 doesn't allocate new storage for all 80 million items. Does that make sense? I've put a rough, untested sketch of what I mean at the bottom of this message.

David

On Sun, Apr 17, 2022, 5:33 AM wilson <i...@bigcount.xyz> wrote:

> hello the experts,
>
> can you help check my script and suggest how to optimize it?
> Currently it fails with "Out of memory".
>
> $ perl count.pl
> Out of memory!
> Killed
>
> My script:
>
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open HD,"rate.csv" or die $!;
> while(<HD>) {
>     my ($item,$rate) = (split /\,/)[1,2];
>     $hash{$item}{total} += $rate;
>     $hash{$item}{count} += 1;
> }
> close HD;
>
> for my $key (keys %hash) {
>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>     print "$_: $stat{$_}\n";
>     last if $i == 99;
>     $i++;
> }
>
> The purpose is to aggregate and average each itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million items:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1992         152          76           0        1763        1700
> Swap:           1023         802         221
>
> What confuses me is that Apache Spark gets this job done with this
> limited memory. It finished the statistics within 2 minutes. But I want
> to give perl a try, since it's not always convenient to run a Spark job.
>
> The Spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100|     5.0|
> |0001543849|     5.0|
> |0001061127|     5.0|
> |0001019880|     5.0|
> |0001062395|     5.0|
> |0000143502|     5.0|
> |000014357X|     5.0|
> |0001527665|     5.0|
> |000107461X|     5.0|
> |0000191639|     5.0|
> |0001127748|     5.0|
> |0000791156|     5.0|
> |0001203088|     5.0|
> |0001053744|     5.0|
> |0001360183|     5.0|
> |0001042335|     5.0|
> |0001374400|     5.0|
> |0001046810|     5.0|
> |0001380877|     5.0|
> |0001050230|     5.0|
> +----------+--------+
> only showing top 20 rows
>
> I think it should be possible to optimize my perl script to run this
> job as well, so I'm asking for your help.
>
> Thanks in advance.
>
> wilson
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
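Here is the sketch I mentioned above. It's untested, and it assumes the four-column CSV layout you showed (userId,itemId,rate,time). The idea is one flat hash whose values are the running total and count packed into a single 8-byte string (a 32-bit float plus a 32-bit unsigned int), and a hand-rolled top-20 pass instead of sorting all 80 million keys:

use strict;
use warnings;

# One flat hash: item id => packed (32-bit float total, 32-bit unsigned count).
# No nested hashes and no second %stat hash, so each key is stored only once.
my %data;

open my $hd, '<', 'rate.csv' or die $!;
while (<$hd>) {
    my ($item, $rate) = (split /,/)[1, 2];
    my ($total, $count) = $data{$item} ? unpack('f L', $data{$item}) : (0, 0);
    $data{$item} = pack 'f L', $total + $rate, $count + 1;
}
close $hd;

# Keep a running top-20 rather than sorting every key at once.
# @top holds [average, item] pairs, highest average first.
my @top;
while (my ($item, $packed) = each %data) {    # each() avoids building a full key list
    my ($total, $count) = unpack 'f L', $packed;
    my $avg = $total / $count;
    next if @top == 20 && $avg <= $top[-1][0];
    push @top, [$avg, $item];
    @top = sort { $b->[0] <=> $a->[0] } @top;
    splice @top, 20 if @top > 20;             # trim back to the best 20
}

printf "%s: %s\n", $_->[1], $_->[0] for @top;

The pack/unpack on every line trades some CPU for memory. Even so, 80 million keys in a Perl hash is still a lot for 2 GB, so if this version also gets killed, PDL's typed arrays are probably the next thing to try.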