Here is the Perl subroutine:
use File::Basename qw(basename);

sub count_bigrams_from_idlist {
    my ($list_filename, $fanfic_dir, $dest_dir) = @_;

    # Three-argument open, with the actual filename in the error message.
    open(my $in, '<', $list_filename)
        or die "Could not open $list_filename: $!\n";
    my $short_filename = basename($list_filename, ".txt");
    print("$short_filename\n");

    # Each line of the list file is a file ID; turn it into a path.
    my @filenames = ();
    while (my $line = <$in>) {
        chomp($line);
        push(@filenames, "$fanfic_dir/$line.txt");
    }
    close $in;

    # This is the call that fails with "too many arguments" once
    # @filenames exceeds the kernel's ARG_MAX limit; see the sketch below.
    system($^X, "/Users/cat/perl5/bin/huge-count.pl",
           "--token=valid_tokens.txt", "--tokenlist",
           $dest_dir, @filenames) == 0
        or die "huge-count.pl failed: $?\n";

    # huge-count.pl writes its final merged counts to
    # DESTINATION/huge-count.output (the original renamed "huge-count.pl",
    # which is the script's name, not its output file).
    rename("$dest_dir/huge-count.output", "${short_filename}_count.txt")
        or die "Could not rename count file: $!\n";
}
 

$list_filename is a file where each line contains the name of a file to read in (call it a list file). The number of lines (i.e., the number of files to group together) varies. Of the list files I've tried so far, the biggest has 81101 lines; it failed to run because there were too many arguments. That isn't necessarily my largest list, since I haven't counted the lines in every list file. The smallest I've seen is 245 lines, and it ran fine. The smallest I've seen fail is 5592 lines. Is that enough information to get a sense of what I am trying to do?
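
The "too many arguments" failure is the operating system's ARG_MAX limit on the total size of a command line: with tens of thousands of filenames, the exec of huge-count.pl exceeds it. One workaround that might avoid the limit entirely is to symlink the files into a temporary directory and pass that single directory, since huge-count.pl's documentation says SOURCE may be a directory as well as a list of files. A minimal sketch of that idea (the symlink step is my own addition, not part of NSP):

    use File::Basename qw(basename);
    use File::Spec;
    use File::Temp qw(tempdir);

    # ARG_MAX workaround sketch: instead of passing tens of thousands
    # of filenames on the command line, symlink them into a temporary
    # directory and pass that one directory as the SOURCE argument.
    # Assumes huge-count.pl accepts a directory as SOURCE.
    my $tmpdir = tempdir(CLEANUP => 1);
    foreach my $filename (@filenames) {
        # symlink targets should be absolute, or the links may dangle
        my $target = File::Spec->rel2abs($filename);
        symlink($target, "$tmpdir/" . basename($filename))
            or die "Could not symlink $filename: $!\n";
    }
    system($^X, "/Users/cat/perl5/bin/huge-count.pl",
           "--token=valid_tokens.txt", "--tokenlist",
           $dest_dir, $tmpdir) == 0
        or die "huge-count.pl failed: $?\n";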
 

Now that I think about it, I have a related question about huge-count.pl. How can I filter by frequency at the combine step? I don't need an n-gram to have a frequency greater than one in every individual file, but I do want it to have a total frequency greater than one across all the files. How would I filter for that?
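
If the --frequency cutoff is applied to each split before the merge, it would drop n-grams whose per-split counts are 1 even when the combined total is higher, so the safest place to filter seems to be after the final merge. A minimal post-filter sketch, assuming the merged file follows count.pl's output format (the first line is the total bigram count, and each following line looks like word1<>word2<>n11 n1p np1, where n11 is the bigram's total joint frequency):

    use strict;
    use warnings;

    # Post-merge filter sketch (my own addition, not an NSP option):
    # keep only bigrams whose total frequency across all files is > 1.
    my ($merged_file, $filtered_file) = @ARGV;
    open(my $in,  '<', $merged_file)   or die "Could not open $merged_file: $!\n";
    open(my $out, '>', $filtered_file) or die "Could not open $filtered_file: $!\n";

    # Copy the first line (total bigram count) through unchanged so
    # downstream tools such as statistic.pl still see the sample size.
    print $out scalar <$in>;

    while (my $line = <$in>) {
        my @parts = split /<>/, $line;      # counts follow the last <>
        my ($n11) = split ' ', $parts[-1];  # first number is the joint frequency
        print $out $line if defined $n11 && $n11 > 1;
    }
    close $in;
    close $out;

Run as, e.g., perl filter_counts.pl merged_count.txt filtered_count.txt (the script name is hypothetical).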
 


 

