On Sun, Apr 04, 2010 at 02:31:47PM -0700, drew wymore wrote:
> On Sun, Apr 4, 2010 at 12:31 PM, Michael Rasmussen <[email protected]> wrote:
> >
> > On Sun, Apr 04, 2010 at 12:10:03PM -0700, drew wymore wrote:
> >> I have a large data set that is being exported from an Oracle DB,
> >> unfortunately I can't work with the data directly in Oracle or this
> >> wouldn't be a problem. I can export it as CSV and work with it.
> >> ... I don't really care which language I
> >> do it in and whether I do it directly from csv or a database source
> >> other than Oracle (because I can't).
> >>
> >> Any clue sticks, ideas or links to something that might help me solve
> >> this problem appreciated.
> >
> > With apologies to Randal...
> >
> > Assume you export to CSV and, for the purposes of this simple example there
> > are no text fields that have commas embedded.
> >
> > And if the data of interest is in the third column:
> >
> > 3,14,word,blah,blech,bz
> > 4,18,term,more,stuff
> >
> > then:
> >
> > perl -ne '@F=split /,/; $words{$F[2]}++;
> >   END{ foreach $word (sort { $words{$a} <=> $words{$b} } keys %words)
> >   { print "$word\t$words{$word}\n"; } } ' file_of_data.csv
> >
> > Assuming you want it sorted by word frequency.
> >
> > Disclaimer: I'm at my in-laws for easter dinner and didn't test that.
> > I'm reasonably sure that it's close enough that any gaps will serve
> > as an exercise for the reader.
>
> Thanks Rich and Michael. I'll give the perl a shot and see what
> happens. As far as the data layout. It's 5 columns with roughly 1100
> rows, the column I'm interested in has a variable number of words per
> entry but doesn't exceed a couple hundred words.

Ah, so you'll need to parse that bit too.

Since you're interested in a word count, do you need the rest of the row
data? If not, why not just export the column of interest?

Assuming that you want the rest of it, save yourself headaches and visit
CPAN.org for Parse::CSV:
http://search.cpan.org/~adamk/Parse-CSV-1.00/lib/Parse/CSV.pm

So something like this may help:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %word_count;

    while (<>) {
        chomp;
        # simplistic, since the fields may also have commas
        # use Parse::CSV for real life stuff
        my @db_csv_fields = split /,/;
        next unless defined $db_csv_fields[2];

        # from the problem description the data of interest is in the 3rd field
        # and, ahem, we consider a "word" any character sequence that is not whitespace
        # (split ' ' also skips leading whitespace, so no empty "word" is counted)
        my @words = split ' ', $db_csv_fields[2];
        foreach my $w (@words) {
            $word_count{$w}++;
        }
    }

    # assuming you want the results sorted ascending by frequency
    foreach my $word ( sort { $word_count{$a} <=> $word_count{$b} } keys %word_count ) {
        print "$word\t$word_count{$word}\n";
    }

> I did enable fulltext searching within mysql which works fine for
> searching but doesn't give me the flexibility I'm looking for to
> actually just get a count of unique words. I did find something in PHP
> that is supposed to work but it's barfing on the array that's being
> returned by the mysql query.
>
> Drew-

--
Michael Rasmussen, Portland Oregon
Trading kilograms for kilometers since 2003
Be appropriate && Follow your curiosity
http://www.jamhome.us/
The Fortune Cookie Fortune today is:
Never be led astray onto the path of virtue.
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug
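
The script's own comment warns that a bare split /,/ breaks once a field
contains commas, which is likely for a free-text column. As a rough sketch
of that case only, the variant below uses the core Text::ParseWords module
rather than Parse::CSV, so it runs without a CPAN install. The sample rows
are made up for illustration, and the "word" definition (lowercased runs of
\w characters) is an assumption that differs from the whitespace-only rule
above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module, swapped in for Parse::CSV

# Hypothetical sample rows: the third field is quoted text with embedded commas.
my $sample = <<'END_CSV';
3,14,"red fish, blue fish",blah,blech
4,18,"one fish, two fish",more,stuff
END_CSV

my %word_count;

# Read the sample through an in-memory filehandle; a real run would open the export file.
open my $fh, '<', \$sample or die "cannot open sample data: $!";
while ( my $line = <$fh> ) {
    chomp $line;

    # parse_line splits on commas but leaves quoted fields whole;
    # the second argument (0) drops the surrounding quotes.
    my @fields = parse_line( ',', 0, $line );
    next unless defined $fields[2];

    # Assumption: a "word" is a lowercased run of word characters,
    # so "Fish," and "fish" are counted together.
    for my $w ( lc( $fields[2] ) =~ /\w+/g ) {
        $word_count{$w}++;
    }
}
close $fh;

# Ascending by frequency, ties broken alphabetically.
for my $word ( sort { $word_count{$a} <=> $word_count{$b} or $a cmp $b } keys %word_count ) {
    print "$word\t$word_count{$word}\n";
}
```

Text::ParseWords is not a full CSV parser: it also treats single quotes and
backslashes as special, so text with apostrophes or doubled "" quotes will
trip it up. For a real export, Parse::CSV as linked above remains the safer
route.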
