On 19.08.12 06:59, tamouse mailing lists wrote:
> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
> <[email protected]> wrote:
>> I want to parse this text and count the occurrence of each word:
>>
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html;
>> #Can I do this?
>> $stripping = strip_tags($text); #get rid of html
>> $stripping = strtolower($stripping); #put in lowercase
>>
>> ----------------
>> First of all I want to start AFTER the expression "News Releases" and stop
>> BEFORE the next occurrence of "-30-"
>>
>> #This may occur an undetermined number of times on
>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>
>>
>> ----------------
>> Second, do I put $stripping into an array to separate each word by each
>> space " "?
>>
>> $stripping = implode(" ", $stripping);
>>
>> ----------------
>> Third how do I count the number of occurrences of each word?
>>
>> Sample Output:
>>
>> determined = 4
>> fire = 7
>> patrol = 3
>> theft = 6
>> witness = 1
>> witnessed = 1
>>
>> ----------------
>> <?php
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>> #echo strip_tags($text);
>> #echo "\n";
>> $stripping = strip_tags($text);
>>
>> #Get text between "News Releases" and stop before the next occurrence of
>> "-30-"
>>
>> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
>> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the
>> occurrences of double spaces
>>
>> #$stripping = strtolower($stripping);
>>
>> #Where do I go now?
>> ?>
>>
>>
>> --
>> PHP General Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
> This is usually a first-year CS programming problem (word frequency
> counts) complicated a little bit by needing to extract the text.
> You've started off fine, stripping tags, converting to lower case,
> you'll want to either convert or strip HTML entities as well, deciding
> what you want to do with plurals and words like "you're", "Charlie's",
> "it's", etc, also whether something like RFC822 is a word or not
> (mixed letters and numbers).
>
> When you've arranged all that, splitting on white space is trivial:
>
> $words = preg_split('/[[:space:]]+/',$text);
>
> and then you just run through the words building an associative array
> by incrementing the count of each word as the key to the array:
>
> foreach ($words as $word) {
>     $freq[$word]++;
> }
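Please add an existence check to avoid incrementing array keys that are not set yet:

$freq = array();
foreach ($words as $word) {
    if (array_key_exists($word, $freq)) {
        // Word seen before: bump its counter.
        $freq[$word]++;
    } else {
        // First occurrence: start counting at 1.
        $freq[$word] = 1;
    }
}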
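
As a side note, PHP's built-in array_count_values() produces the same tally in one call, so no existence check is needed. A minimal sketch, assuming the text has already been stripped of tags and lowercased; the helper name count_words and the sample string are made up for illustration:

```php
<?php
// Hypothetical helper: count how often each word occurs in a plain-text string.
function count_words($text)
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops the empty
    // strings that leading or trailing whitespace would otherwise produce.
    $words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // array_count_values() builds the word => count map directly.
    return array_count_values($words);
}

// Made-up sample input, not taken from the real page.
$freq = count_words("fire theft fire patrol theft fire");
ksort($freq);
print_r($freq);
```

Whether to use the explicit loop or the built-in is mostly a matter of taste; the loop is easier to extend if words need extra filtering while counting.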
>
> For output, you may want to sort the array:
>
> ksort($freq);
>
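
Putting the pieces together, the whole task might look roughly like the sketch below. It is only one way to do it and rests on assumptions: the markers "News Releases" and "-30-" appear literally in the stripped text, a "word" is any run of letters, digits or apostrophes, and the helper names extract_sections and word_frequencies as well as the sample markup are invented for the example.

```php
<?php
// Hypothetical helper: return the text of every section that starts after
// "News Releases" and ends before the next "-30-".
function extract_sections($plain)
{
    // Non-greedy match so each section stops at the first "-30-" that
    // follows; the "s" modifier lets "." match newlines as well.
    preg_match_all('/News Releases(.*?)-30-/s', $plain, $matches);
    return $matches[1];
}

// Hypothetical helper: strip markup, normalise case, and tally the words.
function word_frequencies($html)
{
    // Strip tags first, then decode entities such as &amp; or &eacute;.
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    $freq = array();
    foreach (extract_sections($plain) as $section) {
        // Assumption: a word is a run of letters, digits or apostrophes.
        preg_match_all("/[a-z0-9']+/", strtolower($section), $m);
        foreach ($m[0] as $word) {
            if (isset($freq[$word])) {
                $freq[$word]++;
            } else {
                $freq[$word] = 1;
            }
        }
    }
    ksort($freq); // alphabetical output order
    return $freq;
}

// Made-up sample markup standing in for the real page.
$html = "<h1>News Releases</h1><p>Fire and theft.</p>-30-"
      . "<p>ignored</p>News Releases<p>Fire patrol</p>-30-";

foreach (word_frequencies($html) as $word => $count) {
    echo "$word = $count\n";
}
```

To run this against the real page, the markup has to be fetched first, for example with file_get_contents($url), which only works for URLs when allow_url_fopen is enabled. Note that strip_tags() does not insert spaces where tags were, so words in adjacent elements can run together; replacing tags with a space before stripping is one way around that.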
--
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3
Tel.: 0174 / 9722336
e-Mail: [email protected]
Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal
http://www.behnke.biz

