GH's contribution does not seem particularly relevant.

However, I may add that in PL/I one trivial statement will count
the number of words when any single character (such as blank)
is the delimiter.
To permit a set of other characters to act as delimiters, a prior call to the
TRANSLATE built-in would be required to convert them
to a single character (such as blank).
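
The PL/I statement itself is not shown here, so the following is only a Java sketch of the same two-step idea: first map every extra delimiter to a blank, then count blank-delimited words. The class and method names (TranslateThenCount, toBlanks, countWords) are made up for illustration.

```java
import java.util.StringTokenizer;

class TranslateThenCount {
    // Replace every character found in `delimiters` with a blank,
    // loosely mirroring a translate step done before counting.
    static String toBlanks(String text, String delimiters) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(delimiters.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString();
    }

    // Count blank-delimited words; a run of blanks acts as one separator.
    static int countWords(String text) {
        return new StringTokenizer(text, " ").countTokens();
    }

    public static void main(String[] args) {
        String line = "one,two;three  four";
        System.out.println(countWords(toBlanks(line, ",;"))); // prints 4
    }
}
```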


----- Original Message ----- From: "glen herrmannsfeldt" <[email protected]>
To: <[email protected]>
Sent: Friday, June 15, 2018 1:33 PM


Nothing against discussions on how to write fast code, but I don’t believe that this is normally necessary.

About 20 years ago, I was counting words, not just how many, but how many of each word, on gigabytes of text. (Full text US patents for two years.) I did it in Java (with JIT compiler on), and it was plenty fast enough.

I did it using the Java StringTokenizer: https://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

which takes a string of delimiter characters (not a regular expression). Then each word found was either added to a Hashtable, or the count for it was incremented.
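
A minimal sketch of that tokenize-and-count loop, not the original program. Note that the two-argument StringTokenizer constructor treats its second argument as a set of single delimiter characters. The delimiter set, the class name WordFrequency and the method countLine are assumptions made for illustration.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.StringTokenizer;

class WordFrequency {
    // Assumed delimiter characters; the set used originally is not given.
    static final String DELIMS = " \t\n\r.,;:!?()\"";

    // Split one line into words and update the per-word counts.
    static void countLine(String line, Map<String, Integer> counts) {
        StringTokenizer st = new StringTokenizer(line, DELIMS);
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            Integer seen = counts.get(word);
            // New word: add it with count 1; known word: increment.
            counts.put(word, seen == null ? 1 : seen + 1);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new Hashtable<>();
        countLine("the cat saw the dog", counts);
        countLine("The dog ran.", counts);
        System.out.println(counts.get("the")); // 2 ("The" is a separate key)
        System.out.println(counts.get("dog")); // 2
    }
}
```

Counting is case-sensitive in this sketch; lower-casing each word before the lookup would merge "The" and "the".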

As computers are much faster now, it should be able to do terabytes of text today.

There was one non-obvious thing about the Java code, though. It seems that the way Java normally does substrings is with a reference to the whole character array, which in my case was a line of text. That filled up memory faster than it should have. Using new String() on each word fixed that problem. (It only does that for the actual entry in the hash table.)
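
A rough sketch of that workaround, assuming the older substring behaviour described above. The class name CompactKeys and the method count are hypothetical. On JDKs from 7u6 onward, substring already copies its characters, so the extra copy is redundant there.

```java
import java.util.HashMap;
import java.util.Map;

class CompactKeys {
    // Count one word, copying it only when it becomes a new key.
    static void count(String word, Map<String, Integer> counts) {
        Integer seen = counts.get(word);
        if (seen == null) {
            // First occurrence: store an independent copy so the map
            // does not keep the whole source line reachable.
            counts.put(new String(word), 1);
        } else {
            // Key already present: the map keeps its existing key object
            // and only the value is replaced.
            counts.put(word, seen + 1);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        String line = "alpha beta alpha";
        count(line.substring(0, 5), counts);
        count(line.substring(6, 10), counts);
        count(line.substring(11), counts);
        System.out.println(counts.get("alpha")); // 2
    }
}
```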

But if you do have exabytes of text, then there might be a need for assembly speed-up.
Well, OK, petabytes are enough.

Oh, you might also look at the Unix wc command, which counts words. (More specifically, the GNU utilities version, with source available.)

About 25 years ago, I compiled the GNU utilities (as they then existed) to run on my OS/2 system. (That was before Linux and such, which are so convenient today.)
