Re: [DOTNET] Using BitVector32

Ed Stegman Sun, 14 Apr 2002 19:59:20 -0700

You didn't mention whether the string is a single word or a document containing
many words. If it's many words, then I'd take a different approach.


Use 3 BitVector[] arrays.

1. BitVector[] wordsIndex
Each bit corresponds to a word in your doc. If the word contains any upper case
characters, set the bit to 1. Most documents have a small percentage of words
with upper case characters. This array keeps you from having to allocate and map
BitVectors for words without any upper case characters, and will save you tons
of cycles when you restore the upper case characters later. You simply skip
processing for any words associated with a zero bit. :-))

2. BitVector[] firstOnly
Each bit corresponds to a set bit (value == 1) in the wordsIndex[] array above.
If the only upper case character is the first character, set the bit to 1. The
vast majority of words with upper case characters are words where only the first
character is upper case. No need to create and map a BitVector for each word if
you know that all you have to do is convert the first character. When restoring,
if the bit corresponding to the word is set to 1, then upper case the first
character and move on to the next word.

3. BitVector[] mappedWords
Your original concept in action. Each BitVector corresponds to a 0 bit in the
firstOnly[] array above.

I'll use paragraph #3 above to illustrate the concept:

There are 19 words.

The wordsIndex[] array will hold just a single BitVector because there are only
19 words. Only the 19 least significant bits in the BitVector have meaning. I'm
leaving off the unused most significant bits for clarity. (Least significant
bit == 1st word in sentence.)

wordsIndex[0] = 001 0000 0001 1000 0111 // binary

The firstOnly[] array will also hold just a single BitVector because there are
only 6 words with upper case letters. Only the 6 least significant bits have any
meaning.

firstOnly[0] = 00 1100 // binary

The mappedWords[] array will hold 4 BitVectors, one for each of the 6 meaningful
bits in firstOnly[0] with a zero value. (Least significant bit == 1st character
in word.)

mappedWords[0] = 0 0000 1001 // "BitVector"
mappedWords[1] = 000 0100 0000 // "mappedWords"
mappedWords[2] = 0 0000 1001 // "BitVector"
mappedWords[3] = 0 0010 0000 // "firstOnly"
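
For what it's worth, here is a minimal sketch of how the three arrays might be
built. It is only one way to do it, under some assumptions: the names
(CasingEncoder, Encode, Pack) are made up for illustration, it uses BitVector32
from System.Collections.Specialized, words are taken to be separated by single
spaces and to be at most 32 characters long, and bit i of a vector is addressed
with the mask 1 << i.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;

// Hypothetical helper for illustration; not part of the framework.
public static class CasingEncoder
{
    // Packs flags into BitVector32 values, 32 flags per vector.
    // The BitVector32 indexer expects a mask, so flag i uses 1 << (i % 32).
    public static BitVector32[] Pack(List<bool> flags)
    {
        var vectors = new BitVector32[(flags.Count + 31) / 32];
        for (int i = 0; i < flags.Count; i++)
            vectors[i / 32][1 << (i % 32)] = flags[i];
        return vectors;
    }

    public static void Encode(string text,
        out BitVector32[] wordsIndex,
        out BitVector32[] firstOnly,
        out BitVector32[] mappedWords)
    {
        var hasUpper = new List<bool>();          // one flag per word
        var onlyFirst = new List<bool>();         // one flag per word with upper case
        var mapped = new List<BitVector32>();     // one map per word needing a full map

        foreach (string word in text.Split(' '))
        {
            if (word.Length > 32)
                throw new ArgumentException("Sketch assumes words of at most 32 characters.");

            // Bit i of mask is set when character i is upper case.
            int mask = 0;
            for (int i = 0; i < word.Length; i++)
                if (char.IsUpper(word[i]))
                    mask |= 1 << i;

            hasUpper.Add(mask != 0);
            if (mask == 0) continue;              // nothing more to store for this word
            onlyFirst.Add(mask == 1);
            if (mask != 1) mapped.Add(new BitVector32(mask));
        }

        wordsIndex = Pack(hasUpper);
        firstOnly = Pack(onlyFirst);
        mappedWords = mapped.ToArray();
    }
}
```

Run over paragraph #3, this should give wordsIndex[0].Data == 0x10187,
firstOnly[0].Data == 0xC, and four mappedWords entries with Data values 9, 64,
9 and 32, which are the binary values listed above.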


Savings: Mapped 4 words instead of 19. That's pretty close to 80% savings (15 of
19) in complete word mappings. In a larger, more typical document the savings
would be even greater, because the example I used here included 4 camel-cased
variable names. Had I used the first paragraph from your post below as an
example, the mappedWords[] array would only contain 1 element. :-))
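
Going the other way, restoring could look something like the sketch below.
Again, the names (CasingRestorer, Restore, IsSet) are hypothetical, the words
are assumed to be at most 32 characters each, and bits are read with 1 << i
masks. Two running counters track the position in firstOnly[] and mappedWords[],
since those arrays only have entries for the words that need them.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical helper for illustration; not part of the framework.
public static class CasingRestorer
{
    // Reads flag i from an array of BitVector32, 32 flags per vector.
    static bool IsSet(BitVector32[] vectors, int i)
    {
        return vectors[i / 32][1 << (i % 32)];
    }

    public static string[] Restore(string[] lowerWords,
        BitVector32[] wordsIndex,
        BitVector32[] firstOnly,
        BitVector32[] mappedWords)
    {
        var result = new string[lowerWords.Length];
        int upperSeen = 0;    // next bit to read in firstOnly[]
        int mappedSeen = 0;   // next element to read in mappedWords[]

        for (int w = 0; w < lowerWords.Length; w++)
        {
            string word = lowerWords[w];

            // Zero bit in wordsIndex: no upper case at all, skip the word.
            if (!IsSet(wordsIndex, w))
            {
                result[w] = word;
                continue;
            }

            // Set bit in firstOnly: upper case the first character and move on.
            if (IsSet(firstOnly, upperSeen++))
            {
                result[w] = char.ToUpperInvariant(word[0]) + word.Substring(1);
                continue;
            }

            // Otherwise apply the full per-character map.
            BitVector32 map = mappedWords[mappedSeen++];
            char[] chars = word.ToCharArray();
            for (int i = 0; i < chars.Length; i++)
                if (map[1 << i])
                    chars[i] = char.ToUpperInvariant(chars[i]);
            result[w] = new string(chars);
        }
        return result;
    }
}
```

Fed the all-lowercase words of paragraph #3 together with the three example
values (0x10187, 0xC, and the four maps 9, 64, 9, 32), this should hand back the
original casing.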

Keep Smilin'
Ed Stegman


-----Original Message-----
From: Mattias Konradsson

I have a solution where I need to store the casing of a string separately
from the string in the most efficient and least memory consuming way
possible, and later be able to take an all lowercase string and by applying
the casing getting the original string back, like "TestinG" is stored with
casing bits of "1000001".

I thought the smartest way was to use bit flags where a 1 represented
uppercase and 0 lowercase, and I took a look in the documentation and found
BitVector32 which seemed appropriate, however I'm getting some weird results
so I'm probably not getting the whole section/mask hoopla, here's some
sample code


    public string ApplyCasing(string inputStr, BitVector32 casing)
    {
        string result = "";
        for (int i = 0; i < inputStr.Length; i++)
        {
            if (casing[i])
                result += Char.ToUpper(inputStr[i]);
            else
                result += inputStr[i];
        }
        return result;
    }


    public void Page_Load(Object sender, EventArgs e)
    {
        BitVector32 vector = new BitVector32(0);
        string testing = "TestinG";
        string testing1 = "testing";

        for (int i = 0; i < testing.Length; i++)
        {
            if (!Char.IsLower(testing[i]))
            {
                vector[i] = true;
            }
            else
            {
                vector[i] = false;
            }
        }

        Response.Write(ApplyCasing(testing1, vector));
    }


which unfortunately outputs "TeStInG " not "TestinG". So my question is:
what am I doing wrong, and is BitVector32 the smartest way to store casing
information like this?
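
On the "what am I doing wrong" part: as far as I can tell, the weird output
comes from the BitVector32 indexer, which takes a bit mask rather than a bit
position. With positions 0 to 6 passed as masks, the only write that sticks is
vector[6] = true, which ORs in binary 110. Reading back with masks 0, 2, 4 and 6
then returns true, hence "TeStInG". A minimal sketch of the same idea using
1 << i as the mask follows. The class and method names (CasingBits, ReadCasing)
are made up here, and a single BitVector32 only covers strings of up to 32
characters.

```csharp
using System;
using System.Collections.Specialized;
using System.Text;

// Hypothetical helper for illustration; not part of the framework.
public static class CasingBits
{
    // Records one bit per character: bit i is set when character i is upper case.
    public static BitVector32 ReadCasing(string s)
    {
        if (s.Length > 32)
            throw new ArgumentException("A BitVector32 holds at most 32 bits.");

        var vector = new BitVector32(0);
        for (int i = 0; i < s.Length; i++)
            vector[1 << i] = char.IsUpper(s[i]);   // mask for bit i, not the index i
        return vector;
    }

    // Upper-cases every character whose bit is set in the casing vector.
    public static string ApplyCasing(string inputStr, BitVector32 casing)
    {
        var result = new StringBuilder(inputStr.Length);
        for (int i = 0; i < inputStr.Length; i++)
            result.Append(casing[1 << i] ? char.ToUpperInvariant(inputStr[i]) : inputStr[i]);
        return result.ToString();
    }
}
```

With this, CasingBits.ApplyCasing("testing", CasingBits.ReadCasing("TestinG"))
should return "TestinG". BitVector32.CreateMask can be used to build the same
masks one after another if you'd rather not shift by hand.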

Best regards
----
Mattias Konradsson
"Reinventing the wheel since 1977"

You can read messages from the DOTNET archive, unsubscribe from DOTNET, or
subscribe to other DevelopMentor lists at http://discuss.develop.com.
