Hi Tim,

All data in a table for a given column family will be stored together on disk. Depending on your DFS blocksize, it will be read from disk in increments of 64MB (the Hadoop default), 8MB (the HBase-recommended value), and so on. It stands to reason that the more values you can pack into a block, the more efficient your scans will be. I would not expect much benefit for random-read usage patterns.
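As a rough, untested sketch of the size difference in question, using the org.apache.hadoop.hbase.util.Bytes helper from the client library (names may differ slightly in your release):

import org.apache.hadoop.hbase.util.Bytes;

public class ValueSizeSketch {
    public static void main(String[] args) {
        // 4-byte fixed-width encoding of an int value.
        byte[] asInt = Bytes.toBytes(1);
        // UTF-8 bytes of the equivalent string value.
        byte[] asString = Bytes.toBytes("observation");

        System.out.println("int value:    " + asInt.length + " bytes");    // 4
        System.out.println("string value: " + asString.length + " bytes"); // 11

        // Fewer bytes per cell means more cells fit in each block,
        // so a scan over the same rows touches fewer blocks.
    }
}

Keep in mind that each stored cell also carries its row key, column name, and timestamp, so the value itself is only part of the per-cell footprint.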
Taking that to its logical conclusion, you may want to enable block compression for the given table and column family or families. However, enabling compression is not recommended at this time: it is not well tested and may contribute to out-of-memory conditions under high load.

Also, smaller values require fewer bytes to transport from the regionserver to the client via RPC.

Another question I would ask myself: would the compact representation levy a tax on client-side processing? If so, would it take back any gains achieved at disk or over RPC? (See the sketch after the quoted message below.)

Hope that helps,

- Andy

> From: tim robertson <[email protected]>
> Subject: Column types - smaller the better?
> To: [email protected]
> Date: Saturday, December 27, 2008, 9:33 AM
> Hi all,
>
> Beginner question, but does it make sense to use the
> smallest data type you can in HBase?
>
> Is there much performance gain over say 1 Billion records
> saving new Integer(1) instead of new
> String("observation")?
>
> I am proposing to parse one column family into a new
> "parsed values" family, which would be these integer
> style types. If my guess is
> correct then there will be more rows in one region (correct
> terminology?) and therefore less shuffling around and
> faster scanning. Or am I way off the mark?
>
> Cheers,
>
> Tim
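P.S. A minimal, untested sketch of what that client-side tax might look like if the "parsed values" family stores integer codes; the code-to-label map is hypothetical, and the Bytes helper is assumed from the client library:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientDecodeSketch {
    // Hypothetical code-to-label dictionary the client would maintain.
    private static final Map<Integer, String> LABELS = new HashMap<Integer, String>();
    static {
        LABELS.put(1, "observation");
    }

    public static void main(String[] args) {
        // Pretend this byte[] came back from a scan of the parsed-values family.
        byte[] cellValue = Bytes.toBytes(1);

        // The "tax": decode the fixed-width int and map it back to a label.
        int code = Bytes.toInt(cellValue);
        String label = LABELS.get(code);
        System.out.println(code + " -> " + label);
    }
}

If the tax amounts to a fixed-width decode plus a map lookup, it is probably cheap relative to the disk and RPC savings, but it is worth measuring against your own client code.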
