At the user meeting last night, stack noted that, since lots of us "lads" are seeing random-read performance improve when we use compression, perhaps a fresh look at making compression solid would be a good thing.
Personally I am just obsessed with on-disk efficiency. But also, I am chasing random-read latencies so I can serve a website out of hbase... if that isn't your need, then perhaps what you want to do would be just fine as is?

-ryan

On Thu, Jan 15, 2009 at 2:11 PM, tim robertson <[email protected]> wrote:

> > Until compression is super solid, I would be wary of storing text (xml,
> > html, etc) in hbase due to size concerns.
>
> Hmmm... Where do the indexing guys store their raw harvested records /
> HTML / whatever then?
>
> I guess mine would be coming in at 200G as text or so, per 100M
> records (maybe looking to 1Billion records over next 24 months). Can
> someone suggest a better place to store the records if not HBase? I
> want to be able to serve them as cached records, and also use them as
> sources for new indexes, without harvesting again. This is classic
> use case of HBase I thought... I mean, it is even on the HBase
> architecture page as the example table structure:
> http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture. Bit surprised
> to hear it is not recommended use.
>
> Cheers for pointers and sorry for the question bombardment - just
> trying to catch up.
>
> Tim
>
> On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
> > I think you were referring to my presentation.
> >
> > I was importing a CSV file, of 6 integers. Obviously in the CSV file, the
> > integers were their ASCII representation. So my code had to atoi() the
> > strings, then pack them into Thrift records, serialize those, and finally
> > insert the binary thrift rep into hbase with a key.
> >
> > I had 3 versions:
> > - thrift gateway - this was the slowest, doing 20m records in 6 hours.
> > The init code looks like:
> > transport = TSocket.TSocket(hbaseMaster, hbasePort)
> > transport = TTransport.TBufferedTransport(transport)
> > protocol = TBinaryProtocol.TBinaryProtocol(transport)
> > client = Hbase.Client(protocol)
> > transport.open()
> >
> > So using buffered transport, but no specific hbase API calls to set auto
> > flush or other params. This is in CPython.
> >
> > - HBase API version #1:
> > Written in Jython, this is substantially faster, doing 20m records in 70
> > minutes, or 4 per ms. This performance scales up to at least 6 processes.
> >
> > - HBase API version #2:
> > Slightly smarter, I now call:
> > table.setAutoFlush(False)
> > table.setWriteBufferSize(1024*1024*12)
> >
> > And my speed jumps up to between 30-50 inserts per ms, scaling to at least 6
> > concurrent processes.
> >
> > I then rewrote this stuff into a map-reduce and I can now insert 440m
> > records in about 70-80 minutes.
> >
> > As I move forward, I will be emulating bigtable and using either thrift
> > serialized records or protobufs to store data in cells. This allows you to
> > forward/backwards compatibly extend data within individual cells. Until
> > compression is super solid, I would be wary of storing text (xml, html, etc)
> > in hbase due to size concerns.
> >
> > The hardware:
> > - 4 cpu, 128 gb ram
> > - 1 tb disk
> >
> > Here are some relevant configs:
> > hbase-env.sh:
> > export HBASE_HEAPSIZE=5000
> >
> > hadoop-site.xml:
> > <property>
> > <name>dfs.datanode.socket.write.timeout</name>
> > <value>0</value>
> > </property>
> >
> > <property>
> > <name>dfs.datanode.max.xcievers</name>
> > <value>2047</value>
> > </property>
> >
> > <property>
> > <name>dfs.datanode.handler.count</name>
> > <value>10</value>
> > </property>
> >
> > On Wed, Jan 14, 2009 at 11:30 PM, tim robertson <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> I was skyping in yesterday from Europe.
> >> Being half asleep and on a bad wireless, it was not too easy to hear
> >> sometimes, and I have some quick questions to the person who was
> >> describing his tab file (CSV?) loading at the beginning.
> >>
> >> Could you please summarise quickly again the stats you mentioned?
> >> Number rows, size file size pre loading, was it 7 Strings? per row,
> >> size after load, time to load etc
> >>
> >> Also, could you please quickly summarise your cluster hardware (spec,
> >> ram + number nodes)?
> >>
> >> What did you find sped it up?
> >>
> >> How many columns per family were you using and did this affect much
> >> (presumably less mean fewer region splits right?)
> >>
> >> The reason I ask is I have around 50G in tab file (representing 162M
> >> rows from mysql with around 50 fields - strings of <20 chars and int
> >> mostly) and will be loading HBase with this. Once this initial import
> >> is done, I will then harvest XML and Tab files into HBase directly
> >> (storing the raw XML record or tab file row as well).
> >> I am in early testing (awaiting hardware and fed up using EC2) so
> >> still running code on laptop and small tests. I have 6 dell boxes (2
> >> proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
> >> performance I will get.
> >>
> >> Thanks,
> >>
> >> Tim
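
For comparing the load figures quoted in this thread on one scale, here is a minimal sketch that converts each "N records in M minutes" figure into records per millisecond. The helper name `records_per_ms` and the dictionary labels are made up for illustration; only the record counts and durations come from the messages above, and the map-reduce run is listed at both ends of its "about 70-80 minutes" range.

```python
# Convert the load figures quoted in the thread into records per millisecond.
# The helper name and the labels are illustrative, not taken from the thread.

MS_PER_MINUTE = 60 * 1000


def records_per_ms(records, minutes):
    """Average insert rate in records per millisecond."""
    return records / (minutes * MS_PER_MINUTE)


# (records, minutes) exactly as stated in the quoted messages.
runs = {
    "thrift gateway": (20_000_000, 6 * 60),
    "HBase API #1 (Jython)": (20_000_000, 70),
    "map-reduce, 70 min": (440_000_000, 70),
    "map-reduce, 80 min": (440_000_000, 80),
}

rates = {name: records_per_ms(n, m) for name, (n, m) in runs.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.2f} records/ms")
```

On these figures the thrift gateway works out to a little under 1 record per ms, the first Jython version to a little under 5, and the map-reduce load to roughly 90-105 per ms in aggregate.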
