> Until compression is super solid, I would be wary of storing text (xml, html,
> etc) in hbase due to size concerns.

Hmmm... Where do the indexing guys store their raw harvested records / HTML /
whatever then? I guess mine would be coming in at around 200G as text per 100M
records (maybe growing to 1 billion records over the next 24 months).

Can someone suggest a better place to store the records if not HBase? I want
to be able to serve them as cached records, and also use them as sources for
new indexes, without harvesting again. I thought this was a classic use case
for HBase... I mean, it is even on the HBase architecture page as the example
table structure: http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture. Bit
surprised to hear it is not a recommended use.

Cheers for pointers and sorry for the question bombardment - just trying to
catch up.

Tim

On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
> I think you were referring to my presentation.
>
> I was importing a CSV file, of 6 integers. Obviously in the CSV file, the
> integers were their ASCII representation. So my code had to atoi() the
> strings, then pack them into Thrift records, serialize those, and finally
> insert the binary thrift rep into hbase with a key.
>
> I had 3 versions:
> - thrift gateway - this was the slowest, doing 20m records in 6 hours. The
> init code looks like:
> transport = TSocket.TSocket(hbaseMaster, hbasePort)
> transport = TTransport.TBufferedTransport(transport)
> protocol = TBinaryProtocol.TBinaryProtocol(transport)
> client = Hbase.Client(protocol)
> transport.open()
>
> So using buffered transport, but no specific hbase API calls to set auto
> flush or other params. This is in CPython.
>
> - HBase API version #1:
> Written in Jython, this is substantially faster, doing 20m records in 70
> minutes, or 4 per ms. This performance scales up to at least 6 processes.
>
> - HBase API version #2:
> Slightly smarter, I now call:
> table.setAutoFlush(False)
> table.setWriteBufferSize(1024*1024*12)
>
> And my speed jumps up to between 30 and 50 inserts per ms, scaling to at
> least 6 concurrent processes.
>
> I then rewrote this stuff into a map-reduce and I can now insert 440m
> records in about 70-80 minutes.
>
> As I move forward, I will be emulating bigtable and using either thrift
> serialized records or protobufs to store data in cells. This allows you to
> extend data within individual cells in a forward/backward compatible way.
> Until compression is super solid, I would be wary of storing text (xml, html,
> etc) in hbase due to size concerns.
>
> The hardware:
> - 4 cpu, 128 gb ram
> - 1 tb disk
>
> Here are some relevant configs:
> hbase-env.sh:
> export HBASE_HEAPSIZE=5000
>
> hadoop-site.xml:
> <property>
> <name>dfs.datanode.socket.write.timeout</name>
> <value>0</value>
> </property>
>
> <property>
> <name>dfs.datanode.max.xcievers</name>
> <value>2047</value>
> </property>
>
> <property>
> <name>dfs.datanode.handler.count</name>
> <value>10</value>
> </property>
>
> On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
> <[email protected]> wrote:
>
>> Hi all,
>>
>> I was skyping in yesterday from Europe.
>> Being half asleep and on a bad wireless, it was not too easy to hear
>> sometimes, and I have some quick questions for the person who was
>> describing his tab file (CSV?) loading at the beginning.
>>
>> Could you please summarise quickly again the stats you mentioned?
>> Number of rows, file size before loading, was it 7 strings per row?,
>> size after load, time to load etc
>>
>> Also, could you please quickly summarise your cluster hardware (spec,
>> ram + number of nodes)?
>>
>> What did you find sped it up?
>>
>> How many columns per family were you using and did this affect much
>> (presumably fewer means fewer region splits, right?)
>>
>> The reason I ask is I have around 50G in a tab file (representing 162M
>> rows from mysql with around 50 fields - strings of <20 chars and ints
>> mostly) and will be loading HBase with this. Once this initial import
>> is done, I will then harvest XML and tab files into HBase directly
>> (storing the raw XML record or tab file row as well).
>> I am in early testing (awaiting hardware and fed up with using EC2) so
>> still running code on a laptop and small tests. I have 6 dell boxes (2
>> proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
>> performance I will get.
>>
>> Thanks,
>>
>> Tim
>>
>
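
To put the figures quoted in this thread on one scale, here is a minimal
sketch in Python (the language of the snippets above). It only restates
numbers from the messages. Two things are assumptions: the 75-minute value is
a midpoint chosen for "70-80 minutes", and "G" is read as decimal gigabytes.
The names RUNS and records_per_ms are made up for this illustration.

```python
# Load figures as quoted in the thread: (records loaded, wall-clock minutes).
RUNS = {
    "thrift gateway (CPython)": (20000000, 6 * 60),
    "HBase API #1 (Jython)": (20000000, 70),
    # "70-80 minutes" in the thread; 75 is an assumed midpoint.
    "map-reduce import": (440000000, 75),
}


def records_per_ms(records, minutes):
    """Convert a (records, minutes) pair into records per millisecond."""
    return records / float(minutes * 60 * 1000)


rates = dict((name, records_per_ms(*run)) for name, run in RUNS.items())

for name in sorted(rates, key=rates.get):
    print("%-26s %6.1f records/ms" % (name, rates[name]))

# Size estimate from the reply: about 200G of text per 100M records.
# Assumes decimal units (1G = 10**9 bytes), which the thread does not state.
bytes_per_record = 200 * 10**9 / float(100 * 10**6)
projected_tb = bytes_per_record * 10**9 / 10**12  # at 1 billion records

print("about %d bytes per record, about %.1f TB of raw text at 1 billion records"
      % (bytes_per_record, projected_tb))
```

On these assumptions the thrift gateway works out to a little under 1 record
per ms, the first Jython version to a little under 5 per ms (quoted above as
"4 per ms"), and the map-reduce import to just under 100 per ms. The raw text
would average about 2 KB per record, so roughly 2 TB uncompressed at 1 billion
records, before any HBase or HDFS replication overhead.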
