Powerset (and others) are profligate and store the content uncompressed (I can feel Ryan 'wincing').

General message on compressed data is that it's lightly tested. There may be issues yet to surface. Be wary. If you trip over any, surface them so we can get them fixed, especially as Ryan and others are starting to report higher throughput when data is compressed (makes sense).

Thanks 'lads',
St.Ack


Ryan Rawson wrote:
At the user meeting last night, stack noted that since lots of us "lads" are
noting performance improvement on random-read when we use compression, that
perhaps a fresh look at making compression solid would be a good thing.

Personally I am just obsessed with on-disk efficiency.  But also, I am
chasing after random-read latencies so I can serve a website out
of hbase... if that isn't your need, then perhaps what you want to do would
be just fine as is?

-ryan

On Thu, Jan 15, 2009 at 2:11 PM, tim robertson <[email protected]>wrote:

Until compression is super solid, I would be wary of storing text (xml,
html, etc) in hbase due to size concerns.
Hmmm... Where do the indexing guys store their raw harvested records /
HTML / whatever then?

I guess mine would be coming in at 200G as text or so, per 100M
records (maybe looking to 1 billion records over next 24 months).  Can
someone suggest a better place to store the records if not HBase?  I
want to be able to serve them as cached records, and also use them as
sources for new indexes, without harvesting again.  This is classic
use case of HBase I thought... I mean, it is even on the HBase
architecture page as the example table structure:
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture.  Bit surprised
to hear it is not recommended use.
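
For a rough sense of scale, the figures quoted above can be turned into a per-record size and a projection. This is only back-of-envelope arithmetic, assuming decimal units, a uniform record size, and uncompressed storage; the helper names are made up for illustration.

```python
# Back-of-envelope sizing from the figures quoted above.
# Assumptions: decimal gigabytes, uniform record size, no compression.
GB = 10**9
TB = 10**12

def bytes_per_record(total_bytes, records):
    """Average size of one record."""
    return total_bytes / records

def projected_bytes(per_record, records):
    """Total size if every record has the average size."""
    return per_record * records

per_record = bytes_per_record(200 * GB, 100_000_000)
projected = projected_bytes(per_record, 1_000_000_000)

print("%.0f bytes per record" % per_record)
print("%.1f TB for 1 billion records" % (projected / TB))
```

The projection scales linearly, so any compression ratio achieved on the text would reduce it by the same factor.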

Cheers for pointers and sorry for the question bombardment - just
trying to catch up.

Tim







On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
I think you were referring to my presentation.

I was importing a CSV file, of 6 integers.  Obviously in the CSV file, the
integers were their ASCII representation.  So my code had to atoi() the
strings, then pack them into Thrift records, serialize those, and finally
insert the binary thrift rep into hbase with a key.
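
The parse-then-pack step described above can be sketched roughly as follows. This is not the Thrift serialization used in the original import: it swaps in Python's standard `struct` module to keep the example self-contained, and the 32-bit field width, the function names, and the sample line are all assumptions.

```python
import struct

# Six signed 32-bit big-endian integers per record.
# The field width is an assumption; the original used Thrift records.
RECORD = struct.Struct(">6i")

def pack_csv_line(line):
    """Parse one CSV line of six integers into a fixed-width binary record."""
    fields = [int(f) for f in line.strip().split(",")]
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return RECORD.pack(*fields)

def unpack_record(blob):
    """Inverse of pack_csv_line: binary record back to a list of ints."""
    return list(RECORD.unpack(blob))

line = "1200345,17,-4,99999,0,2147483647\n"   # made-up sample row
blob = pack_csv_line(line)
print(len(line), "bytes as text ->", len(blob), "bytes packed")
```

A fixed-width layout like this cannot be extended without breaking old readers, which is one reason to prefer Thrift or protobuf records for cell contents.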

I had 3 versions:
- thrift gateway - this was the slowest, doing 20m records in 6 hours.  The
init code looks like:
   transport = TSocket.TSocket(hbaseMaster, hbasePort)
   transport = TTransport.TBufferedTransport(transport)
   protocol = TBinaryProtocol.TBinaryProtocol(transport)
   client = Hbase.Client(protocol)
   transport.open()

So using buffered transport, but no specific hbase API calls to set auto
flush or other params. This is in CPython.

- HBase API version #1:
Written in Jython, this is substantially faster, doing 20m records in 70
minutes, or 4 per ms.  This performance scales up to at least 6
processes.
- HBase API version #2:
Slightly smarter, I now call:
table.setAutoFlush(False)
table.setWriteBufferSize(1024*1024*12)

And my speed jumps up to between 30-50 inserts per ms, scaling to at least 6
concurrent processes.
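
Why turning off auto flush and enlarging the write buffer helps can be illustrated with a toy model: puts accumulate on the client and go out in one batch once the buffered bytes reach a threshold. This is a plain-Python sketch, not the HBase client; `BufferedWriter` and its `send` callback are hypothetical names, and the tiny buffer size is chosen only to make the batching visible.

```python
class BufferedWriter:
    """Toy client-side write buffer: one 'round trip' per flush, not per put."""

    def __init__(self, send, buffer_size=12 * 1024 * 1024):
        self.send = send                # callable taking a list of (key, value)
        self.buffer_size = buffer_size  # flush threshold in bytes
        self.pending = []
        self.pending_bytes = 0

    def put(self, key, value):
        # Queue the write locally; only flush once enough bytes have piled up.
        self.pending.append((key, value))
        self.pending_bytes += len(key) + len(value)
        if self.pending_bytes >= self.buffer_size:
            self.flush()

    def flush(self):
        # Send everything queued so far as a single batch.
        if self.pending:
            self.send(self.pending)
            self.pending = []
            self.pending_bytes = 0

round_trips = []                        # each element is one batch "sent"
writer = BufferedWriter(round_trips.append, buffer_size=1024)
for i in range(1000):
    writer.put(b"row%06d" % i, b"x" * 24)
writer.flush()                          # push whatever is left in the buffer
print(len(round_trips), "round trips for 1000 puts")
```

The trade-off is that buffered writes are lost if the client dies before flushing, so a final explicit flush is needed at the end of the load.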

I then rewrote this stuff into a map-reduce and I can now insert 440m
records in about 70-80 minutes.
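
To put the runs above on a common footing, the quoted record counts and durations can be converted to records per millisecond. This is just arithmetic on the numbers in this message; the function name is made up, and the map-reduce figure is shown for both ends of the 70-80 minute range.

```python
def records_per_ms(records, minutes):
    """Average insert rate over the whole run."""
    return records / (minutes * 60 * 1000)

runs = [
    ("thrift gateway",     20_000_000, 6 * 60),
    ("HBase API #1",       20_000_000, 70),
    ("map-reduce, 70 min", 440_000_000, 70),
    ("map-reduce, 80 min", 440_000_000, 80),
]
for name, records, minutes in runs:
    print("%-20s %6.1f records/ms" % (name, records_per_ms(records, minutes)))
```

The HBase API #1 figure comes out a little under 5 per ms, in line with the rounded "4 per ms" quoted above.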

As I move forward, I will be emulating bigtable and using either thrift
serialized records or protobufs to store data in cells.  This allows you to
extend data within individual cells in a forward/backward compatible way.
Until compression is super solid, I would be wary of storing text (xml, html,
etc) in hbase due to size concerns.


The hardware:
- 4 cpu, 128 gb ram
- 1 tb disk

Here are some relevant configs:
hbase-env.sh:
export HBASE_HEAPSIZE=5000

hadoop-site.xml:
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>

<property>
<name>dfs.datanode.max.xcievers</name>
<value>2047</value>
</property>

<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>






On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
<[email protected]>wrote:

Hi all,

I was skyping in yesterday from Europe.
Being half asleep and on a bad wireless, it was not too easy to hear
sometimes, and I have some quick questions to the person who was
describing his tab file (CSV?) loading at the beginning.

Could you please summarise quickly again the stats you mentioned?
Number of rows, file size before loading (was it 7 Strings per row?),
size after load, time to load, etc.

Also, could you please quickly summarise your cluster hardware (spec,
ram + number nodes)?

What did you find sped it up?

How many columns per family were you using, and did this affect much
(presumably fewer columns mean fewer region splits, right?)

The reason I ask is I have around 50G in tab file (representing 162M
rows from mysql with around 50 fields - strings of <20 chars and int
mostly) and will be loading HBase with this.  Once this initial import
is done, I will then harvest XML and Tab files into HBase directly
(storing the raw XML record or tab file row as well).
I am in early testing (awaiting hardware and fed up using EC2) so
still running code on laptop and small tests.  I have 6 dell boxes (2
proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
performance I will get.

Thanks,

Tim


