> Until compression is super solid, I would be wary of storing text (xml, html,
> etc) in hbase due to size concerns.

Hmmm... Where do the indexing guys store their raw harvested records /
HTML / whatever then?

I guess mine would come in at around 200G as text per 100M
records (maybe growing to 1 billion records over the next 24 months).  Can
someone suggest a better place to store the records if not HBase?  I
want to be able to serve them as cached records, and also use them as
sources for new indexes, without harvesting again.  I thought this was
a classic use case for HBase... I mean, it is even on the HBase
architecture page as the example table structure:
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
A bit surprised to hear it is not a recommended use.
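
As a rough back-of-envelope check on the figures above (200G per 100M
records, growing to 1 billion), the sketch below works out the average
record size and the raw footprint.  The replication factor of 3 is an
assumption (the HDFS default), the function name is made up for
illustration, and it ignores compression and any per-cell key overhead.

```python
def estimate_storage(records, bytes_per_record, replication=3):
    """Return (raw_bytes, replicated_bytes) for uncompressed text.

    replication=3 is an assumption (the HDFS default), not a figure
    from this thread.
    """
    raw = records * bytes_per_record
    return raw, raw * replication


# 200G of text per 100M records, using decimal gigabytes.
bytes_per_record = (200 * 10**9) // (100 * 10**6)

raw, replicated = estimate_storage(10**9, bytes_per_record)

print("average record size:", bytes_per_record, "bytes")
print("1 billion records, raw:", raw / 10**12, "TB")
print("1 billion records, replicated:", replicated / 10**12, "TB")
```

So the question is really about roughly 2 KB per record and a couple of
terabytes of raw text before replication.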

Cheers for pointers and sorry for the question bombardment - just
trying to catch up.

Tim







On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
> I think you were referring to my presentation.
>
> I was importing a CSV file, of 6 integers.  Obviously in the CSV file, the
> integers were their ASCII representation.  So my code had to atoi() the
> strings, then pack them into Thrift records, serialize those, and finally
> insert the binary thrift rep into hbase with a key.
>
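
The parse-and-pack step described above can be sketched roughly as
follows.  This is only an illustration: it uses Python's struct module
as a stand-in for Thrift serialization, and the function names and the
fixed 32-bit big-endian layout are assumptions, not the code from the
presentation.

```python
import struct

# Six big-endian signed 32-bit integers: 24 bytes per record.
# (Assumed layout; real Thrift encoding adds its own field headers.)
RECORD = struct.Struct(">6i")


def encode_row(csv_line):
    """Parse one CSV line of six ASCII integers into a binary blob."""
    fields = [int(f) for f in csv_line.strip().split(",")]
    if len(fields) != 6:
        raise ValueError("expected 6 integers, got %d" % len(fields))
    return RECORD.pack(*fields)


def decode_row(blob):
    """Inverse of encode_row: binary blob back to a list of six ints."""
    return list(RECORD.unpack(blob))


line = "1,22,333,4444,55555,666666\n"
blob = encode_row(line)
print(len(line), "bytes as text,", len(blob), "bytes packed")
```

The packed value would then be what gets stored in the cell under the
row key.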
> I had 3 versions:
> - thrift gateway - this was the slowest, doing 20m records in 6 hours.  The
> init code looks like:
>    transport = TSocket.TSocket(hbaseMaster, hbasePort)
>    transport = TTransport.TBufferedTransport(transport)
>    protocol = TBinaryProtocol.TBinaryProtocol(transport)
>    client = Hbase.Client(protocol)
>    transport.open()
>
> So using buffered transport, but no specific hbase API calls to set auto
> flush or other params. This is in CPython.
>
> - HBase API version #1:
> Written in Jython, this is substantially faster, doing 20m records in 70
> minutes, or 4 per ms.  This performance scales up to at least 6 processes.
>
> - HBase API version #2:
> Slightly smarter, I now call:
> table.setAutoFlush(False)
> table.setWriteBufferSize(1024*1024*12)
>
> And my speed jumps up to between 30-50 inserts per ms, scaling to at least 6
> concurrent processes.
>
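
To illustrate why turning off auto flush and enlarging the write buffer
helps, here is a toy model of a client-side write buffer.  It is not the
HBase API: the BufferedTable class, its fields and the record sizes are
all hypothetical, and it only counts how many times the client would
have to call the server.

```python
class BufferedTable:
    """Toy stand-in for a client-side write buffer (not the HBase API)."""

    def __init__(self, write_buffer_size, auto_flush=True):
        self.write_buffer_size = write_buffer_size
        self.auto_flush = auto_flush
        self.pending = []
        self.pending_bytes = 0
        self.round_trips = 0  # number of simulated calls to the server

    def put(self, key, value):
        self.pending.append((key, value))
        self.pending_bytes += len(key) + len(value)
        # With auto flush on, every put is sent immediately; otherwise
        # puts accumulate until the buffer threshold is reached.
        if self.auto_flush or self.pending_bytes >= self.write_buffer_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.round_trips += 1
            self.pending = []
            self.pending_bytes = 0


def count_round_trips(n_records, write_buffer_size, auto_flush):
    table = BufferedTable(write_buffer_size, auto_flush)
    for i in range(n_records):
        # 11-byte key plus 24-byte value: 35 bytes per record (assumed).
        table.put(b"row%08d" % i, b"x" * 24)
    table.flush()  # send whatever is left in the buffer
    return table.round_trips


print(count_round_trips(1000, 1024 * 1024 * 12, auto_flush=True))
print(count_round_trips(1000, 1024 * 1024 * 12, auto_flush=False))
```

Fewer, larger batches mean fewer round trips, which is consistent with
the jump in insert rate reported above.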
> I then rewrote this stuff into a map-reduce and I can now insert 440m
> records in about 70-80 minutes.
>
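
To put the quoted timings on a common scale, this small sketch converts
each run to records per millisecond.  The record counts and durations
are the ones quoted above; the labels and the helper name are only for
illustration.

```python
def records_per_ms(records, minutes):
    """Average insert rate in records per millisecond."""
    return records / (minutes * 60 * 1000)


# (label, records inserted, duration in minutes), as quoted in the thread.
runs = [
    ("thrift gateway (CPython)", 20_000_000, 6 * 60),
    ("HBase API #1 (Jython)", 20_000_000, 70),
    ("map-reduce, 70 min", 440_000_000, 70),
    ("map-reduce, 80 min", 440_000_000, 80),
]

for name, records, minutes in runs:
    rate = records_per_ms(records, minutes)
    print("%-26s %7.1f records/ms" % (name, rate))
```

By this arithmetic the thrift gateway run is a little under 1 record
per ms, the first HBase API run is closer to 5 per ms than the quoted 4,
and the map-reduce run lands at roughly 90-105 per ms overall.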
> As I move forward, I will be emulating bigtable and using either thrift
> serialized records or protobufs to store data in cells.  This allows you to
> extend data within individual cells in a forward/backward-compatible way.  Until
> compression is super solid, I would be wary of storing text (xml, html, etc)
> in hbase due to size concerns.
>
>
> The hardware:
> - 4 cpu, 128 gb ram
> - 1 tb disk
>
> Here are some relevant configs:
> hbase-env.sh:
> export HBASE_HEAPSIZE=5000
>
> hadoop-site.xml:
> <property>
> <name>dfs.datanode.socket.write.timeout</name>
> <value>0</value>
> </property>
>
> <property>
> <name>dfs.datanode.max.xcievers</name>
> <value>2047</value>
> </property>
>
> <property>
> <name>dfs.datanode.handler.count</name>
> <value>10</value>
> </property>
>
>
>
>
>
>
> On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
> <[email protected]> wrote:
>
>> Hi all,
>>
>> I was skyping in yesterday from Europe.
>> Being half asleep and on a bad wireless connection, it was not always
>> easy to hear, and I have some quick questions for the person who was
>> describing his tab file (CSV?) loading at the beginning.
>>
>> Could you please quickly summarise the stats you mentioned again?
>> Number of rows, file size before loading (was it 7 Strings per row?),
>> size after load, time to load, etc.
>>
>> Also, could you please quickly summarise your cluster hardware (spec,
>> ram + number nodes)?
>>
>> What did you find sped it up?
>>
>> How many columns per family were you using, and did this affect much
>> (presumably fewer means fewer region splits, right?)
>>
>> The reason I ask is I have around 50G in tab file (representing 162M
>> rows from mysql with around 50 fields - strings of <20 chars and int
>> mostly) and will be loading HBase with this.  Once this initial import
>> is done, I will then harvest XML and Tab files into HBase directly
>> (storing the raw XML record or tab file row as well).
>> I am in early testing (awaiting hardware and fed up using EC2) so
>> still running code on laptop and small tests.  I have 6 dell boxes (2
>> proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
>> performance I will get.
>>
>> Thanks,
>>
>> Tim
>>
>
