Re: Question to speaker (tab file loading) at yesterdays user group

Ryan Rawson Thu, 15 Jan 2009 14:16:38 -0800

At the user meeting last night, stack noted that since lots of us "lads" are
seeing performance improvements on random reads when we use compression,
perhaps a fresh look at making compression solid would be a good thing.


Personally I am just obsessed with on-disk efficiency.  But I am also
chasing low random-read latencies so I can serve a website out of hbase...
if that isn't what you need, then perhaps what you want to do would be just
fine as is?

-ryan

On Thu, Jan 15, 2009 at 2:11 PM, tim robertson <[email protected]>wrote:

> > Until compression is super solid, I would be wary of storing text (xml,
> > html, etc) in hbase due to size concerns.
>
> Hmmm... Where do the indexing guys store their raw harvested records /
> HTML / whatever then?
>
> I guess mine would be coming in at 200G as text or so, per 100M
> records (maybe looking at 1 billion records over the next 24 months).  Can
> someone suggest a better place to store the records if not HBase?  I
> want to be able to serve them as cached records, and also use them as
> sources for new indexes, without harvesting again.  This is a classic
> use case for HBase, I thought... I mean, it is even on the HBase
> architecture page as the example table structure:
> http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture.  Bit surprised
> to hear it is not a recommended use.
>
> Cheers for pointers and sorry for the question bombardment - just
> trying to catch up.
>
> Tim
>
> On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
> > I think you were referring to my presentation.
> >
> > I was importing a CSV file with 6 integers per row.  Obviously in the
> > CSV file, the integers were in their ASCII representation.  So my code
> > had to atoi() the strings, then pack them into Thrift records, serialize
> > those, and finally insert the binary Thrift representation into hbase
> > with a key.
> >
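
The atoi-and-pack step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `struct` module stands in for Thrift serialization (it is not Thrift's wire format), the fixed layout of six signed 32-bit integers is assumed, and the names `ROW_FORMAT`, `pack_csv_line` and `unpack_row` are hypothetical.

```python
import struct

# Assumed layout: six signed 32-bit big-endian integers per row.
# This is a stand-in for the Thrift record, not Thrift's wire format.
ROW_FORMAT = struct.Struct(">6i")


def pack_csv_line(line):
    """Parse one CSV line of six ASCII integers and return packed bytes."""
    fields = line.strip().split(",")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    values = [int(field) for field in fields]  # the atoi() step
    return ROW_FORMAT.pack(*values)


def unpack_row(blob):
    """Inverse of pack_csv_line: return the six integers as a tuple."""
    return ROW_FORMAT.unpack(blob)


if __name__ == "__main__":
    sample = "1000000,2000000,3000000,4000000,5000000,6000000"
    blob = pack_csv_line(sample)
    print("%d bytes as text, %d bytes packed" % (len(sample), len(blob)))
```

With this assumed layout every row packs to the same fixed size no matter how many digits the ASCII form had, which is the on-disk saving the binary representation is after.
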
> > I had 3 versions:
> > - thrift gateway - this was the slowest, doing 20m records in 6 hours.
> > The init code looks like:
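> >    from thrift.transport import TSocket, TTransport
> >    from thrift.protocol import TBinaryProtocol
> >    from hbase import Hbase  # module generated from Hbase.thrift
> >
> >    # hbaseMaster and hbasePort: host and port the Thrift gateway listens on
> >    transport = TSocket.TSocket(hbaseMaster, hbasePort)
> >    transport = TTransport.TBufferedTransport(transport)
> >    protocol = TBinaryProtocol.TBinaryProtocol(transport)
> >    client = Hbase.Client(protocol)
> >    transport.open()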
> >
> > So using buffered transport, but no specific hbase API calls to set auto
> > flush or other params. This is in CPython.
> >
> > - HBase API version #1:
> > Written in Jython, this is substantially faster, doing 20m records in 70
> > minutes, or between 4 and 5 per ms.  This performance scales up to at
> > least 6 processes.
> >
> > - HBase API version #2:
> > Slightly smarter, I now call:
> > table.setAutoFlush(False)
> > table.setWriteBufferSize(1024*1024*12)
> >
> > And my speed jumps up to between 30 and 50 inserts per ms, scaling to at
> > least 6 concurrent processes.
> >
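
A rough way to see why turning off auto flush helps: with auto flush on, each put is shipped on its own, while a write buffer lets the client ship many puts in one call. The toy model below is a sketch under stated assumptions, not the HBase client. `BufferedWriter`, `count_round_trips` and the 35-byte record size are all hypothetical, and the real client does more than count bytes.

```python
class BufferedWriter(object):
    """Toy model of a client-side write buffer (hypothetical, not the HBase API)."""

    def __init__(self, send_batch, buffer_bytes):
        self.send_batch = send_batch      # callable that ships a list of puts
        self.buffer_bytes = buffer_bytes  # flush threshold in bytes
        self.pending = []
        self.pending_bytes = 0

    def put(self, key, value):
        """Queue one put and flush once the buffer threshold is reached."""
        self.pending.append((key, value))
        self.pending_bytes += len(key) + len(value)
        if self.pending_bytes >= self.buffer_bytes:
            self.flush()

    def flush(self):
        """Ship everything queued so far in a single call."""
        if self.pending:
            self.send_batch(self.pending)
            self.pending = []
            self.pending_bytes = 0


def count_round_trips(n_records, buffer_bytes):
    """Return how many batches are shipped for n_records puts of 35 bytes each."""
    calls = []
    writer = BufferedWriter(lambda batch: calls.append(len(batch)), buffer_bytes)
    for i in range(n_records):
        key = ("row%08d" % i).encode("ascii")  # 11-byte key
        writer.put(key, b"x" * 24)             # 24-byte value
    writer.flush()  # ship whatever is still buffered
    return len(calls)


if __name__ == "__main__":
    for size in (1, 350, 12 * 1024 * 1024):
        print("buffer=%d bytes: %d calls" % (size, count_round_trips(1000, size)))
```

A one-byte threshold behaves like auto flush (one call per put), while the 12 MB buffer from the message above holds all 1000 toy records and ships them in a single call.
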
> > I then rewrote this stuff as a map-reduce job, and I can now insert 440m
> > records in about 70-80 minutes.
> >
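
To compare the runs on one scale, the quoted totals can be converted to records per millisecond. This is just arithmetic on the numbers given above; the helper name `records_per_ms` is hypothetical.

```python
def records_per_ms(records, minutes):
    """Average insert rate in records per millisecond."""
    return records / (minutes * 60.0 * 1000.0)


if __name__ == "__main__":
    runs = [
        ("thrift gateway (20m in 6 hours)", 20000000, 6 * 60),
        ("HBase API #1 (20m in 70 minutes)", 20000000, 70),
        ("map-reduce, slow end (440m in 80 minutes)", 440000000, 80),
        ("map-reduce, fast end (440m in 70 minutes)", 440000000, 70),
    ]
    for label, records, minutes in runs:
        print("%-45s %7.2f records/ms" % (label, records_per_ms(records, minutes)))
```

On these figures the Thrift gateway run averages a little under 1 record per ms, the first API version between 4 and 5, and the map-reduce job roughly 90 to 105.
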
> > As I move forward, I will be emulating bigtable and using either
> > thrift-serialized records or protobufs to store data in cells.  This
> > allows you to extend data within individual cells in a forward- and
> > backward-compatible way.  Until compression is super solid, I would be
> > wary of storing text (xml, html, etc) in hbase due to size concerns.
> >
> >
> > The hardware:
> > - 4 cpu, 128 gb ram
> > - 1 tb disk
> >
> > Here are some relevant configs:
> > hbase-env.sh:
> > export HBASE_HEAPSIZE=5000
> >
> > hadoop-site.xml:
> > <property>
> > <name>dfs.datanode.socket.write.timeout</name>
> > <value>0</value>
> > </property>
> >
> > <property>
> > <name>dfs.datanode.max.xcievers</name>
> > <value>2047</value>
> > </property>
> >
> > <property>
> > <name>dfs.datanode.handler.count</name>
> > <value>10</value>
> > </property>
> >
> >
> > On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
> > <[email protected]>wrote:
> >
> >> Hi all,
> >>
> >> I was skyping in yesterday from Europe.
> >> Being half asleep and on a bad wireless, it was not too easy to hear
> >> sometimes, and I have some quick questions to the person who was
> >> describing his tab file (CSV?) loading at the beginning.
> >>
> >> Could you please quickly summarise the stats you mentioned again?
> >> Number of rows, file size before loading, was it 7 Strings per row?,
> >> size after load, time to load, etc.
> >>
> >> Also, could you please quickly summarise your cluster hardware (spec,
> >> ram + number nodes)?
> >>
> >> What did you find sped it up?
> >>
> >> How many columns per family were you using and did this affect much
> >> (presumably fewer means fewer region splits, right?)
> >>
> >> The reason I ask is I have around 50G in a tab file (representing 162M
> >> rows from mysql with around 50 fields - mostly strings of <20 chars and
> >> ints) and will be loading HBase with this.  Once this initial import
> >> is done, I will then harvest XML and tab files into HBase directly
> >> (storing the raw XML record or tab file row as well).
> >> I am in early testing (awaiting hardware and fed up with using EC2) so
> >> I am still running code on a laptop with small tests.  I have 6 dell
> >> boxes (2 proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder
> >> what performance I will get.
> >>
> >> Thanks,
> >>
> >> Tim
> >>
> >
>
