Thanks all. I will continue as planned: store the XML / tab file rows uncompressed at first, then try the built-in compression, and possibly try compressing the rows myself in the app.
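
For the app-side option, here is a minimal sketch of the kind of thing I have in mind, assuming plain gzip from java.util.zip. The RowCompression class and its method names are hypothetical (not part of HBase), and the sample record is made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip a row in the application before it is written,
// and gunzip it again after it is read back.
final class RowCompression {

    /** Gzip the raw record bytes; the result would be stored as the cell value. */
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream flushes the trailer into the buffer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        }
        return buffer.toByteArray();
    }

    /** Reverse of compress(): recover the original record bytes. */
    static byte[] decompress(byte[] packed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample row, only to exercise the round trip.
        byte[] raw = "<record><id>1</id><name>example</name></record>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed);
        System.out.println("round trip ok: " + Arrays.equals(raw, restored));
        System.out.println("raw=" + raw.length + " bytes, packed=" + packed.length + " bytes");
    }
}
```

The compressed bytes would go in as the cell value in place of the raw text. Gzip adds a fixed header and trailer to every value, so very short rows may not shrink at all; I would measure on real records before committing to this.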

Thanks for the inputs!

On Fri, Jan 16, 2009 at 12:24 AM, Andrew Purtell <[email protected]> wrote:
>> From: tim robertson <[email protected]>
>> > Until compression is super solid, I would be wary of
>> > storing text (xml,html, etc) in hbase due to size
>> > concerns.
>> Hmmm... Where do the indexing guys store their raw
>> harvested records / HTML / whatever then?
>
> Compression is lightly tested. In practice it adds to the
> heap charge as extra byte buffers on the heap allocated for
> {de,re}compression. I was using compression to archive web
> content written to HBase by the Heritrix HBase writer, but
> stopped using it after we ran into OOME issues at compaction.
> The root cause of this was not directly related to
> compression and Stack worked up a fix for 0.19 for that
> cause. I may be ready to try compression again soon.
>
> For us, disk is cheap and we have ~20TB of effective HDFS
> space (after subtracting for replication factor) to back
> our HBase tables. Furthermore we use TTLs to expire content
> after a certain period of time because it is no longer of
> interest then (too out of date). One could use a mapreduce
> task to accomplish the same with deletes -- also triggering/
> scheduling recrawling as needed/wanted.
>
> Anyway, I think what people are saying is just that
> compression's use has been relatively rare on the clusters
> where HBase has been most commonly under test. Something
> to be aware of. Actually your use of it would be valuable
> experience for the whole community.
>
> - Andy
