Hi all,

My name is Claudiu Soroiu. I am new to HBase/Hadoop but not to distributed computing in FT/HA environments, and I see that a lot of the reported issues are related to region server failure.

The main problem I see is related to recovery time after a node failure and to distributed log splitting. After some tuning I managed to reduce the total recovery time to 8 seconds, which fits our needs for the moment.

I have one question: *Why is there only one WAL file per region server and not one WAL per region?* I haven't found the exact answer anywhere, which is why I'm asking on this list; please point me in the right direction if this is the wrong list.

My point is that eliminating the need to split a log after a failure would reduce downtime for the regions. The only remaining delay would be transferring data over the network to the region servers that take over the failed regions. This is feasible only if having multiple WALs per region server does not hurt overall write performance.

Thanks,
Claudiu
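To make the log-splitting point concrete, here is a minimal Java sketch under a deliberately simplified model: a shared WAL is treated as an ordered list of edits, each tagged with the region it belongs to. `WalSplitSketch`, `WalEntry` and `splitByRegion` are hypothetical names for illustration only, not HBase classes and not HBase's actual splitting code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a shared write-ahead log. This is NOT the HBase API,
// only an illustration of why one WAL per region server has to be split.
final class WalSplitSketch {

    // One logged edit; regionName and sequenceId are illustrative fields.
    record WalEntry(String regionName, long sequenceId, String edit) {}

    // With a single shared WAL, edits for all regions on the server are
    // interleaved in write order. Before the regions can be reopened elsewhere,
    // the log is grouped into one list per region. The loop keeps the original
    // order within each region, so each list can be replayed in sequence.
    static Map<String, List<WalEntry>> splitByRegion(List<WalEntry> sharedWal) {
        Map<String, List<WalEntry>> perRegion = new LinkedHashMap<>();
        for (WalEntry e : sharedWal) {
            perRegion.computeIfAbsent(e.regionName(), k -> new ArrayList<>()).add(e);
        }
        return perRegion;
    }

    public static void main(String[] args) {
        List<WalEntry> sharedWal = List.of(
                new WalEntry("regionA", 1, "put row1"),
                new WalEntry("regionB", 2, "put row7"),
                new WalEntry("regionA", 3, "delete row1"),
                new WalEntry("regionC", 4, "put row9"));
        // Each key is a region whose edits a surviving server could replay on its own.
        splitByRegion(sharedWal).forEach((region, edits) ->
                System.out.println(region + " -> " + edits.size() + " edit(s)"));
    }
}
```

Under this model, a WAL per region would already be one list per region, so the grouping step would not be needed at recovery time. The likely cost is more log files and syncs on the write path, which is exactly the write-performance concern raised above.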
