Jan,

I apologize for the delayed response.

1.  Did you realize that the "log_jan.txt" file from #1 below documents a hard 
disk failure?  You mentioned a failed drive once.  I am not sure if this is the 
same drive.


2.  The "sse4_2" flag tells me that your Intel CPU supports hardware CRC32c 
calculation.  This feature is not useful to you at this moment (unless you want 
to pull the mv-hardware-crc branch from basho/leveldb).  It will bring some 
performance improvement in the next release IF we do not decide that 
your problems are hard disk performance limited.
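For context, here is a minimal pure-Python sketch of the CRC32c (Castagnoli) 
checksum that the SSE4.2 crc32 instruction accelerates. The reflected 
polynomial 0x82F63B78 is the standard CRC32c definition; this is an 
illustration of the checksum itself, not code from the basho/leveldb branch:

```python
# Bit-by-bit CRC32c (Castagnoli polynomial, reflected form 0x82F63B78).
# This is the same checksum the SSE4.2 "crc32" instruction computes in
# hardware; leveldb uses CRC32c to verify blocks it reads back from disk.
def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value: CRC32c of "123456789" is 0xE3069283.
assert crc32c(b"123456789") == 0xE3069283
```

The hardware instruction processes several bytes per cycle instead of one bit 
per loop iteration, which is where the performance win comes from.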


3.  This just confirmed for me that app.config is not accidentally exhausting 
your physical memory.  The app.config file you posted suggested this was not 
the case, but I wanted to verify.

You also discuss a basho_bench failure.  Is this the same test run as the 
log_jan.txt file?  The hard drives had their first failure at:

2012/10/18-02:08:44.136238 7f8297fff700 Compaction error: Corruption: corrupted 
compressed block contents

And things go really bad at:

2012/10/18-06:10:37.657072 7f829effd700 Moving corrupted block to 
lost/BLOCKS.bad (size 1647)


4.  I was looking to see whether your data was compressing well.  The answer is 
that it is: you are achieving a 2x to 2.6x compression ratio.  Since you are 
concerned about throughput, I was verifying that the time leveldb spends on 
block compression is worthwhile for you (it is).
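If it helps, the ratio above is simply uncompressed size divided by compressed 
size. A rough sketch of that arithmetic, using zlib from the Python standard 
library as a stand-in (leveldb actually uses Snappy by default, and the payload 
below is hypothetical), looks like this:

```python
import zlib

# Hypothetical repetitive payload standing in for Riak object data.
raw = b"bucket/key: some repetitive riak value payload\n" * 200

# Compression ratio = uncompressed bytes / compressed bytes.
compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)

print(f"compression ratio: {ratio:.1f}x")
```

The trade-off is CPU time spent compressing versus bytes written to and read 
from disk; with ratios well above 1x, compression is a net win for a 
disk-bound workload.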


The next question from me is whether the drive / disk array problems are your 
only problem at this point.  The data in log_jan.txt looks fine until the 
failures start.  I am happy to keep working on this, but I need to better 
understand your next level of problems.

Matthew


On Oct 19, 2012, at 7:49 AM, <[email protected]> 
<[email protected]> wrote:

> Hi Matthew,
> 
> big thanks for responding. I see that you are the main committer to Basho's 
> leveldb code. :-)
> 
>> 1.  Execute the following on one of the servers:
>> sort /home/riak/leveldb/*/LOG* >log_jan.txt
> 
> See the attached file log_jan.txt.gz. It is from the stalled Riak node.
> 
>> 2.  Execute the following on one of the servers:
>>  grep -i flags /proc/cpuinfo
> 
> flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat 
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm 
> constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
> aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr 
> pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm arat epb 
> xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
> 
>> 3.  On a running server that is processing data, execute:
>> grep -i swap /proc/meminfo
> 
> I will restart the test and report back when it stalls again. In the meantime,
> I am sending you yesterday's zabbix graph showing memory usage on the node 
> (attached file ZabbixMemory.png). The time when the node stopped responding is
> logged as:
> 
> 2012-10-18 08:28:47.537 [error]  ** Node '[email protected]' not responding **
> 
> I am also attaching the corresponding Basho bench output of the test. The test
> was started on Oct 17 16:38 with an empty database, and it was run on a plain
> ext4 partition (no RAID).
> 
>> 4.  Pick a server, then one directory in /home/riak/leveldb.  Select 3 of 
> the largest *.sst files.  Tar/gzip those and email back.
> 
> I will send them in the next mail. I can also put the entire database 
> somewhere for you for download, if you need it.
> 
> Thanks, Jan<log_jan.txt.gz><ZabbixMemory.png><4Riak_2K_2.1RC2_noraid.png>


_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
