Yeah, I got that for a while as well. Though once online schema change was disabled I could complete the reduce step with 2.5G of heap (still failing :-/).
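For context on the toggle mentioned here: assuming Elliott means the cluster-side setting rather than a chaos-monkey action, online schema change in this era is controlled from hbase-site.xml. A minimal sketch, with the property name as I recall it for the 0.96 line (verify against your release's hbase-default.xml before relying on it):

    <!-- hbase-site.xml: turn off live (online) schema updates -->
    <property>
      <name>hbase.online.schema.update.enable</name>
      <value>false</value>
    </property>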
On Tue, Sep 3, 2013 at 6:30 PM, Devaraj Das <[email protected]> wrote:
> Well, from the test code it seems like the problem is that the reducer
> got unexpected data and was trying to construct the log message for the
> user. So the job had already failed in reality.
>
> On Tue, Sep 3, 2013 at 6:17 PM, Devaraj Das <[email protected]> wrote:
>
>> Elliott, what are the heap sizes of the M/R tasks in your setup? I was
>> running the job like this (without chaosmonkey, to start with):
>>
>> hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList Loop 5 12 2500000 IntegrationTestBigLinkedList 10
>>
>> Even the above test failed, with one reduce task hitting an OOM in the
>> verify step. The heap size was set to 3G.
>>
>> 2013-09-04 01:11:56,054 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
>>         at java.util.Arrays.copyOf(Arrays.java:2882)
>>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>>         at java.lang.StringBuilder.append(StringBuilder.java:119)
>>         at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$VerifyReducer.reduce(IntegrationTestBigLinkedList.java:576)
>>         at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$VerifyReducer.reduce(IntegrationTestBigLinkedList.java:547)
>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
>>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
>>
>> On Tue, Sep 3, 2013 at 12:56 PM, Elliott Clark <[email protected]> wrote:
>>
>>> Can someone take a look at running Test Big Linked for > 5 iterations
>>> with the slowDeterministic chaos monkey on a distributed cluster? I'm
>>> pretty concerned about HBASE-9338.
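The trace Devaraj posted is worth unpacking: the OOM happens inside StringBuilder.append() while the reducer is assembling a diagnostic message, so a key with a huge number of unexpected references can kill the task even though the job has logically already failed. Below is a minimal sketch of that failure mode with a hypothetical cap as mitigation; it is illustrative only, not the actual VerifyReducer from IntegrationTestBigLinkedList:

    import java.io.IOException;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: bound the diagnostic string instead of appending every
    // reference. MAX_REFS_IN_MESSAGE is a hypothetical knob.
    public class BoundedVerifyReducer
        extends Reducer<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {
      private static final int MAX_REFS_IN_MESSAGE = 100;

      @Override
      protected void reduce(BytesWritable key, Iterable<BytesWritable> values,
          Context context) throws IOException, InterruptedException {
        StringBuilder refs = new StringBuilder();
        long count = 0;
        for (BytesWritable ref : values) {
          count++;
          // Without this bound, the append() below is exactly where the
          // posted OutOfMemoryError would surface on a badly corrupt key.
          if (count <= MAX_REFS_IN_MESSAGE) {
            if (refs.length() > 0) refs.append(',');
            refs.append(Bytes.toStringBinary(ref.getBytes(), 0, ref.getLength()));
          }
        }
        if (count != 1) {
          // Count the anomaly, then log a bounded message.
          context.getCounter("Verify", "UNDEFINED").increment(1);
          System.err.println("key "
              + Bytes.toStringBinary(key.getBytes(), 0, key.getLength())
              + " had " + count + " references; first " + MAX_REFS_IN_MESSAGE
              + ": " + refs);
        }
      }
    }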
>>> On Tue, Sep 3, 2013 at 6:57 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>> > There was a typo in my log4j.properties :(
>>> >
>>> > So it's working fine.
>>> >
>>> > The only INFO logs I still see are these:
>>> > 2013-09-03 09:53:07,313 INFO [M:0;t430s:45176] mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
>>> > 2013-09-03 09:53:07,350 INFO [M:0;t430s:45176] mortbay.log: jetty-6.1.26
>>> > But there are only very few of them.
>>> >
>>> > Performance-wise, here are the numbers (the higher, the better; rows
>>> > per second, except for scans where it's rows/min). As you will see,
>>> > 0.96 is slower only for RandomSeekScanTest (way slower) and
>>> > RandomScanWithRange10, but is faster for everything else. I ran the
>>> > tests with the default settings. So I think we should look at
>>> > RandomSeekScanTest; except for this one, everything else is pretty good.
>>> >
>>> > Also, I have been able to reproduce this exception:
>>> > 2013-09-03 09:55:36,718 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
>>> > EndOfStreamException: Unable to read additional data from client sessionid 0x140e4191edb0009, likely client has closed socket
>>> >         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>>> >         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>> >         at java.lang.Thread.run(Thread.java:662)
>>> >
>>> > I just had to run PE and kill it in the middle.
>>> >
>>> > JM
>>> >
>>> > All tests are org.apache.hadoop.hbase.PerformanceEvaluation$*; the last column is 0.96 relative to 0.94:
>>> >
>>> > Test                          0.96.0RC0   0.94.11    0.96/0.94
>>> > FilteredScanTest                 10.28       10.17     101.12%
>>> > RandomReadTest                  966.01      810.58     119.18%
>>> > RandomSeekScanTest               98.50      255.71      38.52%
>>> > RandomWriteTest               39251.17    25682.11     152.83%
>>> > RandomScanWithRange10Test     25844.88    28715.29      90.00%
>>> > RandomScanWithRange100Test    20029.48    18022.39     111.14%
>>> > RandomScanWithRange1000Test    2692.16     2346.85     114.71%
>>> > SequentialReadTest             3002.18     2875.83     104.39%
>>> > SequentialWriteTest           38995.50    26693.23     146.09%
>>> >
>>> > 2013/9/3 Stack <[email protected]>
>>> >
>>> >> On Mon, Sep 2, 2013 at 10:51 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>> >>
>>> >> > I have created:
>>> >> > HBASE-9412
>>> >> > HBASE-9413
>>> >> > HBASE-9414
>>> >> >
>>> >> > I have not been able yet to reproduce the ZK error. I'm trying.
>>> >>
>>> >> Is it when you have a shell connection and then kill it?
>>> >>
>>> >> > Last, I tried, with no success, to set the log level to WARN to
>>> >> > remove all DEBUG and INFO logs. Setting it to WARN removes the DEBUG
>>> >> > lines, but I keep getting the INFO. It seems that something is
>>> >> > setting the log level somewhere else, or it's not read.
>>> >> >
>>> >> > Here is my log4j.properties file. I removed all the custom log
>>> >> > levels to set WARN for org.apache. And it's still showing INFO...
>>> >>
>>> >> You did it by editing log4j and restarting, or in the UI? I think the
>>> >> UI log level setting is broke... (new issue!)
>>> >>
>>> >> Thanks for trying it out JMS.
>>> >>
>>> >> So everything is slower in 0.96?
>>> >> St.Ack
>>> >>
>>> >> > JM
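A note on the log4j.properties quoted below: its last line reads log4j.logger.ore.apache=WARN, which is presumably the typo JM refers to in his Sep 3 follow-up above ("ore" where "org" was intended) and would explain the surviving INFO output. The corrected line would be:

    log4j.logger.org.apache=WARN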
>>> >> >
>>> >> > # Define some default values that can be overridden by system properties
>>> >> > hbase.root.logger=WARN,console
>>> >> > hbase.security.logger=WARN,console
>>> >> > hbase.log.dir=.
>>> >> > hbase.log.file=hbase.log
>>> >> >
>>> >> > # Define the root logger to the system property "hbase.root.logger".
>>> >> > log4j.rootLogger=${hbase.root.logger}
>>> >> >
>>> >> > # Logging Threshold
>>> >> > log4j.threshold=ALL
>>> >> >
>>> >> > #
>>> >> > # Daily Rolling File Appender
>>> >> > #
>>> >> > log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
>>> >> > log4j.appender.DRFA.File=${hbase.log.dir}/${hbase.log.file}
>>> >> >
>>> >> > # Rollover at midnight
>>> >> > log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
>>> >> >
>>> >> > # 30-day backup
>>> >> > #log4j.appender.DRFA.MaxBackupIndex=30
>>> >> > log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
>>> >> >
>>> >> > # Pattern format: Date LogLevel LoggerName LogMessage
>>> >> > log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
>>> >> >
>>> >> > # Rolling File Appender properties
>>> >> > hbase.log.maxfilesize=256MB
>>> >> > hbase.log.maxbackupindex=20
>>> >> >
>>> >> > # Rolling File Appender
>>> >> > log4j.appender.RFA=org.apache.log4j.RollingFileAppender
>>> >> > log4j.appender.RFA.File=${hbase.log.dir}/${hbase.log.file}
>>> >> >
>>> >> > log4j.appender.RFA.MaxFileSize=${hbase.log.maxfilesize}
>>> >> > log4j.appender.RFA.MaxBackupIndex=${hbase.log.maxbackupindex}
>>> >> >
>>> >> > log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
>>> >> > log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
>>> >> >
>>> >> > #
>>> >> > # Security audit appender
>>> >> > #
>>> >> > hbase.security.log.file=SecurityAuth.audit
>>> >> > hbase.security.log.maxfilesize=256MB
>>> >> > hbase.security.log.maxbackupindex=20
>>> >> > log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
>>> >> > log4j.appender.RFAS.File=${hbase.log.dir}/${hbase.security.log.file}
>>> >> > log4j.appender.RFAS.MaxFileSize=${hbase.security.log.maxfilesize}
>>> >> > log4j.appender.RFAS.MaxBackupIndex=${hbase.security.log.maxbackupindex}
>>> >> > log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
>>> >> > log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
>>> >> > log4j.category.SecurityLogger=${hbase.security.logger}
>>> >> > log4j.additivity.SecurityLogger=false
>>> >> > #log4j.logger.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=TRACE
>>> >> >
>>> >> > #
>>> >> > # Null Appender
>>> >> > #
>>> >> > log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender
>>> >> >
>>> >> > #
>>> >> > # console
>>> >> > # Add "console" to rootlogger above if you want to use this
>>> >> > #
>>> >> > log4j.appender.console=org.apache.log4j.ConsoleAppender
>>> >> > log4j.appender.console.target=System.err
>>> >> > log4j.appender.console.layout=org.apache.log4j.PatternLayout
>>> >> > log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
>>> >> >
>>> >> > # Custom Logging levels
>>> >> >
>>> >> > log4j.logger.ore.apache=WARN
>>> >> >
>>> >> > 2013/9/2 Jean-Marc Spaggiari <[email protected]>:
>>> >> > > Hi St.Ack,
>>> >> > >
>>> >> > > I will open the related JIRAs in a few minutes.
>>> >> > >
>>> >> > > Regarding performance, RandomSeekScanTest is way slower, and only
>>> >> > > RandomScanWithRange100Test was faster. Others were similar. For
>>> >> > > RandomScanWithRange100Test I suspect that I don't have the right
>>> >> > > number for 0.94, so 0.94.11 tests are running right now on the same
>>> >> > > server with the same configuration. I will start to have numbers by
>>> >> > > end of day, else tomorrow morning, but it will most probably take
>>> >> > > about 24h to get all of them.
>>> >> > >
>>> >> > > JM
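For anyone reproducing these numbers: PE is launched through the hbase script. A hedged sketch of typical single-client invocations (flag and command spellings from memory for the 0.94/0.96 line; run the class with no arguments to see the authoritative usage text for your build):

    # run one client in-process instead of as a MapReduce job
    hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomRead 1

    # the RandomScanWithRange10Test row above should correspond to
    hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred scanRange10 1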
>>> >> > >
>>> >> > > 2013/9/2 Stack <[email protected]>:
>>> >> > >> On Mon, Sep 2, 2013 at 9:41 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>> >> > >>
>>> >> > >>> Got it.
>>> >> > >>>
>>> >> > >>> I can't run the integration tests for now because I'm lacking some
>>> >> > >>> servers :( I need to complete some HBase-on-RAID tests before I can
>>> >> > >>> get those new servers.
>>> >> > >>>
>>> >> > >>> First thing: start-hbase.cmd has the execute flag set. I don't
>>> >> > >>> think it's required, and it will help with tab completion if we
>>> >> > >>> can un-set it.
>>> >> > >>>
>>> >> > >>> For 0.96.0RC0, here are my results:
>>> >> > >>> First, I get two .out files each time I start the server, instead
>>> >> > >>> of the usual one, with the same timestamp:
>>> >> > >>>
>>> >> > >>> -rw-r--r-- 1 jmspaggiari jmspaggiari 0 Aug 31 15:38 hbase-jmspaggiari-master-t430s.out
>>> >> > >>> -rw-r--r-- 1 jmspaggiari jmspaggiari 0 Aug 31 15:38 hbase-jmspaggiari-master-t430s.out.1
>>> >> > >>>
>>> >> > >>> In the UI, we say "The .META. table holds references to all User
>>> >> > >>> Table regions", but the table name is "hbase:meta" and not ".META."
>>> >> > >>>
>>> >> > >>> In the logs, I found this exception that I did not have before:
>>> >> > >>> 2013-08-31 18:45:05,490 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
>>> >> > >>> EndOfStreamException: Unable to read additional data from client sessionid 0x140d68bb9d50004, likely client has closed socket
>>> >> > >>>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>>> >> > >>>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>> >> > >>>         at java.lang.Thread.run(Thread.java:662)
>>> >> > >>>
>>> >> > >>> I ran PE over 48h. I don't have another 0.96 baseline to compare
>>> >> > >>> with, so I compared with 0.94.
>>> >> > >>> org.apache.hadoop.hbase.PerformanceEvaluation$RandomSeekScanTest is
>>> >> > >>> about 3 times slower with 0.96.
>>> >> > >>> org.apache.hadoop.hbase.PerformanceEvaluation$RandomScanWithRange100Test
>>> >> > >>> seems to be 2 times faster.
>>> >> > >>>
>>> >> > >>> Writes also are faster, but I changed my hard drive since I ran
>>> >> > >>> 0.94, so I'm currently re-running 0.94 and will provide another,
>>> >> > >>> more accurate comparison soon.
>>> >> > >>>
>>> >> > >>> Ran LoadTestTool: Failed to write keys: 0
>>> >> > >>>
>>> >> > >>> I have not been able to run IntegrationTestLoadAndVerify or
>>> >> > >>> IntegrationTestBigLinkedList because of the lack of servers.
>>> >> > >>>
>>> >> > >>> So overall, it seems to be working fine, but I have not been able
>>> >> > >>> to test this release as deeply as I usually test the 0.94 releases.
>>> >> > >>
>>> >> > >> Thank you for taking it for a spin, JMS -- you the man. If you
>>> >> > >> don't open issues for the teething stuff, I will.
>>> >> > >>
>>> >> > >> So 0.96 is slower doing RandomSeekScanTest but faster on your other
>>> >> > >> tests? I can take a look too.
>>> >> > >>
>>> >> > >> St.Ack
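On the EndOfStreamException that shows up twice in this thread: it is ZooKeeper's server-side log line for a client whose socket died without a session close, which is exactly what killing PE mid-run produces, so it should be harmless noise. A minimal sketch of the distinction using the plain ZooKeeper client API (connect string and timeout are placeholders):

    import org.apache.zookeeper.ZooKeeper;

    public class ZkSessionDemo {
      public static void main(String[] args) throws Exception {
        // 30s session timeout, no-op default watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // ... do work against the cluster ...

        // Clean shutdown: sends a session-close request, so the server
        // logs nothing unusual.
        zk.close();

        // By contrast, kill -9 on this process (what killing PE mid-run
        // amounts to) drops the TCP connection with no close, and the
        // server logs "EndOfStreamException: Unable to read additional
        // data from client ... likely client has closed socket".
      }
    }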
