[ https://issues.apache.org/jira/browse/HBASE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321319#comment-16321319 ]
Appy commented on HBASE-19715:
------------------------------
Ahh, why are we using an exception here? An optional boolean flag would
have been better. But anyway, we can't change that now.
If we are really using a free-form byte array to convey the state back, this
situation can really blow out of proportion: say a batch of 1000 requests
reaches the response limit at the 100th; we'll then repeat the state (the
exception) for the remaining 900. The best we can do now is reduce its size.
Let's not hack our system more.
Here's the proposed solution, including some of the changes by [~chia7712] above.
- Cache NameBytePair for the exception
- Pass an empty string to the constructor (the exception class comments are good
and clearly state what it means). The choice is between no exception message and
the RS dying because of a long GC pause, since it had to build a NameBytePair for
100s of requests. I saw that in tests.
- No need to include the stack trace in the NameBytePair
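
To make the three points above concrete, here is a minimal sketch of the idea, not the actual patch. All names in it (ExceptionPayload, ResponseLimitReachedException, BatchResponder) are hypothetical stand-ins rather than HBase classes: the payload is built once with an empty message and no stack trace, and every action past the limit reuses that same object.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the name/bytes pair that carries an exception to the client. */
final class ExceptionPayload {
    final String className;
    final byte[] value;

    ExceptionPayload(String className, byte[] value) {
        this.className = className;
        this.value = value;
    }
}

/** Hypothetical stand-in for the "response limit reached" exception. */
final class ResponseLimitReachedException extends Exception {
    ResponseLimitReachedException(String message) {
        super(message);
    }
}

final class BatchResponder {
    // Built exactly once: class name plus an empty message, no stack trace serialized.
    private static final ExceptionPayload LIMIT_REACHED =
        build(new ResponseLimitReachedException(""));

    /** Serializes only the class name and the message bytes (assumption: that is enough for the client). */
    static ExceptionPayload build(Throwable t) {
        String message = t.getMessage() == null ? "" : t.getMessage();
        return new ExceptionPayload(
            t.getClass().getName(), message.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Returns one result per action. Actions before limitIndex get a normal
     * result; all later actions share the single cached payload instead of
     * each allocating a new one.
     */
    static List<Object> respond(int batchSize, int limitIndex) {
        List<Object> results = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            results.add(i < limitIndex ? "ok-" + i : LIMIT_REACHED);
        }
        return results;
    }
}
```

With the batch from the example above, BatchResponder.respond(1000, 100) allocates no new payloads for the trailing 900 entries; they all point at the same cached instance.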
> Fix timing out test TestMultiRespectsLimits
> -------------------------------------------
>
> Key: HBASE-19715
> URL: https://issues.apache.org/jira/browse/HBASE-19715
> Project: HBase
> Issue Type: Bug
> Reporter: Appy
> Assignee: Appy
> Attachments: HBASE-19715.test.patch, HBASE-19715.test.v2.patch,
> failued.txt, passed.txt, screenshot-1.png, screenshot-2.png,
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png
>
>
> !screenshot-1.png|width=800px!
> Attached logs for both cases, when it passes and fails.
> Link (temporary) to logs:
> passed:
> http://104.198.223.121:8080/job/HBase-Flaky-Tests/33449/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiRespectsLimits-output.txt/*view*/
> failed:
> http://104.198.223.121:8080/job/HBase-Flaky-Tests/33455/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiRespectsLimits-output.txt/*view*/
> Correlating across more runs, whenever the test passes, it does so within
> 10-30sec of the 3min deadline for medium tests.
> So I think we can make it pass by just increasing the timeout.
> But I'm a bit skeptical after seeing all those long GC pauses (10sec +) in
> the log. Test code doesn't seem to be doing anything that intensive. Are we
> mismanaging the memory somewhere?