[jira] [Commented] (HADOOP-16238) Add the possibility to set SO_REUSEADDR in IPC Server Listener

2019-04-09 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813219#comment-16813219
 ] 

Wilfred Spiegelenburg commented on HADOOP-16238:


Hi [~pbacsko], I have one tiny remark on the patch: there is a spurious empty 
line in the core-default change which could be removed. 
That could be fixed on check-in and should not need a new patch.

Besides that, the change looks good to me. +1 (non-binding) from my side.

> Add the possibility to set SO_REUSEADDR in IPC Server Listener
> -
>
> Key: HADOOP-16238
> URL: https://issues.apache.org/jira/browse/HADOOP-16238
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
> Attachments: HADOOP-16238-001.patch, HADOOP-16238-002.patch, 
> HADOOP-16238-003.patch
>
>
> Currently we can't enable SO_REUSEADDR in the IPC Server. In some 
> circumstances this would be desirable; see the explanation here:
> [https://developer.ibm.com/tutorials/l-sockpit/#pitfall-3-address-in-use-error-eaddrinuse-]
> Rarely, it also causes problems in the test case 
> {{TestMiniMRClientCluster.testRestart}}:
> {noformat}
> 2019-04-04 11:21:31,896 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(273)) - Service 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state 
> STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [test-host:35491] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [test-host:35491] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException
>  at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138)
>  at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
>  at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:178)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:165)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>  at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1244)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>  at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:355)
>  at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:127)
>  at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:493)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>  at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>  at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:312)
>  at 
> org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceStart(MiniMRYarnCluster.java:210)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>  at 
> org.apache.hadoop.mapred.MiniMRYarnClusterAdapter.restart(MiniMRYarnClusterAdapter.java:73)
>  at 
> org.apache.hadoop.mapred.TestMiniMRClientCluster.testRestart(TestMiniMRClientCluster.java:114)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){noformat}
>  
> At least for testing, having this socket option enabled is beneficial. We 
> could enable this with a new property like {{ipc.server.reuseaddr}}.
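To make the idea concrete, here is a minimal sketch of what such a flag would do when the listener binds its server channel. The helper below is hypothetical and not the actual patch; only the use of {{StandardSocketOptions.SO_REUSEADDR}} reflects the real socket option, and the {{ipc.server.reuseaddr}} value would be read from the configuration before calling it.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;

public class ReuseAddrBind {
  // Hypothetical helper: bind a server channel, optionally with SO_REUSEADDR.
  public static ServerSocketChannel bind(String host, int port, boolean reuseAddr)
      throws IOException {
    ServerSocketChannel channel = ServerSocketChannel.open();
    if (reuseAddr) {
      // Allows binding to a port whose previous socket is still in TIME_WAIT,
      // avoiding the "Address already in use" failure seen in the
      // MiniYARNCluster restart above.
      channel.setOption(StandardSocketOptions.SO_REUSEADDR, true);
    }
    channel.socket().bind(new InetSocketAddress(host, port), 128);
    return channel;
  }
}
{code}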






[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name cannot be resolved

2018-10-28 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1600#comment-1600
 ] 

Wilfred Spiegelenburg commented on HADOOP-15864:


[~hexiaoqiao] there is a big side effect outside HDFS as well, which is causing 
other JUnit tests in YARN to fail. I looked a bit further into this change and 
found a number of other callers of {{SecurityUtil.buildTokenService}} in YARN and 
MAPREDUCE, and none of them seem to handle a {{null}} return value. All calls do a 
toString on the returned value without a null check, causing an NPE.
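As a generic illustration of that caller pattern (the class and variable names below are made up, not a specific YARN call site):

{code:java}
import java.net.InetSocketAddress;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;

public class TokenServiceCaller {
  // Simplified caller: the returned Text is used directly, so a null return
  // from buildTokenService() becomes an NPE on the toString() call below.
  static void setService(Token<?> token, InetSocketAddress addr) {
    Text service = SecurityUtil.buildTokenService(addr);
    token.setService(new Text(service.toString()));
  }
}
{code}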

I think we need to reconsider the null return value.


> Job submitter / executor fail when SBN domain name cannot be resolved
> ---
>
> Key: HADOOP-15864
> URL: https://issues.apache.org/jira/browse/HADOOP-15864
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1
>
> Attachments: HADOOP-15864-branch.2.7.001.patch, 
> HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, 
> HADOOP-15864.branch.2.7.004.patch
>
>
> Job submission and task execution fail if the Standby NameNode domain name 
> cannot be resolved on HDFS HA with the DelegationToken feature.
> This issue is triggered when creating a {{ConfiguredFailoverProxyProvider}} 
> instance, which invokes {{HAUtil.cloneDelegationTokenForLogicalUri}} in HA mode 
> with security. Since in HDFS HA mode the UGI needs to include a separate token 
> for each NameNode in order to deal with the Active-Standby switch, the two 
> tokens' content is of course the same. 
> However, #setTokenService in {{HAUtil.cloneDelegationTokenForLogicalUri}} 
> checks whether the address of the NameNode has been resolved or not; if not, it 
> throws an #IllegalArgumentException, and the job submitter / task executor fails.
> HDFS-8068 and HADOOP-12125 tried to fix it, but I don't think those two tickets 
> resolve it completely.
> Another question many people ask is why a NameNode domain name cannot be 
> resolved. I think there are many scenarios, for instance node replacement after 
> a fault, or a DNS refresh. In any case, a Standby NameNode failure should not 
> impact Hadoop cluster stability in my opinion.
> a. code ref: org.apache.hadoop.security.SecurityUtil line373-386
> {code:java}
>   public static Text buildTokenService(InetSocketAddress addr) {
> String host = null;
> if (useIpForTokenService) {
>   if (addr.isUnresolved()) { // host has no ip address
> throw new IllegalArgumentException(
> new UnknownHostException(addr.getHostName())
> );
>   }
>   host = addr.getAddress().getHostAddress();
> } else {
>   host = StringUtils.toLowerCase(addr.getHostName());
> }
> return new Text(host + ":" + addr.getPort());
>   }
> {code}
> b.exception log ref:
> {code:xml}
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Couldn't create proxy provider class 
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:170)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:761)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:691)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:150)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at 
> org.apache.hadoop.fs.viewfs.ChRootedFileSystem.(ChRootedFileSystem.java:106)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:178)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:172)
> at org.apache.hadoop.fs.viewfs.InodeTree.createLink(InodeTree.java:303)
> at org.apache.hadoop.fs.viewfs.InodeTree.(InodeTree.java:377)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem$1.(ViewFileSystem.java:172)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.initialize(ViewFileSystem.java:172)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)

[jira] [Commented] (HADOOP-15836) Review of AccessControlList

2018-10-23 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661488#comment-16661488
 ] 

Wilfred Spiegelenburg commented on HADOOP-15836:


I still think we need to make sure that the tests are not sensitive to ordering 
either. In the YARN case the user that runs the service is added to the admin ACL 
by the code (YARN-3804). If we rely on insertion order, a code change could 
then break the tests again. Checking the returned ACL string independently of the 
order is not that difficult (see the attached proposal 
[^assertEqualACLStrings.patch] and the sketch below).

Replacing all the direct string comparisons of ACLs in tests should be a small 
effort with that.
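A rough outline of that order-insensitive check; the ACL string format assumed here ("user1,user2 group1,group2") and the helper names are illustrative, not the contents of the attached patch:

{code:java}
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AclTestUtil {
  // Compare two ACL strings while ignoring the order of users and groups.
  public static void assertEqualACLStrings(String expected, String actual) {
    assertEquals(usersOf(expected), usersOf(actual));
    assertEquals(groupsOf(expected), groupsOf(actual));
  }

  private static Set<String> usersOf(String acl) {
    return toSet(acl.split(" ", 2)[0]);
  }

  private static Set<String> groupsOf(String acl) {
    String[] parts = acl.split(" ", 2);
    return parts.length > 1 ? toSet(parts[1]) : new HashSet<String>();
  }

  private static Set<String> toSet(String csv) {
    Set<String> set = new HashSet<>(Arrays.asList(csv.split(",")));
    set.remove("");   // ignore empty tokens from stray commas
    return set;
  }
}
{code}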

> Review of AccessControlList
> ---
>
> Key: HADOOP-15836
> URL: https://issues.apache.org/jira/browse/HADOOP-15836
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: common, security
>Affects Versions: 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HADOOP-15836.1.patch, assertEqualACLStrings.patch
>
>
> * Improve unit tests (expected / actual were backwards)
> * Unit tests expected elements to be in order but the class's returned 
> Collections were unordered
> * Formatting cleanup
> * Removed superfluous white space
> * Remove use of LinkedList
> * Removed superfluous code
> * Use {{unmodifiable}} Collections where JavaDoc states that caller must not 
> manipulate the data structure






[jira] [Updated] (HADOOP-15836) Review of AccessControlList

2018-10-23 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated HADOOP-15836:
---
Attachment: assertEqualACLStrings.patch

> Review of AccessControlList
> ---
>
> Key: HADOOP-15836
> URL: https://issues.apache.org/jira/browse/HADOOP-15836
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: common, security
>Affects Versions: 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HADOOP-15836.1.patch, assertEqualACLStrings.patch
>
>
> * Improve unit tests (expected / actual were backwards)
> * Unit tests expected elements to be in order but the class's returned 
> Collections were unordered
> * Formatting cleanup
> * Removed superfluous white space
> * Remove use of LinkedList
> * Removed superfluous code
> * Use {{unmodifiable}} Collections where JavaDoc states that caller must not 
> manipulate the data structure






[jira] [Commented] (HADOOP-12640) Code Review AccessControlList

2018-10-22 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-12640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660098#comment-16660098
 ] 

Wilfred Spiegelenburg commented on HADOOP-12640:


I ran into this JIRA because of the test failures introduced via HADOOP-15836. 
This change could break the ACLs.

The split of the string used in buildACLFromString uses a greedy 
quantifier. This changes the path taken through split(): instead of the simple 
non-regular-expression fast path, it now compiles the pattern and uses that to 
build the ACL. It does not change the outcome, but it is more expensive.

It also includes two behavioural changes:
# The way empty values are interpreted when a string is converted 
into an ACL. Using this string as the input as an example: {code}",joe 
tardis,,users"{code} Currently that gives me one user, {{"joe"}}, and the groups 
{{"tardis"}} and {{"users"}}. With your code change I get back two extra empty 
entries: one in the users and one in the groups (see the sketch below). This 
might cause behavioural changes.
# The other behavioural change is that a {{null}} string used to throw an NPE. 
It is now silently ignored and turned into a "block everything" ACL. 
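To make the empty-entry difference in point 1 concrete, a small standalone sketch; the parsing here is a simplified stand-in for what buildACLFromString does, not the actual code:

{code:java}
import java.util.Arrays;

public class AclSplitDemo {
  public static void main(String[] args) {
    // The example input from point 1: users before the space, groups after it.
    String aclString = ",joe tardis,,users";
    String[] parts = aclString.split(" ", 2);

    // A plain split keeps the empty tokens:
    System.out.println(Arrays.toString(parts[0].split(",")));   // [, joe]
    System.out.println(Arrays.toString(parts[1].split(",")));   // [tardis, , users]

    // The behaviour before the change effectively dropped the empty tokens,
    // leaving only "joe", "tardis" and "users".
  }
}
{code}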

> Code Review AccessControlList
> -
>
> Key: HADOOP-12640
> URL: https://issues.apache.org/jira/browse/HADOOP-12640
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AccessControlList.patch, AccessControlList.patch, 
> HADOOP-12640.1.patch
>
>
> After some confusion of my own, in particular with 
> "mapreduce.job.acl-view-job," I have looked over the AccessControlList 
> implementation and cleaned it up and clarified a few points.
> 1) I added tests to demonstrate the existing behavior of including an 
> asterisk in either the username or the group field: it overrides everything 
> and allows all access.
> "user1,user2,user3 *" = all access
> "* group1,group2" = all access
> "* *" = all access
> "* " = all access
> " *" = all access
> 2) General clean-up and simplification






[jira] [Commented] (HADOOP-15621) S3Guard: Implement time-based (TTL) expiry for Authoritative Directory Listing

2018-10-03 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636502#comment-16636502
 ] 

Wilfred Spiegelenburg commented on HADOOP-15621:


[~gabor.bota] and [~fabbri]: this check-in has broken the build.

Based on the commit message, two new files were missed on check-in:
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/ExpirableMetadata.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3GuardTtl.java

Can you please check? I cannot reopen this JIRA as I do not seem to have the 
permission in the hadoop-common project.

> S3Guard: Implement time-based (TTL) expiry for Authoritative Directory Listing
> --
>
> Key: HADOOP-15621
> URL: https://issues.apache.org/jira/browse/HADOOP-15621
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.0.0-beta1
>Reporter: Aaron Fabbri
>Assignee: Gabor Bota
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HADOOP-15621.001.patch, HADOOP-15621.002.patch
>
>
> Similar to HADOOP-13649, I think we should add a TTL (time to live) feature 
> to the Dynamo metadata store (MS) for S3Guard.
> This is a similar concept to an "online algorithm" version of the CLI prune() 
> function, which is the "offline algorithm".
> Why: 
>  1. Self healing (soft state): since we do not implement transactions around 
> modification of the two systems (s3 and metadata store), certain failures can 
> lead to inconsistency between S3 and the metadata store (MS) state. Having a 
> time to live (TTL) on each entry in S3Guard means that any inconsistencies 
> will be time bound. Thus "wait and restart your job" becomes a valid, if 
> ugly, way to get around any issues with FS client failure leaving things in a 
> bad state.
>  2. We could make manual invocation of `hadoop s3guard prune ...` 
> unnecessary, depending on the implementation.
>  3. Makes it possible to fix the problem that dynamo MS prune() doesn't prune 
> directories due to the lack of true modification time.
> How:
>  I think we need a new column in the dynamo table "entry last written time". 
> This is updated each time the entry is written to dynamo.
>  After that we can either
>  1. Have the client simply ignore / elide any entries that are older than the 
> configured TTL (see the sketch after the description).
>  2. Have the client delete entries older than the TTL.
> The issue with #2 is it will increase latency if done inline in the context 
> of an FS operation. We could mitigate this some by using an async helper 
> thread, or probabilistically doing it "some times" to amortize the expense of 
> deleting stale entries (allowing some batching as well).
> Caveats:
>  - Clock synchronization as usual is a concern. Many clusters already keep 
> clocks close enough via NTP. We should at least document the requirement 
> along with the configuration knob that enables the feature.
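For illustration, a minimal sketch of the expiry check behind option 1 in the "How" section; the class, field and method names below are made up and do not reflect the committed code:

{code:java}
public class S3GuardTtlSketch {
  private final long ttlMillis;

  public S3GuardTtlSketch(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  // Option 1: treat an entry as absent when it is older than the configured
  // TTL. lastWrittenMillis stands in for the proposed "entry last written
  // time" column; a TTL of 0 disables expiry.
  public boolean isExpired(long lastWrittenMillis, long nowMillis) {
    return ttlMillis > 0 && (nowMillis - lastWrittenMillis) > ttlMillis;
  }
}
{code}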






[jira] [Commented] (HADOOP-13064) LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader

2016-04-28 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263563#comment-15263563
 ] 

Wilfred Spiegelenburg commented on HADOOP-13064:


Is this not a duplicate of MAPREDUCE-6481 in combination with MAPREDUCE-5948?

> LineReader reports incorrect number of bytes read resulting in correctness 
> issues using LineRecordReader
> 
>
> Key: HADOOP-13064
> URL: https://issues.apache.org/jira/browse/HADOOP-13064
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Joe Ellis
>Priority: Critical
> Attachments: LineReaderTest.java
>
>
> The specific issue we were seeing with LineReader is that when we pass in 
> '\r\n' as the line delimiter the number of bytes that it claims to have read 
> is less than what it actually read. We narrowed this down to only happening 
> when the delimiter is split across the internal buffer boundary, so if 
> fillbuffer fills with "row\r" and the next call fills with "\n" then the 
> number of bytes reported would be 4 rather than 5.
> This results in correctness issues in LineRecordReader because if this 
> off-by-one issue is seen enough times when reading a split, it will continue to 
> read records past its split boundary, resulting in records appearing to come 
> from multiple splits.
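A hypothetical reproduction sketch of that buffer-boundary case (this is not the attached LineReaderTest.java; the 4-byte buffer size is simply chosen to force the delimiter to split across two fills):

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class LineReaderBoundaryDemo {
  public static void main(String[] args) throws Exception {
    byte[] data = "row\r\nnext\r\n".getBytes(StandardCharsets.UTF_8);
    byte[] delimiter = "\r\n".getBytes(StandardCharsets.UTF_8);
    // A 4-byte internal buffer makes the first fill end with "row\r" and the
    // "\n" arrive in the next fill, splitting the delimiter across the boundary.
    LineReader reader = new LineReader(new ByteArrayInputStream(data), 4, delimiter);
    Text line = new Text();
    int bytesRead = reader.readLine(line);
    // The report above describes bytesRead coming back as 4 instead of 5 here.
    System.out.println("line=" + line + " bytesRead=" + bytesRead);
    reader.close();
  }
}
{code}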






[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-09-14 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744795#comment-14744795
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


Sorry, I have been occupied with a number of other things recently. 
I finally have some cycles and will look at this over the coming days.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not time out when used to 
> write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
> fall back to the TCP-level retry (configured via tcp_retries2) and time out 
> after 15-30 minutes, which is too long for a default behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350
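As an illustration of the proposed default, a small sketch of the fallback described above; the helper name and the 60000 ms fallback value are assumptions, only the ipc.ping.interval property comes from the description:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutDefaults {
  // Sketch of the proposal: when no timeout is passed in, fall back to the
  // value read for ipc.ping.interval instead of 0.
  static int resolveTimeout(Configuration conf, int requestedTimeout) {
    if (requestedTimeout > 0) {
      return requestedTimeout;
    }
    return conf.getInt("ipc.ping.interval", 60000);
  }
}
{code}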





[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-12-03 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233209#comment-14233209
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


[~andrew.wang] due to the way the rpc timeout in the client code overrides the 
ping timeout, you are most likely correct. I'll have to step through the code 
in the client to make sure it behaves as intended. The ping is generated after 
a {{SocketTimeoutException}} is thrown on the input stream, which is triggered 
by the {{setSoTimeout(pingInterval)}} on the socket; combined with the 
override that could be a problem. This might require a further decoupling of 
the ping and rpc timeouts.

I also noticed that the ping output stream is created with a fixed timeout of 
0, which means we can still hang there after the changes.

Looking at the HDFS code to see how it is handled there, all references 
to the timeout that we are setting call it a socket write timeout. I am 
happy to call it something else, but this seems to be in line with HDFS as well. 
SO_SNDTIMEO only comes into play when the send buffers at the OS level on 
the local machine are full (as far as I am aware). If the buffer was not full 
when I wrote the data, the timeout will never trigger and I fall directly 
through to the TCP retries. That case should be handled by the timeout we are 
setting.

The default change was a proposal and setting it to 0 is the right choice for 
backwards compatibility.
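For reference, a rough illustration of the ping mechanism being discussed; this is not the actual Client.java code and the ping marker value is a placeholder:

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PingLoopSketch {
  // The read times out every pingInterval ms, a ping is written, and the read
  // is retried. Note that the write of the ping itself can still block
  // indefinitely if no write timeout is set on the output stream.
  static int readWithPing(Socket socket, DataInputStream in, DataOutputStream out,
      int pingInterval) throws Exception {
    socket.setSoTimeout(pingInterval);
    while (true) {
      try {
        return in.readInt();
      } catch (SocketTimeoutException e) {
        out.writeInt(-1);   // placeholder ping marker, hypothetical value
        out.flush();
      }
    }
  }
}
{code}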

 RPC client write does not time out by default
 -

 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Critical
 Attachments: HADOOP-11252.patch


 The RPC client has a default timeout set to 0 when no timeout is passed in. 
 This means that the network connection created will not time out when used to 
 write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
 fall back to the TCP-level retry (configured via tcp_retries2) and time out 
 after 15-30 minutes, which is too long for a default behaviour.
 Using 0 as the default value for timeout is incorrect. We should use a sane 
 value for the timeout and the ipc.ping.interval configuration value is a 
 logical choice for it. The default behaviour should be changed from 0 to the 
 value read for the ping interval from the Configuration.
 Fixing it in common makes more sense than finding and changing all other 
 points in the code that do not pass in a timeout.
 Offending code lines:
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
 and 
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350





[jira] [Assigned] (HADOOP-11252) RPC client write does not time out by default

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned HADOOP-11252:
--

Assignee: Wilfred Spiegelenburg

 RPC client write does not time out by default
 -

 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Critical

 The RPC client has a default timeout set to 0 when no timeout is passed in. 
 This means that the network connection created will not time out when used to 
 write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
 fall back to the TCP-level retry (configured via tcp_retries2) and time out 
 after 15-30 minutes, which is too long for a default behaviour.
 Using 0 as the default value for timeout is incorrect. We should use a sane 
 value for the timeout and the ipc.ping.interval configuration value is a 
 logical choice for it. The default behaviour should be changed from 0 to the 
 value read for the ping interval from the Configuration.
 Fixing it in common makes more sense than finding and changing all other 
 points in the code that do not pass in a timeout.
 Offending code lines:
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
 and 
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350





[jira] [Updated] (HADOOP-11252) RPC client write does not time out by default

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated HADOOP-11252:
---
Attachment: HADOOP-11252.patch

 RPC client write does not time out by default
 -

 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Critical
 Attachments: HADOOP-11252.patch


 The RPC client has a default timeout set to 0 when no timeout is passed in. 
 This means that the network connection created will not time out when used to 
 write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
 fall back to the TCP-level retry (configured via tcp_retries2) and time out 
 after 15-30 minutes, which is too long for a default behaviour.
 Using 0 as the default value for timeout is incorrect. We should use a sane 
 value for the timeout and the ipc.ping.interval configuration value is a 
 logical choice for it. The default behaviour should be changed from 0 to the 
 value read for the ping interval from the Configuration.
 Fixing it in common makes more sense than finding and changing all other 
 points in the code that do not pass in a timeout.
 Offending code lines:
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
 and 
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350





[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228120#comment-14228120
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


A first version of a patch to set a value for the write timeout. I have used 
the proposed write timeout property name as given by [~cmccabe]. I have set it 
to an arbitrary default of 5 minutes, which seems reasonable. We still allow 
anyone to set it to 0 (no timeout) if they want to.

The cases described by [~mingma] above (YARN-2714, HDFS-4858 and YARN-2578) all 
have the same cause: no response to a write. Setting this timeout should solve 
all of these issues.
One point I think needs to be checked is getTimeout() in Client.java. It 
uses the ping interval as a timeout if ping is not enabled. I think that this 
should be changed to use the same timeout as this change introduces. Also, 
changing the timeout based on the ping interval is not really logical. This 
JIRA might not be the correct one to introduce that change, so I left it out.
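A rough sketch of the idea behind the patch; the property name "ipc.client.write.timeout" and the 5-minute default below are assumptions for illustration, not necessarily the names used in the attached patch:

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.NetUtils;

public class WriteTimeoutSketch {
  // Read a write timeout from the Configuration and use it when wrapping the
  // socket's output stream, so writes no longer block indefinitely.
  static OutputStream wrapOutput(Socket socket, Configuration conf) throws IOException {
    int writeTimeout = conf.getInt("ipc.client.write.timeout", 5 * 60 * 1000);
    // NetUtils.getOutputStream() enforces the timeout per write when it is > 0.
    return NetUtils.getOutputStream(socket, writeTimeout);
  }
}
{code}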

 RPC client write does not time out by default
 -

 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Critical
 Attachments: HADOOP-11252.patch


 The RPC client has a default timeout set to 0 when no timeout is passed in. 
 This means that the network connection created will not time out when used to 
 write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
 fall back to the TCP-level retry (configured via tcp_retries2) and time out 
 after 15-30 minutes, which is too long for a default behaviour.
 Using 0 as the default value for timeout is incorrect. We should use a sane 
 value for the timeout and the ipc.ping.interval configuration value is a 
 logical choice for it. The default behaviour should be changed from 0 to the 
 value read for the ping interval from the Configuration.
 Fixing it in common makes more sense than finding and changing all other 
 points in the code that do not pass in a timeout.
 Offending code lines:
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
 and 
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350





[jira] [Updated] (HADOOP-11252) RPC client write does not time out by default

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated HADOOP-11252:
---
Status: Patch Available  (was: Open)

 RPC client write does not time out by default
 -

 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Critical
 Attachments: HADOOP-11252.patch


 The RPC client has a default timeout set to 0 when no timeout is passed in. 
 This means that the network connection created will not time out when used to 
 write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then 
 fall back to the TCP-level retry (configured via tcp_retries2) and time out 
 after 15-30 minutes, which is too long for a default behaviour.
 Using 0 as the default value for timeout is incorrect. We should use a sane 
 value for the timeout and the ipc.ping.interval configuration value is a 
 logical choice for it. The default behaviour should be changed from 0 to the 
 value read for the ping interval from the Configuration.
 Fixing it in common makes more sense than finding and changing all other 
 points in the code that do not pass in a timeout.
 Offending code lines:
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
 and 
 https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350





[jira] [Created] (HADOOP-11323) WritableComparator: default implementation of compare keeps reference to byte array

2014-11-20 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created HADOOP-11323:
--

 Summary: WritableComparator: default implementation of compare 
keeps reference to byte array
 Key: HADOOP-11323
 URL: https://issues.apache.org/jira/browse/HADOOP-11323
 Project: Hadoop Common
  Issue Type: Improvement
  Components: performance
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg


When the default compare is used on a WritableComparator, a reference to the 
second passed-in byte array is kept in the buffer. Since WritableComparator 
keeps a reference to the buffer, the byte array will never be garbage collected. 
This can lead to higher heap use than needed.

The buffer should drop the reference to the byte array passed in. We can null 
out the byte array reference since the buffer is a private variable for the 
class.
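A sketch of the idea only (not the attached patch): deserialise the keys as the default compare does, then reset the buffer in a finally block so the reference to the caller's byte arrays is dropped.

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.WritableComparable;

public class NonRetainingCompare {
  private final DataInputBuffer buffer = new DataInputBuffer();

  @SuppressWarnings({"rawtypes", "unchecked"})
  int compare(WritableComparable key1, WritableComparable key2,
      byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) throws IOException {
    try {
      buffer.reset(b1, s1, l1);    // parse key1 out of the first byte array
      key1.readFields(buffer);
      buffer.reset(b2, s2, l2);    // parse key2 out of the second byte array
      key2.readFields(buffer);
    } finally {
      buffer.reset(null, 0, 0);    // drop the reference to the byte array
    }
    return key1.compareTo(key2);
  }
}
{code}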





[jira] [Updated] (HADOOP-11323) WritableComparator: default implementation of compare keeps reference to byte array

2014-11-20 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated HADOOP-11323:
---
Attachment: HADOOP-11323.patch

 WritableComparator: default implementation of compare keeps reference to byte 
 array
 ---

 Key: HADOOP-11323
 URL: https://issues.apache.org/jira/browse/HADOOP-11323
 Project: Hadoop Common
  Issue Type: Improvement
  Components: performance
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
 Attachments: HADOOP-11323.patch


 When the default compare is used on a WritableComparator, a reference to the 
 second passed-in byte array is kept in the buffer. Since WritableComparator 
 keeps a reference to the buffer, the byte array will never be garbage collected. 
 This can lead to higher heap use than needed.
 The buffer should drop the reference to the byte array passed in. We can null 
 out the byte array reference since the buffer is a private variable for the 
 class.





[jira] [Updated] (HADOOP-11323) WritableComparator: default implementation of compare keeps reference to byte array

2014-11-20 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated HADOOP-11323:
---
Assignee: Wilfred Spiegelenburg
  Status: Patch Available  (was: Open)

 WritableComparator: default implementation of compare keeps reference to byte 
 array
 ---

 Key: HADOOP-11323
 URL: https://issues.apache.org/jira/browse/HADOOP-11323
 Project: Hadoop Common
  Issue Type: Improvement
  Components: performance
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
 Attachments: HADOOP-11323.patch


 When the default compare is used on a WritableComparator, a reference to the 
 second passed-in byte array is kept in the buffer. Since WritableComparator 
 keeps a reference to the buffer, the byte array will never be garbage collected. 
 This can lead to higher heap use than needed.
 The buffer should drop the reference to the byte array passed in. We can null 
 out the byte array reference since the buffer is a private variable for the 
 class.





[jira] [Created] (HADOOP-11252) RPC client write does not time out by default

2014-10-30 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created HADOOP-11252:
--

 Summary: RPC client write does not time out by default
 Key: HADOOP-11252
 URL: https://issues.apache.org/jira/browse/HADOOP-11252
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg


The RPC client has a default timeout set to 0 when no timeout is passed in. 
This means that the network connection created will not time out when used to 
write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then fall 
back to the TCP-level retry (configured via tcp_retries2) and time out after 
15-30 minutes, which is too long for a default behaviour.

Using 0 as the default value for timeout is incorrect. We should use a sane 
value for the timeout and the ipc.ping.interval configuration value is a 
logical choice for it. The default behaviour should be changed from 0 to the 
value read for the ping interval from the Configuration.

Fixing it in common makes more sense than finding and changing all other points 
in the code that do not pass in a timeout.

Offending code lines:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
and 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350


