[ 
https://issues.apache.org/jira/browse/HDDS-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835619#comment-17835619
 ] 

Mohammad Arafat Khan commented on HDDS-10614:
---------------------------------------------

The issue you are encountering is related to the *CachingSpaceUsageSource* and 
its interaction with the underlying {*}SpaceUsageSource{*}. The exception 
{*}java.lang.IllegalArgumentException{*}: _used space cannot be negative_ 
indicates that somewhere in the process, the calculation of used space has 
resulted in a negative number, which is not a valid state.

The root cause seems to be an error during the refresh operation of the 
{*}CachingSpaceUsageSource{*}. Specifically, the underlying *SpaceUsageSource* 
({_}likely an instance of DU, which uses the Unix du command to calculate disk 
usage{_}) is failing due to a permission issue when trying to read the 
*/data3/lost+found* directory. This failure might cause the getUsedSpace method 
to return an incorrect value ({*}possibly zero{*}), which, when decremented, 
results in a negative value. Here are the steps that are likely leading to the 
issue:
 # The CachingSpaceUsageSource attempts to refresh the used space metric by 
calling the refresh method.
 # The DU class, which is the actual SpaceUsageSource, tries to execute the du 
command to calculate the used space.
 # The du command fails with a permission error when trying to access the 
/data3/lost+found directory, which is a system directory typically accessible 
only by the root user.
 # Due to the error, the DU class might return an incorrect value (such as 
zero) for the used space.
 # The CachingSpaceUsageSource receives this incorrect value and updates its 
cached value accordingly. Subsequent calls to decrementUsedSpace then cause the 
cached value to go negative, which is not allowed and results in the 
{*}IllegalArgumentException{*}.

Fix for this from the recon side should be straightforward, we can Improve 
error handling in the CachingSpaceUsageSource to ensure that a failure in the 
DU class does not lead to an invalid state and if it does than we have a 
fallback value for it. 

cc: [~deveshsingh] [~zitadombi] 

> Recon fails to start with "used space cannot be negative"
> ---------------------------------------------------------
>
>                 Key: HDDS-10614
>                 URL: https://issues.apache.org/jira/browse/HDDS-10614
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon
>    Affects Versions: 1.4.0
>            Reporter: Kirill Sizov
>            Assignee: Mohammad Arafat Khan
>            Priority: Blocker
>
> Recon fails to start with the following exception. A restart does not help.
> {noformat}
> [qtp2008821270-42945] WARN org.eclipse.jetty.server.HttpChannelState: 
> unhandled due to prior sendError
> javax.servlet.ServletException: javax.servlet.ServletException: 
> java.lang.IllegalArgumentException: used space cannot be negative.
>         at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162)
>         at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>         at org.eclipse.jetty.server.Server.handle(Server.java:516)
>         at 
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
>         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
>         at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>         at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>         at 
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>         at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
>         at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
>         at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
>         at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
>         at 
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: javax.servlet.ServletException: 
> java.lang.IllegalArgumentException: used space cannot be negative.
>         at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410)
>         at 
> org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
>         at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
>         at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
>         at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
>         at 
> com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:290)
>         at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:280)
>         at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:184)
>         at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:89)
>         at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
>         at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
>         at 
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>         at 
> org.apache.hadoop.hdds.server.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1648)
>         at 
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>         at 
> org.apache.hadoop.hdds.server.http.NoCacheFilter.doFilter(NoCacheFilter.java:48)
>         at 
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
>         at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
>         ... 17 more
> Caused by: java.lang.IllegalArgumentException: used space cannot be negative.
>         at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
>         at 
> org.apache.hadoop.hdds.scm.container.placement.metrics.SCMNodeStat.<init>(SCMNodeStat.java:43)
>         at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.getNodeStatInternal(SCMNodeManager.java:973)
>         at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.getNodeStats(SCMNodeManager.java:882)
>         at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.getStats(SCMNodeManager.java:853)
>         at 
> org.apache.hadoop.ozone.recon.api.ClusterStateEndpoint.getClusterState(ClusterStateEndpoint.java:120)
>         at sun.reflect.GeneratedMethodAccessor228.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
>         at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
>         at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
>         at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
>         at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
>         at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475)
>         at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397)
>         at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
>         at 
> org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255)
>         at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
>         at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
>         at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
>         at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
>         at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
>         at 
> org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
>         at 
> org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234)
>         at 
> org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)
>         at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
>         ... 50 more
> {noformat}
> *Possible reason:*
> Datanodes send reports with negative used space. 
> On the datanode side {{CachingSpaceUsageSource}} has a method 
> {{decrementUsedSpace}} which decrements total used size by a particular 
> value. The total used size is in turn taken from a delegate 
> {{SpaceUsageSource}}. If the latter fails to return a value, 
> {{decrementUsedSpace}} will subtract from 0, hence a negative value.
> A failure to get used size might be caused by smth like:
> {noformat}
> 2024-03-21 13:30:11,613 [DiskUsage-/data3-
> ] WARN org.apache.hadoop.hdds.fs.CachingSpaceUsageSource: Error refreshing 
> space usage for /data3
> java.io.UncheckedIOException: ExitCodeException exitCode=1: du: cannot read 
> directory '/data3/lost+found': Permission denied
>         at org.apache.hadoop.hdds.fs.DU$DUShell.getUsed(DU.java:94)
>         at 
> org.apache.hadoop.hdds.fs.AbstractSpaceUsageSource.time(AbstractSpaceUsageSource.java:56)
>         at org.apache.hadoop.hdds.fs.DU.getUsedSpace(DU.java:63)
>         at 
> org.apache.hadoop.hdds.fs.CachingSpaceUsageSource.refresh(CachingSpaceUsageSource.java:140)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: ExitCodeException exitCode=1: du: cannot read directory 
> '/data3/lost+found': Permission denied
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1068)
>         at org.apache.hadoop.util.Shell.run(Shell.java:957)
>         at org.apache.hadoop.hdds.fs.DU$DUShell.getUsed(DU.java:91)
>         ... 10 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to