Re: [Nfs-ganesha-devel] async dispatch not good

2017-06-19 Thread Matt Benjamin
Hi,

- Original Message -
> From: "William Allen Simpson" 
> To: "Matt Benjamin" 
> Cc: "NFS Ganesha Developers" 
> Sent: Monday, June 19, 2017 10:03:53 PM
> Subject: Re: [Nfs-ganesha-devel] async dispatch not good
> 
> On 6/19/17 3:41 PM, Matt Benjamin wrote:
> > it's not about memory; this is the problem we're trying to avoid
> > 
> > but, referring for context to our verbal discussion earlier today, your
> > suggestion to hybridize the existing output side (which depends on
> > blocking sockets) and an async input side using recv() seems worth at
> > least exploring; I assume you are proposing to use recv() with MSG_DONTWAIT?
> 
> Yes.  Linux does support MSG_DONTWAIT, and it should be possible to try
> the recv()/writev() hybrid approach.  At least there's one Oracle
> article that says it works.
> 
> The underlying problem is that epoll really isn't a good design.  What we
> need for speed is callbacks that tell us that the read/write is done,
> not signals that there might be more data pending -- which cause us to
> do more system calls to find out.  System calls are the problem.

They have latency, sure.

> 
> kqueue is a much better design.  We should try to get kqueue support in
> the Linux kernel.  That would aid portability, too.

You're welcome to try; it seems political, though.

> 
> But what I'm doing right now is backing out my previous attempt.  Even
> after dumping the mass of code, there's an awful lot of hooks to undo.

Sorry.

> 
> My thought now is it's better to get the big changes in, then work on
> TCP I-O re-write separately (as I was doing for UDP and RDMA).  Quick
> and dirty shims, but only temporarily.

One of the key goals I have is read-frags-ahead/non-blocking decode.  It has 
been at the top of the queue since our initial meetings.  It seems like your 
recv() technique should work.  

> 
> While I'm thinking about it, why does Ganesha call svc_reg()?  AFAICT,
> that's just filling in a tree that is never used anymore.
> 
> Can I remove that code in Ganesha?  It's a pain to maintain in ntirpc.

If it's no longer effective, then eventually, sure.  Is it a substantial help 
to your work?

Matt

> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] async dispatch not good

2017-06-19 Thread William Allen Simpson

On 6/19/17 3:41 PM, Matt Benjamin wrote:

it's not about memory; this is the problem we're trying to avoid

but, referring for context to our verbal discussion earlier today, your 
suggestion to hybridize the existing output side (which depends on blocking 
sockets) and an async input side using recv() seems worth at least exploring; 
I assume you are proposing to use recv() with MSG_DONTWAIT?
Yes.  Linux does support MSG_DONTWAIT, and it should be possible to try
the recv()/writev() hybrid approach.  At least there's one Oracle
article that says it works.
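
A minimal sketch of that input side, assuming a hypothetical helper (this is
not the actual ntirpc code, just the shape of the idea):

#include <errno.h>
#include <sys/socket.h>

/* Try to pull up to len bytes of an RPC record fragment without ever
 * blocking.  Returns the byte count (0 == peer closed), -2 when the
 * socket has nothing pending, -1 on a hard error. */
static ssize_t
recv_nowait(int fd, void *buf, size_t len)
{
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

        if (n >= 0)
                return n;       /* data, or 0 for EOF */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
                return -2;      /* nothing pending; re-arm epoll, no retry */
        return -1;              /* real error; caller inspects errno */
}

The output side would keep the existing blocking writev() path, so only the
reader changes.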

The underlying problem is that epoll really isn't a good design.  What we
need for speed is callbacks that tell us that the read/write is done,
not signals that there might be more data pending -- which cause us to
do more system calls to find out.  System calls are the problem.

kqueue is a much better design.  We should try to get kqueue support in
the Linux kernel.  That would aid portability, too.

But what I'm doing right now is backing out my previous attempt.  Even
after dumping the mass of code, there's an awful lot of hooks to undo.

My thought now is it's better to get the big changes in, then work on
TCP I-O re-write separately (as I was doing for UDP and RDMA).  Quick
and dirty shims, but only temporarily.

While I'm thinking about it, why does Ganesha call svc_reg()?  AFAICT,
that's just filling in a tree that is never used anymore.

Can I remove that code in Ganesha?  It's a pain to maintain in ntirpc.



Re: [Nfs-ganesha-devel] UDP VSOCK?

2017-06-19 Thread William Allen Simpson

On 6/19/17 3:44 PM, Matt Benjamin wrote:

There is no UDP vsock; it's always a stream socket.  This could be done 
differently, as desired.


Good, 'cause I've already submitted the patch.



Also, VSOCK only needs to support NFS v3 and v4, not the other programs?

But I could be wrong?


This question still needs to be answered.  My unpublished big patch removes
VSOCK support for anything other than v3 and v4.  I've got no way to test.



Re: [Nfs-ganesha-devel] Mount connection timeout

2017-06-19 Thread Frank Filz
That’s just LRU running and not finding any work to do. I’m not sure those 
messages should be LogDebug, maybe they should be LogFullDebug.



Frank



From: Supriti Singh [mailto:supriti.si...@suse.com]
Sent: Monday, June 19, 2017 4:47 AM
To: nfs-ganesha-devel@lists.sourceforge.net
Subject: [Nfs-ganesha-devel] Mount connection timeout



I am using nfs-ganesha v2.5-final + CephFS FSAL. I have noticed that when I 
mount, sometimes the first mount attempt fails with a connection timeout, and 
it succeeds in later attempts.

In the event of a timeout, the log contains the following lines many times:

ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :After work, 
open_fd_count:0  count:5 fdrate:1 threadwait=90
ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :FD count is 0 and low 
water mark is 2048: not reaping.


Can someone please explain what the possible reason could be?

Thanks,
Supriti

--

Supriti Singh

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,

HRB 21284 (AG Nürnberg)







[Nfs-ganesha-devel] IRC Channel

2017-06-19 Thread Frank Filz
I'm not sure how well it's known, but we do have a #ganesha channel on FreeNode
where many of the developers and a few users hang out. We are pretty
friendly and willing to answer questions.

Frank






[Nfs-ganesha-devel] Change in ffilz/nfs-ganesha[next]: Determine size of ACLs to be encoded when getting ACLs

2017-06-19 Thread GerritHub
From :

madhu.punj...@in.ibm.com has uploaded this change for review. ( 
https://review.gerrithub.io/366139 )


Change subject: Determine size of ACLs to be encoded when getting ACLs
..

Determine size of ACLs to be encoded when getting ACLs

If nfs4_getfacl is run by the client when there is a large number of
ACLs, and the server is not able to fit all of the ACLs in the buffer
of size NFS4_ATTRVALS_BUFFLEN, then the client gets an Input/Output
error.
To avoid this, we now calculate at run time the size required by the
ACLs, to determine the size of the buffer where the ACLs will be
encoded.

Change-Id: I4ace8223abe2c2957f6e40cb619a1b838ca78677
Signed-off-by: Madhu Thorat 
---
M src/Protocols/NFS/nfs_proto_tools.c
1 file changed, 16 insertions(+), 2 deletions(-)
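
The idea, roughly, is to walk the ACEs and sum their XDR-encoded sizes before
choosing the buffer, rather than assuming NFS4_ATTRVALS_BUFFLEN is enough.  A
hypothetical sketch only -- the actual change in nfs_proto_tools.c may compute
this differently:

#include <stddef.h>
#include <string.h>

struct ace_example {
        const char *who;        /* e.g. "OWNER@" or "user@domain" */
        /* type, flag and access mask are fixed-size uint32 fields */
};

/* Each NFSv4 ACE encodes as three 4-byte words (type, flag, access mask)
 * plus an XDR opaque "who": 4-byte length + data padded to a multiple of
 * 4.  The array itself is prefixed by a 4-byte ACE count. */
static size_t
acl_encoded_size(const struct ace_example *aces, unsigned int naces)
{
        size_t sz = 4;                          /* ACE count */
        unsigned int i;

        for (i = 0; i < naces; i++) {
                size_t wholen = strlen(aces[i].who);

                sz += 3 * 4;                    /* type, flag, access mask */
                sz += 4 + ((wholen + 3) & ~(size_t)3);  /* opaque who */
        }
        return sz;
}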



  git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha 
refs/changes/39/366139/1
-- 
To view, visit https://review.gerrithub.io/366139
To unsubscribe, visit https://review.gerrithub.io/settings

Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4ace8223abe2c2957f6e40cb619a1b838ca78677
Gerrit-Change-Number: 366139
Gerrit-PatchSet: 1
Gerrit-Owner: madhu.punj...@in.ibm.com


Re: [Nfs-ganesha-devel] Mount connection timeout

2017-06-19 Thread Matt Benjamin
yeah

- Original Message -
> From: "Frank Filz" 
> To: "Supriti Singh" , 
> nfs-ganesha-devel@lists.sourceforge.net
> Sent: Monday, June 19, 2017 9:46:32 AM
> Subject: Re: [Nfs-ganesha-devel] Mount connection timeout
> 
> 
> 
> That’s just LRU running and not finding any work to do. I’m not sure those
> messages should be LogDebug, maybe they should be LogFullDebug.
> 
> 
> 
> Frank
> 
> 
> 
> 
> From: Supriti Singh [mailto:supriti.si...@suse.com]
> Sent: Monday, June 19, 2017 4:47 AM
> To: nfs-ganesha-devel@lists.sourceforge.net
> Subject: [Nfs-ganesha-devel] Mount connection timeout
> 
> 
> 
> 
> I am using nfs-ganesha v2.5-final + CephFS FSAL. I have noticed that when I
> mount, sometimes the first mount attempt fails with a connection timeout, and
> it succeeds in later attempts.
> 
> In the event of a timeout, the log contains the following lines many times:
> 
> ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :After work,
> open_fd_count:0 count:5 fdrate:1 threadwait=90
> ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :FD count is 0 and low
> water mark is 2048: not reaping.
> 
> 
> Can someone please explain what the possible reason could be?
> 
> Thanks,
> Supriti
> 
> 
> --
> 
> 
> Supriti Singh
> 
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> 
> 
> HRB 21284 (AG Nürnberg)
> 
> 
> 
> 
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] async dispatch not good

2017-06-19 Thread Matt Benjamin
Hi Bill,

inline

- Original Message -
> From: "William Allen Simpson" 
> To: "NFS Ganesha Developers" 
> Sent: Monday, June 19, 2017 3:04:52 AM
> Subject: [Nfs-ganesha-devel] async dispatch not good
> 
> As folks may have noticed, I've been re-working my old 2015 dispatch
> patches that eliminate the network input-side queues in Ganesha.
> 
> Matt had wanted fully async non-blocking I-O.  I've been poking at it
> for a week, and now am sure that's the wrong way to go.

I don't think so, but see below.

> 
> It might still be good for FSALs.  Remains to be seen.  DanG and
> Soumya are looking at that now.
> 
> The devil in userland network I-O is system calls.  Each epoll_wait
> is a system call.  Each read or write is a system call.  Each thread
> switch is a system call.
> 
> My code in Ganesha v2.5 (NTIRPC v1.5) gets the network output down to
> one system call per request on a very hot thread.  Cannot do better,
> as trying harder would just push the data into kernel buffers,
> possibly slowing our own output (for various reasons).
> 
> Trying to re-work that for async non-blocking calls instead means
> many more system calls.  Instead of one clean writev with the TCP
> fragment header and all ready buffers in one single call, we'd at
> minimum have a call, an epoll_wait, spawn another work thread, then
> another call and/or release the buffer, rinse and repeat.

the expensive part of this (the spawn) is necessary only due to aspects of the old 
design, but, considering the effort, OK; see below

> 
> For a long buffer chain (the times we want more performance), we'd
> have much less performance -- roughly 2 + (3 * number of buffers)
> additional system calls.  For common short response chains, still
> have the extra overhead of the epoll system call, doubling calls.
> 
> Also, using writev minimizes buffer copies.  Eliminating data
> copying will usually give far better performance.
> 
> The only thing async output is saving is waiting threads.  But I've
> already got the output threads down to the minimum (per interface).
> No gain here!
> 
> On the input side, the truly optimum reduction in system calls would
> be one read to get the TCP fragment header and up to 1500 bytes of
> data, followed (only when needed) by another read to get the entire
> rest of long fragments in one fell swoop.

well, maybe, not considering blocking?  I think we really do want to avoid 
blocking in the paths that now can/do, but see below

> 
> With async input I've tried level triggered, and am getting spurious
> epoll read data signals.  Googling shows that's been a problem since
> at least 2014, but possible to program around.

ok

> 
> Still, this could be better, had it not been terrible for output-side.
> 
> Changing to edge triggered means that every good read would be
> followed by another read to make sure that we've gotten all the data.
> That is, common small reads turn into two (2) reads.  Doubling our
> system calls in the common case is not the way to go
> 
> In conclusion, with epoll we know when input data is available, so
> input threads aren't sitting around waiting anyway, and trying to
> minimize threads results in more system calls and poorer performance.
> 
> NTIRPC already defaults to 200 worker threads.  If we need more, we
> should allocate more.  Memory should not be an issue.

it's not about memory; this is the problem we're trying to avoid

but, referring for context to our verbal discussion earlier today, your 
suggestion to hybridize the existing output side (which depends on blocking 
sockets) and an async input side using recv() seems worth at least exploring; 
I assume you are proposing to use recv() with MSG_DONTWAIT?

Matt

> 
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] UDP VSOCK?

2017-06-19 Thread Matt Benjamin
There is no UDP vsock; it's always a stream socket.  This could be done 
differently, as desired.

Matt

- Original Message -
> From: "William Allen Simpson" 
> To: "NFS Ganesha Developers" 
> Sent: Friday, June 16, 2017 4:38:09 PM
> Subject: [Nfs-ganesha-devel] UDP VSOCK?
> 
> Tried to talk to DanG today, but he went home earlier than usual.  So
> maybe somebody else knows:
> 
> void Create_SVCXPRTs(void)
> {
>   protos p;
> 
>   LogFullDebug(COMPONENT_DISPATCH, "Allocation of the SVCXPRT");
>   for (p = P_NFS; p < P_COUNT; p++)
>   if (nfs_protocol_enabled(p)) {
>   Create_udp(p);
>   Create_tcp(p);
>   }
> #ifdef RPC_VSOCK
>   if (vsock)
>   create_vsock();
> #endif /* RPC_VSOCK */
> }
> 
> This creates a UDP VSOCK fd, a TCP VSOCK fd, and then another TCP VSOCK
> fd.  I'm fairly sure the current code won't work properly for the
> UDP VSOCK, and I'm fairly sure that two TCP VSOCKs won't be used.
> 
> Also, VSOCK only needs to support NFS v3 and v4, not the other programs?
> 
> But I could be wrong?
> 
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



[Nfs-ganesha-devel] async dispatch not good

2017-06-19 Thread William Allen Simpson

As folks may have noticed, I've been re-working my old 2015 dispatch
patches that eliminate the network input-side queues in Ganesha.

Matt had wanted fully async non-blocking I-O.  I've been poking at it
for a week, and now am sure that's the wrong way to go.

It might still be good for FSALs.  Remains to be seen.  DanG and
Soumya are looking at that now.

The devil in userland network I-O is system calls.  Each epoll_wait
is a system call.  Each read or write is a system call.  Each thread
switch is a system call.

My code in Ganesha v2.5 (NTIRPC v1.5) gets the network output down to
one system call per request on a very hot thread.  Cannot do better,
as trying harder would just push the data into kernel buffers,
possibly slowing our own output (for various reasons).

Trying to re-work that for async non-blocking calls instead means
many more system calls.  Instead of one clean writev with the TCP
fragment header and all ready buffers in one single call, we'd at
minimum have a call, an epoll_wait, spawn another work thread, then
another call and/or release the buffer, rinse and repeat.
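
For reference, the current output path amounts to something like the following
(hypothetical names; the real svc_vc code differs in detail, and a blocking
socket is assumed):

#include <arpa/inet.h>
#include <stdint.h>
#include <sys/uio.h>

#define LAST_FRAG 0x80000000u   /* RPC record-marking "last fragment" bit */

/* Send one reply: the 4-byte record marker plus every already-encoded
 * response buffer, all in a single writev() -- one system call. */
static ssize_t
send_reply(int fd, const struct iovec *bufs, int nbufs, uint32_t payload_len)
{
        struct iovec iov[1 + nbufs];    /* VLA, fine for a sketch */
        uint32_t marker = htonl(LAST_FRAG | payload_len);
        int i;

        iov[0].iov_base = &marker;
        iov[0].iov_len = sizeof(marker);
        for (i = 0; i < nbufs; i++)
                iov[i + 1] = bufs[i];

        return writev(fd, iov, nbufs + 1);
}

Making this non-blocking would mean handling a short write, re-arming epoll,
and finishing the chain later, which is where the extra system calls come from.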

For a long buffer chain (exactly when we want more performance), we'd
have much less performance -- roughly 2 + (3 * number of buffers)
additional system calls.  For common short response chains, we'd still
have the extra overhead of the epoll system call, doubling the calls.

Also, using writev minimizes buffer copies.  Eliminating data
copying will usually give far better performance.

The only thing async output is saving is waiting threads.  But I've
already got the output threads down to the minimum (per interface).
No gain here!

On the input side, the truly optimum reduction in system calls would
be one read to get the TCP fragment header and up to 1500 bytes of
data, followed (only when needed) by another read to get the entire
rest of long fragments in one fell swoop.

With async input I've tried level-triggered, and am getting spurious
epoll read-data signals.  Googling shows that's been a problem since
at least 2014, but it is possible to program around.

Still, this could have been better, had it not been terrible for the output side.

Changing to edge-triggered means that every good read would be
followed by another read to make sure that we've gotten all the data.
That is, common small reads turn into two (2) reads.  Doubling our
system calls in the common case is not the way to go.
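
Sketched out, the edge-triggered drain looks like this (a hypothetical
fragment, not the ntirpc event loop):

#include <errno.h>
#include <sys/socket.h>

/* With EPOLLET we must keep calling recv() until it returns EAGAIN, or
 * risk never being woken again for data that is already queued.  So even
 * when the first recv() gets the whole request, a second one follows. */
static void
drain_edge_triggered(int fd, char *buf, size_t buflen)
{
        for (;;) {
                ssize_t n = recv(fd, buf, buflen, MSG_DONTWAIT);

                if (n > 0)
                        continue;       /* process n bytes ... then read again */
                if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                        break;          /* truly drained; epoll will re-signal */
                break;                  /* n == 0 (peer closed) or a hard error */
        }
}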

In conclusion, with epoll we know when input data is available, so
input threads aren't sitting around waiting anyway, and trying to
minimize threads results in more system calls and poorer performance.

NTIRPC already defaults to 200 worker threads.  If we need more, we
should allocate more.  Memory should not be an issue.



[Nfs-ganesha-devel] Mount connection timeout

2017-06-19 Thread Supriti Singh
I am using nfs-ganesha v2.5-final + CephFS FSAL. I have noticed that when I 
mount, sometimes the first mount attempt fails
with a connection timeout, and it succeeds in later attempts. 

In the event of a timeout, the log contains the following lines many times:

ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :After work, 
open_fd_count:0  count:5 fdrate:1 threadwait=90
ganesha.nfsd-4274[cache_lru] lru_run :INODE LRU :DEBUG :FD count is 0 and low 
water mark is 2048: not reaping.


Can someone please explain what the possible reason could be? 

Thanks,
Supriti 

--
Supriti Singh
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)



