Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Pranith Kumar Karampuri
On Tue, Nov 1, 2016 at 7:32 AM, Lian, George (Nokia - CN/Hangzhou) <
george.l...@nokia.com> wrote:

> Hi,
>
>
>
> I will test it with your patches and update to you when I have result.
>

Hi George,
  Please use http://review.gluster.org/#/c/15757/2, i.e. the second version
of Raghavendra's patch. I tested it and it worked fine. We are still trying
to figure out quick-read and readdir-ahead as I type this mail.
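
For readers following the write-behind theory quoted further down in this mail (a cached 4KB write followed by a lookup/readdirp), here is a minimal toy model in Python of why a lookup that bypasses write-behind can report a size that is too small. It is only a sketch under assumptions: `ToyWriteBehind`, `brick_size` and both lookup methods are hypothetical names, and this is neither the GlusterFS code nor the logic of the patch mentioned above.

```python
class ToyWriteBehind:
    """Toy model: writes are acknowledged before they reach the brick."""

    def __init__(self, brick_size=0):
        self.brick_size = brick_size   # size the brick currently knows about
        self.pending = []              # (offset, length) of cached writes

    def write(self, offset, length):
        # The application gets success immediately; the data stays cached.
        self.pending.append((offset, length))

    def flush(self):
        # Cached writes finally reach the brick.
        for offset, length in self.pending:
            self.brick_size = max(self.brick_size, offset + length)
        self.pending.clear()

    def lookup_passthrough(self):
        # No lookup/readdirp handling: the brick's answer is returned as is.
        return self.brick_size

    def lookup_aware(self):
        # Hypothetical fix: account for cached writes in the reported size.
        size = self.brick_size
        for offset, length in self.pending:
            size = max(size, offset + length)
        return size


wb = ToyWriteBehind(brick_size=8192)
wb.write(offset=8192, length=4096)     # the 4KB write from the theory
print(wb.lookup_passthrough())         # 8192: the cached write is missing
print(wb.lookup_aware())               # 12288: the cached write is included
```

In this toy model the pass-through lookup hands the kernel a size that is 4KB short until the cached write is flushed, which is the situation the theory describes.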


>
> Thanks a lots
>
>
>
> Best Regards,
>
> George
>
>
>
> *From:* Pranith Kumar Karampuri [mailto:pkara...@redhat.com]
> *Sent:* Monday, October 31, 2016 11:23 AM
> *To:* Lian, George (Nokia - CN/Hangzhou) 
> *Cc:* Raghavendra Gowdappa ; Zhang, Bingxuan (Nokia
> - CN/Hangzhou) ; Gluster-devel@gluster.org;
> Zizka, Jan (Nokia - CZ/Prague) 
>
> *Subject:* Re: [Gluster-devel] Issue about the size of fstat is less than
> the really size of the syslog file
>
>
>
> Removing i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com, it is causing
> mail delivery problems for me.
>
> George,
>
>  Raghavendra and I made some progress on this issue. We were in
> parallel working on another issue which is similar where elastic search
> indices are getting corrupted because of wrong stat sizes in our opinion.
> So I have been running different translator stacks in identifying the
> problematic xlators which are leading to indices corruption.
>
>   We found the list to be 1) Write-behind, 2) Quick-read, 3)
> Readdir-ahead. Raghavendra and I just had a chat and we are suspecting that
> lack of lookup/readdirp implementation in write-behind could be the reason
> for this problem. Similar problems may exist in other two xlators too. But
> we are working on write-behind with priority.
>
> Our theory is this:
>
> If we do, for example, a 4KB write that is cached in write-behind, and we
> then do a lookup on the file or a readdirp on the directory containing it,
> we send a wrong stat value to the kernel. There are different caches between
> the kernel and gluster, which may mean that an fstat never reaches
> write-behind. So we need to make sure that we don't get into this situation.
>
> Action items:
>
>  At the moment Raghavendra is working on a patch to implement
> lookup/readdirp in write-behind. I am going to test the same for elastic
> search. Will it be possible for you to test your application against the
> same patch and confirm that the patch fixes the problem?
>
>
>
> On Fri, Oct 28, 2016 at 12:08 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
> hi George,
>
>It would help if we can identify the bare minimum xlators which are
> contributing to the issue like Raghavendra was mentioning earlier. We were
> wondering if it is possible for you to help us in identifying the issue by
> running the workload on a modified setup? We can suggest testing out using
> custom volfiles so that we can slowly build the graph which could be
> causing this issue. We would like you guys to try out this problem with
> just posix-xlator and fuse and nothing else.
>
>
>
> On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) <
> george.l...@nokia.com> wrote:
>
> Hi, Raghavendra,
>
> Could you please give some suggestion for this issue? we try to find the
> clue for this issue for a long time, but it has no progress:(
>
> Thanks & Best Regards,
> George
>
> -Original Message-
> From: Lian, George (Nokia - CN/Hangzhou)
> Sent: Wednesday, October 19, 2016 4:40 PM
> To: 'Raghavendra Gowdappa' 
> Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
> i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan (Nokia
> - CN/Hangzhou) ; Zizka, Jan (Nokia - CZ/Prague)
> 
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
> the really size of the syslog file
>
> Hi, Raghavendra
>
> Just now, we test it with glusterfs log with debug-level "TRACE", and let
> some application trigger "glusterfs" produce large log, in that case, when
> we set write-behind and stat-prefetch both OFF,
> Tail the glusterfs log such like mnt-{VOLUME-NAME}.log, it still failed
> with "file truncated",
>
> So that means if file's IO in huge amount, the issue will still be there
> even write-behind and stat-prefetch both OFF.
>
> Best Regards,
> George
>
> -Original Message-
> From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
>
> Sent: Wednesday, October 19, 2016 2:54 PM
> To: Lian, George (Nokia - CN/Hangzhou) 
> Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
> i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan (Nokia
> - CN/Hangzhou) ; Zizka, Jan (Nokia - CZ/Prague)
> 
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
> the really size of the syslog file
>
>
>
> - 

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "George Lian (Nokia - CN/Hangzhou)" 
> Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" 
> , "Bingxuan Zhang (Nokia
> - CN/Hangzhou)" , Gluster-devel@gluster.org, "Jan 
> Zizka (Nokia - CZ/Prague)"
> 
> Sent: Tuesday, November 1, 2016 7:46:47 AM
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the 
> really size of the syslog file
> 
> 
> 
> - Original Message -
> > From: "George Lian (Nokia - CN/Hangzhou)" 
> > To: "Raghavendra Gowdappa" , "Jan Zizka (Nokia -
> > CZ/Prague)" , "Bingxuan
> > Zhang (Nokia - CN/Hangzhou)" 
> > Cc: "Pranith Kumar Karampuri" ,
> > "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> > ,
> > Gluster-devel@gluster.org
> > Sent: Tuesday, November 1, 2016 6:35:10 AM
> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > Hi, Raghavendra,
> > 
> > Thanks a lots for your update!
> > 
> > >IIUC, the "tail issue" can happen if 'tail -f' reads a stat with st_size
> > >lesser than previously read value (and hence the complaint - file
> > >truncated). In this case, even though fstat at T2 doesn't account the
> > >write
> > >at T0, it doesn't prove that st_size of fstat at T2 is lesser than that at
> > >any time before T2.
> > 
> > I just mean the st_size of fstat maybe less than the previously read value
> > in
> > that time, and it will lead to the "tail truncated" issue. Do you agree
> > with
> > me?
> 
> Yes. But in your example there is only one fstat. For this to happen we need
> at least two fstats, and the later st_size must be smaller than an earlier
> one. Am I missing anything here?
> 
> > 
> > >As to the relative ordering of write at T0 and fstat at T2, POSIX leaves
> > >it
> > >undefined. Unless write and fstat happen from same
> > >thread/single-threaded-application there is no requirement for maintaining
> > >that order (If they are issued from same thread fstat should account write
> > >at T0). Also note that it is not mentioned here fstat at T2 is issued
> > >_after_ write at T0 is _complete_. If that is the case, mdc_writev_cbk
> > >would've updated correct stat in cache and fstat would get correct value.
> > >If it is not the case, then there is no well defined order here.
> > 
> > >So, I don't think there is a bug here, unless I've missed out something.
> > 
> > Do you mean the GlusterFS not conflict with the requirement, so that the
> > application like "tail" should consider the case in network file system?
> 
> No. Applications shouldn't have to do anything different to work on GlusterFS.
> Otherwise it's a bug :). What I am saying is that the issue with 'tail -f'
> might be caused by a different bug than the one in your example. In other
> words, the RCA you posted may not be correct. It might be because of issues
> with write-behind (and other xlators), as I posted in the other mail.
> 
> Preliminary testing by Pranith showed that Elasticsearch works fine with just
> write-behind. 

with patch http://review.gluster.org/15757 applied.

> So, that's progress. We will keep you posted on our efforts
> to get Elasticsearch working on Gluster. I have a feeling that it will
> solve your issue (tail -f) too.
> 
> regards,
> Raghavendra
> 
> > 
> > @Jan & @Bingxuan, do you have some comments for the above information?
> > 
> > 
> > Best Regards,
> > George
> > 
> > -Original Message-
> > From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
> > Sent: Monday, October 31, 2016 6:35 PM
> > To: Lian, George (Nokia - CN/Hangzhou) 
> > Cc: Pranith Kumar Karampuri ;
> > I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> > ; Zhang, Bingxuan (Nokia
> > -
> > CN/Hangzhou) ; Gluster-devel@gluster.org; Zizka,
> > Jan (Nokia - CZ/Prague) 
> > Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > 
> > 
> > - Original Message -
> > > From: "George Lian (Nokia - CN/Hangzhou)" 
> > > To: "Pranith Kumar Karampuri" , "Raghavendra
> > > Gowdappa"
> > > 
> > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> > > , "Bingxuan Zhang
> > > (Nokia
> > > - CN/Hangzhou)" , Gluster-devel@gluster.org,
> > > "Jan
> > > Zizka (Nokia - CZ/Prague)"
> > > 
> > > Sent: Monday, October 31, 2016 2:32:34 PM
> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
> > > the
> > > really size of 

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Raghavendra Gowdappa


- Original Message -
> From: "George Lian (Nokia - CN/Hangzhou)" 
> To: "Raghavendra Gowdappa" , "Jan Zizka (Nokia - 
> CZ/Prague)" , "Bingxuan
> Zhang (Nokia - CN/Hangzhou)" 
> Cc: "Pranith Kumar Karampuri" , 
> "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> , Gluster-devel@gluster.org
> Sent: Tuesday, November 1, 2016 6:35:10 AM
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the 
> really size of the syslog file
> 
> Hi, Raghavendra,
> 
> Thanks a lots for your update!
> 
> >IIUC, the "tail issue" can happen if 'tail -f' reads a stat with st_size
> >lesser than previously read value (and hence the complaint - file
> >truncated). In this case, even though fstat at T2 doesn't account the write
> >at T0, it doesn't prove that st_size of fstat at T2 is lesser than that at
> >any time before T2.
> 
> I just mean the st_size of fstat maybe less than the previously read value in
> that time, and it will lead to the "tail truncated" issue. Do you agree with
> me?

Yes. But in your example there is only one fstat. For this to happen we need
at least two fstats, and the later st_size must be smaller than an earlier one.
Am I missing anything here?
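
To make the two-fstat condition concrete, here is a small Python sketch of the kind of check a follower such as 'tail -f' performs. `first_shrink` is a hypothetical helper and the real tail implementation is more involved; the point is only that a single observation can never trigger the complaint.

```python
import os
import tempfile


def first_shrink(sizes):
    """Return the index of the first size smaller than its predecessor, or None."""
    for i in range(1, len(sizes)):
        if sizes[i] < sizes[i - 1]:
            return i
    return None


# One fstat alone can never look like a truncation.
print(first_shrink([4096]))                 # None

# A stale but non-decreasing sequence is also fine.
print(first_shrink([4096, 4096, 8192]))     # None

# Demonstrate with real fstat calls on a temporary file.
with tempfile.TemporaryFile() as f:
    observed = []
    f.write(b"x" * 8192)
    f.flush()
    observed.append(os.fstat(f.fileno()).st_size)
    f.truncate(4096)                        # a genuine truncation
    observed.append(os.fstat(f.fileno()).st_size)
    print(first_shrink(observed))           # 1
```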

> 
> >As to the relative ordering of write at T0 and fstat at T2, POSIX leaves it
> >undefined. Unless write and fstat happen from same
> >thread/single-threaded-application there is no requirement for maintaining
> >that order (If they are issued from same thread fstat should account write
> >at T0). Also note that it is not mentioned here fstat at T2 is issued
> >_after_ write at T0 is _complete_. If that is the case, mdc_writev_cbk
> >would've updated correct stat in cache and fstat would get correct value.
> >If it is not the case, then there is no well defined order here.
> 
> >So, I don't think there is a bug here, unless I've missed out something.
> 
> Do you mean the GlusterFS not conflict with the requirement, so that the
> application like "tail" should consider the case in network file system?

No. Applications shouldn't have to do anything different to work on GlusterFS.
Otherwise it's a bug :). What I am saying is that the issue with 'tail -f' might
be caused by a different bug than the one in your example. In other words, the
RCA you posted may not be correct. It might be because of issues with
write-behind (and other xlators), as I posted in the other mail.

Preliminary testing by Pranith showed that Elasticsearch works fine with just
write-behind. So, that's progress. We will keep you posted on our efforts to
get Elasticsearch working on Gluster. I have a feeling that it will solve
your issue (tail -f) too.

regards,
Raghavendra

> 
> @Jan & @Bingxuan, do you have some comments for the above information?
> 
> 
> Best Regards,
> George
> 
> -Original Message-
> From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
> Sent: Monday, October 31, 2016 6:35 PM
> To: Lian, George (Nokia - CN/Hangzhou) 
> Cc: Pranith Kumar Karampuri ;
> I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> ; Zhang, Bingxuan (Nokia -
> CN/Hangzhou) ; Gluster-devel@gluster.org; Zizka,
> Jan (Nokia - CZ/Prague) 
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
> really size of the syslog file
> 
> 
> 
> - Original Message -
> > From: "George Lian (Nokia - CN/Hangzhou)" 
> > To: "Pranith Kumar Karampuri" , "Raghavendra Gowdappa"
> > 
> > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> > , "Bingxuan Zhang (Nokia
> > - CN/Hangzhou)" , Gluster-devel@gluster.org, "Jan
> > Zizka (Nokia - CZ/Prague)"
> > 
> > Sent: Monday, October 31, 2016 2:32:34 PM
> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > Hi,
> > 
> > I suppose there seems a defect on mdc_writev_cbk  and mdc_fstat
> > Let’s assume in 2 timestamp which called write and fstat operation in
> > application:
> > T0:  write (process a)
> > T1: read (process b) with the data of T0 of process a.
> > T2: fstat   (process c)
> > In my view, mdc_write is non-block operation and have some lock to protect
> > in
> > afr xlator,  because mdc_fstat not check the lock in AFR xaltor, so
> > mdc_writev_cbk which called “mdc_inode_iatt_set_validate” maybe later than
> > mdc_fstat.
> > Such like
> > T3: fstat result of T2  without the “mdc_inode_iatt_set_validate” of T0
> > when
> > stat-prefetch options is on.
> > T4: “mdc_inode_iatt_set_validate” is called of T0 in mdc_writev_cbk.
> > 
> > Lets’ assume T0 > 

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Lian, George (Nokia - CN/Hangzhou)
Hi, Raghavendra,

Thanks a lot for your update!

>IIUC, the "tail issue" can happen if 'tail -f' reads a stat with st_size 
>lesser than previously read value (and hence the complaint - file truncated). 
>In this case, even though fstat at T2 doesn't account the write at T0, it 
>doesn't prove that st_size of fstat at T2 is lesser than that at any time 
>before T2.

I just mean that the st_size returned by fstat may be smaller than the
previously read value at that time, which leads to the "tail truncated" issue.
Do you agree with me?

>As to the relative ordering of write at T0 and fstat at T2, POSIX leaves it 
>undefined. Unless write and fstat happen from same 
>thread/single-threaded-application there is no requirement for maintaining 
>that order (If they are issued from same thread fstat should account write at 
>T0). Also note that it is not mentioned here fstat at T2 is issued _after_ 
>write at T0 is _complete_. If that is the case, mdc_writev_cbk would've 
>updated correct stat in cache and fstat would get correct value. If it is not 
>the case, then there is no well defined order here.

>So, I don't think there is a bug here, unless I've missed out something.

Do you mean that GlusterFS does not conflict with the requirement, so an
application like "tail" should handle this case itself on a network file system?

@Jan & @Bingxuan, do you have any comments on the above information?


Best Regards,
George

-Original Message-
From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com] 
Sent: Monday, October 31, 2016 6:35 PM
To: Lian, George (Nokia - CN/Hangzhou) 
Cc: Pranith Kumar Karampuri ; 
I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS 
; Zhang, Bingxuan (Nokia - 
CN/Hangzhou) ; Gluster-devel@gluster.org; Zizka, Jan 
(Nokia - CZ/Prague) 
Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the 
really size of the syslog file



- Original Message -
> From: "George Lian (Nokia - CN/Hangzhou)" 
> To: "Pranith Kumar Karampuri" , "Raghavendra Gowdappa" 
> 
> Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" 
> , "Bingxuan Zhang (Nokia
> - CN/Hangzhou)" , Gluster-devel@gluster.org, "Jan 
> Zizka (Nokia - CZ/Prague)"
> 
> Sent: Monday, October 31, 2016 2:32:34 PM
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the 
> really size of the syslog file
> 
> Hi,
> 
> I suppose there seems a defect on mdc_writev_cbk  and mdc_fstat
> Let’s assume in 2 timestamp which called write and fstat operation in
> application:
> T0:  write (process a)
> T1: read (process b) with the data of T0 of process a.
> T2: fstat   (process c)
> In my view, mdc_write is non-block operation and have some lock to protect in
> afr xlator,  because mdc_fstat not check the lock in AFR xaltor, so
> mdc_writev_cbk which called “mdc_inode_iatt_set_validate” maybe later than
> mdc_fstat.
> Such like
> T3: fstat result of T2  without the “mdc_inode_iatt_set_validate” of T0 when
> stat-prefetch options is on.
> T4: “mdc_inode_iatt_set_validate” is called of T0 in mdc_writev_cbk.
> 
> Lets’ assume T0 in multi-process environment and the load of CPU is high?
> If it is reasonable, then issue of “tail issue” will be happened.

IIUC, the "tail issue" can happen if 'tail -f' reads a stat with an st_size
smaller than a previously read value (hence the complaint "file truncated").
In this case, even though the fstat at T2 doesn't account for the write at T0,
that doesn't prove that the st_size returned at T2 is smaller than one returned
at any time before T2.

As to the relative ordering of the write at T0 and the fstat at T2, POSIX
leaves it undefined. Unless the write and the fstat are issued from the same
thread (or a single-threaded application), there is no requirement to maintain
that order (if they are issued from the same thread, the fstat should account
for the write at T0). Also note that your example does not say that the fstat
at T2 is issued _after_ the write at T0 is _complete_. If that is the case,
mdc_writev_cbk would have updated the stat in the cache and the fstat would
get the correct value. If it is not the case, then there is no well-defined
order here.

So, I don't think there is a bug here, unless I've missed out something.
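
The argument above can be illustrated with a toy stat cache in Python. `ToyStatCache` and its methods are hypothetical and only loosely modeled on the idea that the write callback (mdc_writev_cbk in this discussion) updates the cached stat; this is a sketch, not the md-cache implementation.

```python
class ToyStatCache:
    """Toy model: the cached size changes only when a write callback runs."""

    def __init__(self, size=0):
        self.cached_size = size

    def write_started(self, offset, length):
        # The write is in flight; the cache is not touched yet.
        return offset + length

    def write_completed(self, new_end):
        # Stands in for the write callback updating the cached stat.
        self.cached_size = max(self.cached_size, new_end)

    def fstat(self):
        return self.cached_size


cache = ToyStatCache(size=4096)
seen = [cache.fstat()]                  # before T0
end = cache.write_started(4096, 4096)   # T0: write issued, not yet complete
seen.append(cache.fstat())              # T2: stale, but not smaller
cache.write_completed(end)              # T4: callback updates the cache
seen.append(cache.fstat())              # a later fstat sees the write
print(seen)                             # [4096, 4096, 8192]
```

The value seen at T2 is stale, yet the sequence never decreases, which is why this interleaving alone would not make 'tail -f' report a truncated file.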


> 
> So maybe a fix suggestion is on mdc_fstat operation , we should add an
> operation to check whether the writev operation is ongoing or not, if
> write-operation is ongoing, should goto uncached label in mdc_fstat
> function.
> 
> Could you please confirm the above assumption and suggestion?
> 
> 
> Thanks & Best Regards,
> George
> 
> 
> From: Lian, George (Nokia - CN/Hangzhou)
> Sent: Monday, October 31, 2016 4:25 PM
> To: Pranith Kumar Karampuri ; Raghavendra Gowdappa
> 

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Lian, George (Nokia - CN/Hangzhou)
Hi,

How can we enable debug.trace so that we can inspect the debug data for the
different xlators?
I set “debug.trace on” and “debug.log-file yes”, but that does not seem to work.

One more update on this issue: if we set performance.stat-prefetch to off,
the issue does not occur. (Our previous test may not have been correct ☺)

Thanks & Best Regards,
George

From: Pranith Kumar Karampuri [mailto:pkara...@redhat.com]
Sent: Friday, October 28, 2016 2:39 PM
To: Lian, George (Nokia - CN/Hangzhou) 
Cc: Raghavendra Gowdappa ; 
I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS 
; Zhang, Bingxuan (Nokia - 
CN/Hangzhou) ; Gluster-devel@gluster.org; Zizka, Jan 
(Nokia - CZ/Prague) 
Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the 
really size of the syslog file

hi George,
   It would help if we can identify the bare minimum xlators which are 
contributing to the issue like Raghavendra was mentioning earlier. We were 
wondering if it is possible for you to help us in identifying the issue by 
running the workload on a modified setup? We can suggest testing out using 
custom volfiles so that we can slowly build the graph which could be causing 
this issue. We would like you guys to try out this problem with just 
posix-xlator and fuse and nothing else.

On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) wrote:
Hi, Raghavendra,

Could you please give some suggestions for this issue? We have been trying to
find a clue for a long time, but have made no progress :(

Thanks & Best Regards,
George

-Original Message-
From: Lian, George (Nokia - CN/Hangzhou)
Sent: Wednesday, October 19, 2016 4:40 PM
To: 'Raghavendra Gowdappa'
Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS;
Zhang, Bingxuan (Nokia - CN/Hangzhou); Zizka, Jan (Nokia - CZ/Prague)
Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the 
really size of the syslog file

Hi, Raghavendra

Just now we tested with the glusterfs log at debug level "TRACE" and had an
application make "glusterfs" produce a large log. In that case, with
write-behind and stat-prefetch both set to OFF, tailing the glusterfs log
(e.g. mnt-{VOLUME-NAME}.log) still failed with "file truncated".

So if the file sees a huge amount of IO, the issue is still there even with
write-behind and stat-prefetch both OFF.

Best Regards,
George

-Original Message-
From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
Sent: Wednesday, October 19, 2016 2:54 PM
To: Lian, George (Nokia - CN/Hangzhou)
Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS;
Zhang, Bingxuan (Nokia - CN/Hangzhou); Zizka, Jan (Nokia - CZ/Prague)
Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the 
really size of the syslog file



- Original Message -
> From: "George Lian (Nokia - CN/Hangzhou)"
> To: "Raghavendra Gowdappa"
> Cc: Gluster-devel@gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS",
> "Bingxuan Zhang (Nokia - CN/Hangzhou)", "Jan Zizka (Nokia - CZ/Prague)"
> Sent: Wednesday, October 19, 2016 12:05:01 PM
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the 
> really size of the syslog file
>
> Hi, Raghavendra,
>
> Thanks a lot for your quick update!
> In my case, many processes are writing to the syslog file. The writers are
> on the same host and write through the same mount point from which the
> tail (reader) is reading.
>
> The bug, I guess, is this:
> When a writer writes data with write-behind, the callback function
> "mdc_writev_cbk" is called, which calls "mdc_inode_iatt_set_validate" to
> validate the "iatt" data, but with the code I mentioned in my last mail it
> does nothing.

mdc_inode_iatt_set_validate has following code


if 

Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Gandalf Corvotempesta
2016-10-31 12:40 GMT+01:00 Lindsay Mathieson :
> But you can broadcast with UDP - one packet of data through one nic to all
> nodes, so in theory you could broadcast 1GB *per nic* or 3GB via three nics.
> Minus overhead for acks, nacks and ordering :)
>
> But I'm not sure it would work at all in practice now through a switch.

I don't like this idea.
I still prefer properly configured bonding. There is a bonding mode
that does exactly this.
Probably balance-xor and balance-tlb could also do the trick.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Jeff Darcy
Another thought that just occurred to me: security.  There's no 
broadcast/multicast equivalent of TLS, so you're not going to have that 
protection.  Maybe it doesn't matter in some kinds of deployments, but in 
others it would matter very much.  Also, a similarly-secure broadcast/multicast 
protocol would be a really awesome research project.


Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Lindsay Mathieson

On 31/10/2016 5:56 PM, Gandalf Corvotempesta wrote:
>> I'd like to experiment with broadcast udp to see if it's feasible in
>> local networks. It would be amazing if we could write at 1GB speeds
>> simultaneously to all nodes.
>
> If you have replica 3 and set up a 3 nic bonded interface with
> balance-alb on the gluster client, you are able to use the 3 nics
> simultaneously, writing at 1gb on each node.

But you can broadcast with UDP - one packet of data through one nic to
all nodes, so in theory you could broadcast 1GB *per nic* or 3GB via
three nics. Minus overhead for acks, nacks and ordering :)

But I'm not sure it would work at all in practice now through a switch.

--
Lindsay Mathieson



Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Lindsay Mathieson

On 31/10/2016 5:56 PM, Gandalf Corvotempesta wrote:
> If you have replica 3 and set up a 3 nic bonded interface with
> balance-alb on the gluster client, you are able to use the 3 nics
> simultaneously, writing at 1gb on each node.

Actually all you need is two nics, so each node can use 2 nics for the
other two nodes.

I actually have three nics per node, currently two bonded with
balance-alb per node, and I do indeed max out a 1G connection with Jumbo
frames. A VM tops out at 120MB/s in seq writes.

I did experiment with 3 nics bonded with balance-rr and managed to get
2.4Gb/s throughput; balance-rr doesn't do too well with bonds bigger than 2.

Unfortunately I need a private IP for gluster and a bridge for the VMs,
and I could only get OpenVSwitch to bond three nics to a bridge and an
extra IP, and OVS doesn't support balance-alb.


--
Lindsay Mathieson



Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra G" 
> To: "Lindsay Mathieson" 
> Cc: "Gluster Devel" 
> Sent: Monday, October 31, 2016 11:45:15 AM
> Subject: Re: [Gluster-devel] Custom Transport layers
> 
> 
> 
> On Fri, Oct 28, 2016 at 6:20 PM, Lindsay Mathieson <
> lindsay.mathie...@gmail.com > wrote:
> 
> 
> Is it possible to write custom transport layers for gluster?, data transfer,
> not the management protocols. Pointers to the existing code and/or docs :)
> would be helpful
> 
> 
> I'd like to experiment with broadcast udp to see if its feasible in local
> networks.
> 
> Another thing to consider here is the ordering of messages sent over the
> transport. I know UDP doesn't guarantee ordering, and I assume broadcast UDP
> doesn't either, but I may be wrong. If it doesn't, you have to build ordering
> logic on top of it. If the transport layer doesn't provide ordering, we
> cannot reason about the consistency of data stored on the filesystem.

Last time when I thought about UDP vs TCP, I seemed to have stumbled upon 
use-cases where maintaining ordering of messages was necessary. However, now 
that I think more about it, higher layers (like write-behind, open-behind etc) 
maintain order wherever required. So, I am not sure whether ordering of 
messages is a primary requirement when choosing a transport. I'll post more if 
I find anything worthwhile.
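
As an illustration of what maintaining order in a higher layer could look like on top of an unordered transport such as UDP, here is a minimal reorder buffer in Python. `ReorderBuffer` and the per-message sequence numbers are assumptions made for this sketch; nothing here reflects how Gluster's transports or xlators are implemented, and loss and retransmission are ignored.

```python
class ReorderBuffer:
    """Deliver messages in sequence order even if they arrive out of order."""

    def __init__(self):
        self.next_seq = 0      # next sequence number to deliver
        self.held = {}         # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one message; return the payloads that become deliverable."""
        if seq < self.next_seq or seq in self.held:
            return []          # duplicate, drop it
        self.held[seq] = payload
        ready = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
delivered = []
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c"), (2, "c")]:
    delivered.extend(buf.receive(seq, payload))
print(delivered)               # ['a', 'b', 'c', 'd']
```

Whether such logic belongs in the transport or in the layers above it is exactly the open question in this mail; the sketch only shows that ordering can be recovered above a transport that does not provide it.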

> 
> 
> It would be amazing if we could write at 1GB speeds simultaneously to all
> nodes.
> 
> 
> Alternatively let me know if this has been tried and discarded as a bad idea
> ...
> 
> thanks,
> 
> --
> Lindsay Mathieson
> 
> 
> 
> 
> --
> Raghavendra G
> 


Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Raghavendra Gowdappa


- Original Message -
> From: "George Lian (Nokia - CN/Hangzhou)" 
> To: "Pranith Kumar Karampuri" , "Raghavendra Gowdappa" 
> 
> Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" 
> , "Bingxuan Zhang (Nokia
> - CN/Hangzhou)" , Gluster-devel@gluster.org, "Jan 
> Zizka (Nokia - CZ/Prague)"
> 
> Sent: Monday, October 31, 2016 2:32:34 PM
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the 
> really size of the syslog file
> 
> Hi,
> 
> I suppose there seems a defect on mdc_writev_cbk  and mdc_fstat
> Let’s assume in 2 timestamp which called write and fstat operation in
> application:
> T0:  write (process a)
> T1: read (process b) with the data of T0 of process a.
> T2: fstat   (process c)
> In my view, mdc_write is non-block operation and have some lock to protect in
> afr xlator,  because mdc_fstat not check the lock in AFR xaltor, so
> mdc_writev_cbk which called “mdc_inode_iatt_set_validate” maybe later than
> mdc_fstat.
> Such like
> T3: fstat result of T2  without the “mdc_inode_iatt_set_validate” of T0 when
> stat-prefetch options is on.
> T4: “mdc_inode_iatt_set_validate” is called of T0 in mdc_writev_cbk.
> 
> Lets’ assume T0 in multi-process environment and the load of CPU is high?
> If it is reasonable, then issue of “tail issue” will be happened.

IIUC, the "tail issue" can happen if 'tail -f' reads a stat with an st_size
smaller than a previously read value (hence the complaint "file truncated").
In this case, even though the fstat at T2 doesn't account for the write at T0,
that doesn't prove that the st_size returned at T2 is smaller than one returned
at any time before T2.

As to the relative ordering of the write at T0 and the fstat at T2, POSIX
leaves it undefined. Unless the write and the fstat are issued from the same
thread (or a single-threaded application), there is no requirement to maintain
that order (if they are issued from the same thread, the fstat should account
for the write at T0). Also note that your example does not say that the fstat
at T2 is issued _after_ the write at T0 is _complete_. If that is the case,
mdc_writev_cbk would have updated the stat in the cache and the fstat would
get the correct value. If it is not the case, then there is no well-defined
order here.

So, I don't think there is a bug here, unless I've missed out something.


> 
> So maybe a fix suggestion is on mdc_fstat operation , we should add an
> operation to check whether the writev operation is ongoing or not, if
> write-operation is ongoing, should goto uncached label in mdc_fstat
> function.
> 
> Could you please confirm the above assumption and suggestion?
> 
> 
> Thanks & Best Regards,
> George
> 
> 
> From: Lian, George (Nokia - CN/Hangzhou)
> Sent: Monday, October 31, 2016 4:25 PM
> To: Pranith Kumar Karampuri ; Raghavendra Gowdappa
> 
> Cc: I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> ; Zhang, Bingxuan (Nokia -
> CN/Hangzhou) ; Gluster-devel@gluster.org; Zizka,
> Jan (Nokia - CZ/Prague) 
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
> really size of the syslog file
> 
> Hi,
> 
> How can we enable debug.trace so that we can inspect the debug data for the
> different xlators?
> I set “debug.trace on” and “debug.log-file yes”, but that does not seem to work.
> 
> One more update on this issue: if we set performance.stat-prefetch to off,
> the issue does not occur. (Our previous test may not have been correct ☺)
> 
> Thanks & Best Regards,
> George
> 
> From: Pranith Kumar Karampuri [mailto:pkara...@redhat.com]
> Sent: Friday, October 28, 2016 2:39 PM
> To: Lian, George (Nokia - CN/Hangzhou)
> Cc: Raghavendra Gowdappa; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS;
> Zhang, Bingxuan (Nokia - CN/Hangzhou); Gluster-devel@gluster.org;
> Zizka, Jan (Nokia - CZ/Prague)
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
> really size of the syslog file
> 
> hi George,
>It would help if we can identify the bare minimum xlators which are
>contributing to the issue like Raghavendra was mentioning earlier. We
>were wondering if it is possible for you to help us in identifying
>the issue by running the workload on a modified setup? We can suggest
>testing out using custom volfiles so that we can slowly build the
>graph which could be causing this issue. We would like 

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-10-31 Thread Lian, George (Nokia - CN/Hangzhou)
Hi,

I suspect there is a defect in mdc_writev_cbk and mdc_fstat.
Let's assume the application issues write and fstat operations at the
following timestamps:
T0: write (process a)
T1: read (process b), returning the data written at T0 by process a.
T2: fstat (process c)
In my view, mdc_writev is a non-blocking operation that is protected by a lock
in the AFR xlator. Because mdc_fstat does not check that lock in the AFR
xlator, mdc_writev_cbk, which calls "mdc_inode_iatt_set_validate", may run
later than mdc_fstat.
For example:
T3: the fstat from T2 returns without the "mdc_inode_iatt_set_validate" of T0
having run, when the stat-prefetch option is on.
T4: "mdc_inode_iatt_set_validate" for T0 is called in mdc_writev_cbk.
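
The fix suggested for this timeline (have fstat bypass the cache while a write is still in flight) could be sketched as follows. This is a toy model only: the class, the writes_in_flight counter and the method names are hypothetical and do not correspond to the md-cache source. It also assumes that the uncached path would see the new size, which a real overlapping write does not guarantee.

```python
class ToyStatCacheWithInflightCheck:
    """Hypothetical model of 'go uncached while a writev is ongoing'."""

    def __init__(self, size):
        self.backend_size = size   # size stored on the backend
        self.cached_size = size    # size held in the stat cache
        self.writes_in_flight = 0  # writes wound but not yet replied to

    def write_wind(self, nbytes):
        # T0: the write is sent down; its reply is still pending.
        self.writes_in_flight += 1
        self.backend_size += nbytes

    def write_cbk(self):
        # T4: the write reply arrives and the cached stat is refreshed.
        self.cached_size = self.backend_size
        self.writes_in_flight -= 1

    def fstat(self):
        # T2/T3: skip the cache while any write is still in flight.
        if self.writes_in_flight > 0:
            return self.backend_size  # uncached path
        return self.cached_size


cache = ToyStatCacheWithInflightCheck(size=100)
cache.write_wind(4096)
during = cache.fstat()   # answered from the backend, not the stale cache
cache.write_cbk()
after = cache.fstat()    # answered from the refreshed cache
print(during, after)     # 4196 4196
```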

Lets’ assume T0; Raghavendra Gowdappa 

Cc: I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS; Zhang, Bingxuan (Nokia -
CN/Hangzhou); Gluster-devel@gluster.org; Zizka, Jan (Nokia - CZ/Prague)
Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
really size of the syslog file

Hi,

How can we enable debug.trace so that we can inspect the debug data of the
different xlators?
I set "debug.trace on" and "debug.log-file yes", but that does not seem to
work.

One more update on this issue: if we set performance.stat-prefetch to off, the
issue does not occur. (Our previous test may not have been correct ☺)

Thanks & Best Regards,
George

From: Pranith Kumar Karampuri [mailto:pkara...@redhat.com]
Sent: Friday, October 28, 2016 2:39 PM
To: Lian, George (Nokia - CN/Hangzhou)
Cc: Raghavendra Gowdappa; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS; Zhang, Bingxuan
(Nokia - CN/Hangzhou); Gluster-devel@gluster.org; Zizka, Jan (Nokia -
CZ/Prague)
Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
really size of the syslog file

hi George,
   It would help if we can identify the bare minimum xlators which are 
contributing to the issue like Raghavendra was mentioning earlier. We were 
wondering if it is possible for you to help us in identifying the issue by 
running the workload on a modified setup? We can suggest testing out using 
custom volfiles so that we can slowly build the graph which could be causing 
this issue. We would like you guys to try out this problem with just 
posix-xlator and fuse and nothing else.

On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) wrote:
Hi, Raghavendra,

Could you please give some suggestions for this issue? We have been trying to
find a clue for a long time, but have made no progress :(

Thanks & Best Regards,
George

-Original Message-
From: Lian, George (Nokia - CN/Hangzhou)
Sent: Wednesday, October 19, 2016 4:40 PM
To: 'Raghavendra Gowdappa'
Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS; Zhang,
Bingxuan (Nokia - CN/Hangzhou); Zizka, Jan (Nokia - CZ/Prague)
Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
really size of the syslog file

Hi, Raghavendra

Just now, we tested with the glusterfs log at debug level "TRACE" and had an
application trigger glusterfs to produce a large log. In that case, with
write-behind and stat-prefetch both OFF, tailing the glusterfs log (e.g.
mnt-{VOLUME-NAME}.log) still failed with "file truncated".

So that means that if the file's I/O volume is huge, the issue is still there
even with write-behind and stat-prefetch both OFF.
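
For context on why a too-small stat size shows up as "file truncated": a follower such as tail remembers how far it has read and compares that with the size reported by fstat. The sketch below is a simplified assumption about that check, not the actual tail source; looks_truncated and the "stale" size are made up for illustration.

```python
import os
import tempfile


def looks_truncated(last_offset, reported_size):
    # A follower that has already read up to last_offset treats a smaller
    # reported size as a truncation of the file.
    return reported_size < last_offset


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        last_offset = f.tell()                    # follower has read 10 bytes
        real_size = os.fstat(f.fileno()).st_size  # size from a correct fstat
        stale_size = real_size - 4                # hypothetical stale stat reply
        return (looks_truncated(last_offset, real_size),
                looks_truncated(last_offset, stale_size))


print(demo())  # (False, True)
```

With a correct size the follower keeps reading; with a size smaller than what it has already read, it concludes the file shrank even though nothing was truncated.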

Best Regards,
George

-Original Message-
From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
Sent: Wednesday, October 19, 2016 2:54 PM
To: Lian, George (Nokia - 

Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Gandalf Corvotempesta
On 28 Oct 2016 2:50 PM, "Lindsay Mathieson" wrote:
>
> I'd like to experiment with broadcast udp to see if it's feasible in local
> networks. It would be amazing if we could write at 1GB speeds
> simultaneously to all nodes.
>

If you have replica 3 and set up a 3-NIC bonded interface with balance-alb on
the gluster client, you are able to use the 3 NICs simultaneously, writing
at 1Gb to each node.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Raghavendra G
On Fri, Oct 28, 2016 at 6:20 PM, Lindsay Mathieson <
lindsay.mathie...@gmail.com> wrote:

> Is it possible to write custom transport layers for gluster?, data
> transfer, not the management protocols. Pointers to the existing code
> and/or docs :) would be helpful
>
>
> I'd like to experiment with broadcast udp to see if its feasible in local
> networks.


Another thing to consider here is the ordering of messages sent over the
transport. I know udp doesn't guarantee ordering, and I assume broadcast udp
doesn't either, but I may be wrong. If it doesn't, you have to build ordering
logic on top of it. If the transport layer doesn't provide ordering, we cannot
reason about the consistency of data stored on the filesystem.
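
As a rough idea of what "ordering logic on top" could look like, here is a minimal sketch of a receive-side reorder buffer keyed by sequence number. ReorderBuffer and its methods are hypothetical names, not part of gluster's transport code, and the sketch assumes the sender numbers messages from 0. It does not deal with lost datagrams, which would need retransmission on top.

```python
class ReorderBuffer:
    """Deliver payloads in sequence order even if datagrams arrive shuffled."""

    def __init__(self):
        self.next_seq = 0   # next sequence number to hand to the upper layer
        self.pending = {}   # out-of-order payloads, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one datagram; return the payloads now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate datagram, drop it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
delivered = []
for seq, payload in [(1, "b"), (0, "a"), (2, "c")]:
    delivered.extend(buf.receive(seq, payload))
print(delivered)  # ['a', 'b', 'c']
```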


> It would be amazing if we could write at 1GB speeds simultaneously to all
> nodes.
>
>
> Alternatively let me know if this has been tried and discarded as a bad
> idea ...
>
> thanks,
>
> --
> Lindsay Mathieson
>
>



-- 
Raghavendra G