Re: GIT get corrupted on lustre

2013-02-04 Thread Eric Chamberland

Hi,

On 01/23/2013 01:34 PM, Sébastien Boisvert wrote:

Hello,

Here is a patch (with git format-patch) that removes any timer if
NO_SETITIMER is set.



Even with the patch, I finally got an error... :-/

Here are the log (strace -f) of a clean execution and one with the error:

http://www.giref.ulaval.ca/~ericc/NO_SETITIMER-patch_bin_git_noerror.txt.gz

http://www.giref.ulaval.ca/~ericc/NO_SETITIMER-patch_bin_git_with_error.txt.gz

Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Sébastien Boisvert

On 01/22/2013 05:14 PM, Thomas Rast wrote:

Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:


So, hum, do we have some sort of conclusion?

Shall it be a fix for git to get around that lustre behavior?

If something can be done in git it would be great: it is a *lot*
easier to change git than the lustre filesystem software for a cluster
in running in production mode... (words from cluster team) :-/


I thought you already established that simply disabling the progress
display is a sufficient workaround?  If that doesn't help, you can try
patching out all use of SIGALRM within git.



In git (9591fcc6d66), I have found these SIGALRM signal handling:

builtin/log.c:268:  sigaction(SIGALRM, sa, NULL);
builtin/log.c:285:  signal(SIGALRM, SIG_IGN);
compat/mingw.c:1590:mingw_raise(SIGALRM);
compat/mingw.c:1666:if (sig != SIGALRM)
compat/mingw.c:1668:error(sigaction only implemented for 
SIGALRM);
compat/mingw.c:1683:case SIGALRM:
compat/mingw.c:1702:case SIGALRM:
compat/mingw.c:1706:exit(128 + SIGALRM);
compat/mingw.c:1708:timer_fn(SIGALRM);
compat/mingw.h:42:#define SIGALRM 14
perl/Git/SVN.pm:2121:   SIGALRM, SIGUSR1, SIGUSR2);
progress.c:56:  sigaction(SIGALRM, sa, NULL);
progress.c:68:  signal(SIGALRM, SIG_IGN);


I suppose that compat/mingw.{h,c} and SVN.pm can be ignored as our patch to work
around this problem won't be pushed upstream because the real problem is not in 
git, right ?

If I understand correctly, some VFS system calls get interrupted by SIGALRM, 
but when
they resume (via SA_RESTART) they return EINTR. Thomas said that these failed 
calls may need to be retried,
but that open(O_CREAT|O_EXCL) is still tricky around this case.


progress.c SIGALRM code paths are for progress and therefore are required, 
right ?

builtin/log.c SIGALRM code paths are for early output, and the comments in the 
code say that

   If we can get the whole output in less than a tenth of a second, don't even 
bother doing the
early-output thing.


So where do I start for the patch ?


Other than that I agree with Junio, from what we've seen so far, Lustre
returns EINTR on all sorts of calls that simply aren't allowed to do so.




--
---
Spécialiste en granularité (1 journée / semaine)
Calcul Québec / Calcul Canada
Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Sébastien Boisvert

[I forgot to subscribe to the git mailing list, sorry for that]

On 01/22/2013 05:14 PM, Thomas Rast wrote:

Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:


So, hum, do we have some sort of conclusion?

Shall it be a fix for git to get around that lustre behavior?

If something can be done in git it would be great: it is a *lot*
easier to change git than the lustre filesystem software for a cluster
in running in production mode... (words from cluster team) :-/


I thought you already established that simply disabling the progress
display is a sufficient workaround?  If that doesn't help, you can try
patching out all use of SIGALRM within git.



In git (9591fcc6d66), I have found these SIGALRM signal handling:

builtin/log.c:268:sigaction(SIGALRM, sa, NULL);
builtin/log.c:285:signal(SIGALRM, SIG_IGN);
compat/mingw.c:1590:mingw_raise(SIGALRM);
compat/mingw.c:1666:if (sig != SIGALRM)
compat/mingw.c:1668:error(sigaction only implemented for SIGALRM);
compat/mingw.c:1683:case SIGALRM:
compat/mingw.c:1702:case SIGALRM:
compat/mingw.c:1706:exit(128 + SIGALRM);
compat/mingw.c:1708:timer_fn(SIGALRM);
compat/mingw.h:42:#define SIGALRM 14
perl/Git/SVN.pm:2121:SIGALRM, SIGUSR1, SIGUSR2);
progress.c:56:sigaction(SIGALRM, sa, NULL);
progress.c:68:signal(SIGALRM, SIG_IGN);


I suppose that compat/mingw.{h,c} and SVN.pm can be ignored as our patch to work
around this problem won't be pushed upstream because the real problem is not in 
git, right ?

If I understand correctly, some VFS system calls get interrupted by SIGALRM, 
but when
they resume (via SA_RESTART) they return EINTR. Thomas said that these failed 
calls may need to be retried,
but that open(O_CREAT|O_EXCL) is still tricky around this case.


progress.c SIGALRM code paths are for progress and therefore are required, 
right ?

builtin/log.c SIGALRM code paths are for early output, and the comments in the 
code say that

   If we can get the whole output in less than a tenth of a second, don't even 
bother doing the
early-output thing.


So where do I start for the patch ?


Other than that I agree with Junio, from what we've seen so far, Lustre
returns EINTR on all sorts of calls that simply aren't allowed to do so.




--
---
Spécialiste en granularité (1 journée / semaine)
Calcul Québec / Calcul Canada
Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Erik Faye-Lund
On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 Other than that I agree with Junio, from what we've seen so far, Lustre
 returns EINTR on all sorts of calls that simply aren't allowed to do so.

I don't think this analysis is 100% accurate, POSIX allows error codes
to be generated other than those defined. From
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html:

Implementations may support additional errors not included in this
list, *may generate errors included in this list under circumstances
other than those described here*, or may contain extensions or
limitations that prevent some errors from occurring.

So I don't think Lustre violates POSIX by erroring with errno=EINTR,
but I also think guarding every single function call for EINTR just to
be safe to be insane :)

However, looking at Eric's log, I can't see that being what has
happened here - grepping it for EINTR does not produce a single match.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Thomas Rast
Erik Faye-Lund kusmab...@gmail.com writes:

 On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 Other than that I agree with Junio, from what we've seen so far, Lustre
 returns EINTR on all sorts of calls that simply aren't allowed to do so.

 I don't think this analysis is 100% accurate, POSIX allows error codes
 to be generated other than those defined. From
 http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html:

 Implementations may support additional errors not included in this
 list, *may generate errors included in this list under circumstances
 other than those described here*, or may contain extensions or
 limitations that prevent some errors from occurring.

That same page says, however:

  For functions under the Threads option for which [EINTR] is not listed
  as a possible error condition in this volume of IEEE Std 1003.1-2001,
  an implementation shall not return an error code of [EINTR].

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Erik Faye-Lund
On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Erik Faye-Lund kusmab...@gmail.com writes:

 On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 Other than that I agree with Junio, from what we've seen so far, Lustre
 returns EINTR on all sorts of calls that simply aren't allowed to do so.

 I don't think this analysis is 100% accurate, POSIX allows error codes
 to be generated other than those defined. From
 http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html:

 Implementations may support additional errors not included in this
 list, *may generate errors included in this list under circumstances
 other than those described here*, or may contain extensions or
 limitations that prevent some errors from occurring.

 That same page says, however:

   For functions under the Threads option for which [EINTR] is not listed
   as a possible error condition in this volume of IEEE Std 1003.1-2001,
   an implementation shall not return an error code of [EINTR].

Yes, but surely that's for pthreads functions, no? utime is not one of
those functions...
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Thomas Rast
Erik Faye-Lund kusmab...@gmail.com writes:

 On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Erik Faye-Lund kusmab...@gmail.com writes:

 POSIX allows error codes
 to be generated other than those defined. From
 http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html:

 Implementations may support additional errors not included in this
 list, *may generate errors included in this list under circumstances
 other than those described here*, or may contain extensions or
 limitations that prevent some errors from occurring.

 That same page says, however:

   For functions under the Threads option for which [EINTR] is not listed
   as a possible error condition in this volume of IEEE Std 1003.1-2001,
   an implementation shall not return an error code of [EINTR].

 Yes, but surely that's for pthreads functions, no? utime is not one of
 those functions...

Ah, my bad.  In fact in

  http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html

there is a paragraph Signal Effects on Other Functions, which says

  The most common behavior of an interrupted function after a
  signal-catching function returns is for the interrupted function to
  give an [EINTR] error unless the SA_RESTART flag is in effect for the
  signal. However, there are a number of specific exceptions, including
  sleep() and certain situations with read() and write().

  The historical implementations of many functions defined by IEEE Std
  1003.1-2001 are not interruptible[...]

  Functions not mentioned explicitly as interruptible may be so on some
  implementations, possibly as an extension where the function gives an
  [EINTR] error. There are several functions (for example, getpid(),
  getuid()) that are specified as never returning an error, which can
  thus never be extended in this way.

  If a signal-catching function returns while the SA_RESTART flag is in
  effect, an interrupted function is restarted at the point it was
  interrupted. Conforming applications cannot make assumptions about the
  internal behavior of interrupted functions, even if the functions are
  async-signal-safe. For example, suppose the read() function is
  interrupted with SA_RESTART in effect, the signal-catching function
  closes the file descriptor being read from and returns, and the read()
  function is then restarted; in this case the application cannot assume
  that the read() function will give an [EBADF] error, since read()
  might have checked the file descriptor for validity before being
  interrupted.

Taken together this should mean that the bug is in fact simply that the
calls do not *restart*.  They are (like you say) allowed to return EINTR
despite not being specified to, *but* SA_RESTART should restart it.

Now, does that make it a lustre bug or a glibc bug? :-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Erik Faye-Lund
On Wed, Jan 23, 2013 at 4:44 PM, Thomas Rast tr...@inf.ethz.ch wrote:
 Erik Faye-Lund kusmab...@gmail.com writes:

 On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast tr...@student.ethz.ch wrote:
 Erik Faye-Lund kusmab...@gmail.com writes:

 POSIX allows error codes
 to be generated other than those defined. From
 http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html:

 Implementations may support additional errors not included in this
 list, *may generate errors included in this list under circumstances
 other than those described here*, or may contain extensions or
 limitations that prevent some errors from occurring.

 That same page says, however:

   For functions under the Threads option for which [EINTR] is not listed
   as a possible error condition in this volume of IEEE Std 1003.1-2001,
   an implementation shall not return an error code of [EINTR].

 Yes, but surely that's for pthreads functions, no? utime is not one of
 those functions...

 Ah, my bad.  In fact in

   http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html

 there is a paragraph Signal Effects on Other Functions, which says

 snip

 Taken together this should mean that the bug is in fact simply that the
 calls do not *restart*.  They are (like you say) allowed to return EINTR
 despite not being specified to, *but* SA_RESTART should restart it.


Right, thanks for clearing that up.

 Now, does that make it a lustre bug or a glibc bug? :-)

That's kind of uninteresting, the important bit is that it is indeed a
bug (outside of Git).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Jonathan Nieder
Thomas Rast wrote:

 Taken together this should mean that the bug is in fact simply that the
 calls do not *restart*.  They are (like you say) allowed to return EINTR
 despite not being specified to, *but* SA_RESTART should restart it.

 Now, does that make it a lustre bug or a glibc bug? :-)

The kernel takes care of SA_RESTART, if I remember correctly.  (See
arch/x86/kernel/signal.c::handle_signal() case -ERESTARTSYS.)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-23 Thread Sébastien Boisvert

Hello,

Here is a patch (with git format-patch) that removes any timer if NO_SETITIMER 
is set.


Éric:

To test it with your workflow:

$ module load apps/git/1.8.1.1.348.g78eb407-NO_SETITIMER-patch

$ git clone ...


  Sébastien


On 01/22/2013 05:14 PM, Thomas Rast wrote:

Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:


So, hum, do we have some sort of conclusion?

Shall it be a fix for git to get around that lustre behavior?

If something can be done in git it would be great: it is a *lot*
easier to change git than the lustre filesystem software for a cluster
in running in production mode... (words from cluster team) :-/


I thought you already established that simply disabling the progress
display is a sufficient workaround?  If that doesn't help, you can try
patching out all use of SIGALRM within git.

Other than that I agree with Junio, from what we've seen so far, Lustre
returns EINTR on all sorts of calls that simply aren't allowed to do so.




--
---
Spécialiste en granularité (1 journée / semaine)
Calcul Québec / Calcul Canada
Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada
From 78eb4075d98eb9cdc57210c63b8d8de8a3d0cd9e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?S=C3=A9bastien=20Boisvert?= sebastien.boisv...@calculquebec.ca
Date: Wed, 23 Jan 2013 13:10:57 -0500
Subject: [PATCH] don't use timers if NO_SETITIMER is set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With NO_SETITIMER, the user experience on legacy Lustre is fixed,
but there is no early progress.

The patch has no effect on the resulting git executable if NO_SETITIMER is
not set (the default). So by default this patch has no effect at all, which
is good.

git tests:

$ make clean
$ make NO_SETITIMER=YesPlease
$ make test NO_SETITIMER=YesPlease  make-test.log

$ grep ^not ok make-test.log |grep -v # TODO known breakage|wc -l
0
$ grep ^ok make-test.log |wc -l
9531
$ grep ^not ok make-test.log |wc -l
65

No timers with NO_SETITIMER:

$ objdump -d ./git|grep setitimer|wc -l
0
$ objdump -d ./git|grep alarm|wc -l
0

Timers without NO_SETITIMER:

$ objdump -d /software/apps/git/1.8.1/bin/git|grep setitimer|wc -l
5
$ objdump -d /software/apps/git/1.8.1/bin/git|grep alarm|wc -l
0

Signed-off-by: Sébastien Boisvert sebastien.boisv...@calculquebec.ca
---
 builtin/log.c |7 +++
 daemon.c  |6 ++
 progress.c|8 
 upload-pack.c |2 ++
 4 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 8f0b2e8..f8321c7 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -198,7 +198,9 @@ static void show_early_header(struct rev_info *rev, const char *stage, int nr)
 	printf(_(Final output: %d %s\n), nr, stage);
 }
 
+#ifndef NO_SETITIMER
 static struct itimerval early_output_timer;
+#endif
 
 static void log_show_early(struct rev_info *revs, struct commit_list *list)
 {
@@ -240,9 +242,12 @@ static void log_show_early(struct rev_info *revs, struct commit_list *list)
 	 * trigger every second even if we're blocked on a
 	 * reader!
 	 */
+
+	#ifndef NO_SETITIMER
 	early_output_timer.it_value.tv_sec = 0;
 	early_output_timer.it_value.tv_usec = 50;
 	setitimer(ITIMER_REAL, early_output_timer, NULL);
+	#endif
 }
 
 static void early_output(int signal)
@@ -274,9 +279,11 @@ static void setup_early_output(struct rev_info *rev)
 	 *
 	 * This is a one-time-only trigger.
 	 */
+	#ifndef NO_SETITIMER
 	early_output_timer.it_value.tv_sec = 0;
 	early_output_timer.it_value.tv_usec = 10;
 	setitimer(ITIMER_REAL, early_output_timer, NULL);
+	#endif
 }
 
 static void finish_early_output(struct rev_info *rev)
diff --git a/daemon.c b/daemon.c
index 4602b46..eb82c19 100644
--- a/daemon.c
+++ b/daemon.c
@@ -611,9 +611,15 @@ static int execute(void)
 	if (addr)
 		loginfo(Connection from %s:%s, addr, port);
 
+	#ifndef NO_SETITIMER
 	alarm(init_timeout ? init_timeout : timeout);
+	#endif
+
 	pktlen = packet_read_line(0, line, sizeof(line));
+
+	#ifndef NO_SETITIMER
 	alarm(0);
+	#endif
 
 	len = strlen(line);
 	if (pktlen != len)
diff --git a/progress.c b/progress.c
index 3971f49..b84ccc7 100644
--- a/progress.c
+++ b/progress.c
@@ -45,7 +45,10 @@ static void progress_interval(int signum)
 static void set_progress_signal(void)
 {
 	struct sigaction sa;
+
+	#ifndef NO_SETITIMER
 	struct itimerval v;
+	#endif
 
 	progress_update = 0;
 
@@ -55,16 +58,21 @@ static void set_progress_signal(void)
 	sa.sa_flags = SA_RESTART;
 	sigaction(SIGALRM, sa, NULL);
 
+	#ifndef NO_SETITIMER
 	v.it_interval.tv_sec = 1;
 	v.it_interval.tv_usec = 0;
 	v.it_value = v.it_interval;
 	setitimer(ITIMER_REAL, v, NULL);
+	#endif
 }
 
 static void clear_progress_signal(void)
 {
+	#ifndef NO_SETITIMER
 	struct itimerval v = {{0,},};
 	setitimer(ITIMER_REAL, v, NULL);
+	#endif
+
 	signal(SIGALRM, SIG_IGN);
 	progress_update = 0;
 }
diff --git a/upload-pack.c b/upload-pack.c
index 95d8313..e0b8b32 100644
--- a/upload-pack.c

Re: GIT get corrupted on lustre

2013-01-22 Thread Eric Chamberland

So, hum, do we have some sort of conclusion?

Shall it be a fix for git to get around that lustre behavior?

If something can be done in git it would be great: it is a *lot* easier 
to change git than the lustre filesystem software for a cluster in 
running in production mode... (words from cluster team) :-/


I hope this subject will not die in the list... :-/

Thanks,

Eric



On 01/21/2013 02:29 PM, Thomas Rast wrote:

Please don't drop the Cc list!

Brian J. Murrell br...@interlinx.bc.ca writes:


What's odd is that while I cannot reproduce the original problem, there
seems to be another issue/bug with utime():


I wonder if this is related to http://jira.whamcloud.com/browse/LU-305.
  That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I
saw it on 2.1.1 and added a comment to the above ticket about that.


Aha, that's a very interesting bug report.  My observations support
yours: I managed to get EINTR during utime().


In the absence of it, wouldn't we in theory have to write a simple
loop-on-EINTR wrapper for *all* syscalls?


IIUC, that's what SA_RESTART is all about.


Yes, but there's precious little clear language on when SA_RESTART is
supposed to act.  In all cases?

The wording on

   http://www.delorie.com/gnu/docs/glibc/libc_485.html
   http://www.delorie.com/gnu/docs/glibc/libc_498.html

leads me to believe that SA_RESTART is actually used on the glibc side
of things, so that any glibc syscall wrapper not specifically equipped
with the restarting behavior would return EINTR unmodified.  This might
explain why utime() doesn't restart like it should (assuming we work on
the theory that POSIX doesn't allow an EINTR from utime() to begin
with).



--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-22 Thread Junio C Hamano
Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 So, hum, do we have some sort of conclusion?

 Shall it be a fix for git to get around that lustre behavior?

 If something can be done in git it would be great: it is a *lot*
 easier to change git than the lustre filesystem software for a cluster
 in running in production mode... (words from cluster team) :-/

Do we know the real cause of the symptom?  I did not follow the
thread carefully, but the impression I was getting was that the
filesystem is broken around EINTR, and even if you fixed Git,
your other more mission critical applications will be broken by
the same filesystem behaviour, no?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-22 Thread Thomas Rast
Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 So, hum, do we have some sort of conclusion?

 Shall it be a fix for git to get around that lustre behavior?

 If something can be done in git it would be great: it is a *lot*
 easier to change git than the lustre filesystem software for a cluster
 in running in production mode... (words from cluster team) :-/

I thought you already established that simply disabling the progress
display is a sufficient workaround?  If that doesn't help, you can try
patching out all use of SIGALRM within git.

Other than that I agree with Junio, from what we've seen so far, Lustre
returns EINTR on all sorts of calls that simply aren't allowed to do so.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-22 Thread Eric Chamberland

On 01/22/2013 05:14 PM, Thomas Rast wrote:

Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:


So, hum, do we have some sort of conclusion?

Shall it be a fix for git to get around that lustre behavior?

If something can be done in git it would be great: it is a *lot*
easier to change git than the lustre filesystem software for a cluster
in running in production mode... (words from cluster team) :-/


I thought you already established that simply disabling the progress
display is a sufficient workaround?  If that doesn't help, you can try
patching out all use of SIGALRM within git.



I tried that solution after Brian told me to try it, but it didn't 
solved the problem for me! :-(



Other than that I agree with Junio, from what we've seen so far, Lustre
returns EINTR on all sorts of calls that simply aren't allowed to do so.



Ok, so now the good move would be to have all this reported to lustre 
development team?  Brian, have you seen anything new from what you have 
already reported?  I have to admit that I am not a fs expert...


And I also agree with Junio point of view: The problem may impact 
mission critical applications


Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-21 Thread Erik Faye-Lund
On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland
eric.chamberl...@giref.ulaval.ca wrote:
 Good idea!

 I did a strace and here is the output with the error:

 http://www.giref.ulaval.ca/~ericc/strace_git_error.txt

 Hope it will be insightful!

This trace doesn't seem to contain child-processes, but instead having
their stderr inlined into the log. Try using strace -f instead...
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-21 Thread Thomas Rast
Erik Faye-Lund kusmab...@gmail.com writes:

 On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland
 eric.chamberl...@giref.ulaval.ca wrote:
 Good idea!

 I did a strace and here is the output with the error:

 http://www.giref.ulaval.ca/~ericc/strace_git_error.txt

 Hope it will be insightful!

 This trace doesn't seem to contain child-processes, but instead having
 their stderr inlined into the log. Try using strace -f instead...

I happen to have access to a lustre FS on the brutus cluster of ETH
Zurich, so I figured I could give it a shot.

What's odd is that while I cannot reproduce the original problem, there
seems to be another issue/bug with utime():

  $ strace -f -o ~/gc.trace git gc
  warning: failed utime() on .git/objects/68/tmp_obj_sCAEVc: Interrupted system 
call
  warning: failed utime() on .git/objects/a6/tmp_obj_3cdB2c: Interrupted system 
call
  warning: failed utime() on .git/objects/69/tmp_obj_lbU3Xc: Interrupted system 
call
  warning: failed utime() on .git/objects/c3/tmp_obj_EU97Wc: Interrupted system 
call
  warning: failed utime() on .git/objects/3e/tmp_obj_tb2j3c: Interrupted system 
call
  warning: failed utime() on .git/objects/15/tmp_obj_e6zMXc: Interrupted system 
call
  warning: failed utime() on .git/objects/54/tmp_obj_ExOJVc: Interrupted system 
call
  warning: failed utime() on .git/objects/e3/tmp_obj_GtPw4c: Interrupted system 
call
  warning: failed utime() on .git/objects/21/tmp_obj_Xex32c: Interrupted system 
call
  warning: failed utime() on .git/objects/1a/tmp_obj_CzwsZc: Interrupted system 
call
  warning: failed utime() on .git/objects/18/tmp_obj_o6fp3c: Interrupted system 
call
  warning: failed utime() on .git/objects/32/tmp_obj_Ih0G4c: Interrupted system 
call
  warning: failed utime() on .git/objects/41/tmp_obj_1RXV1c: Interrupted system 
call
  Counting objects: 137744, done.
  Delta compression using up to 48 threads.
  Compressing objects: 100% (36510/36510), done.
  Writing objects: 100% (137744/137744), done.
  Total 137744 (delta 101591), reused 135512 (delta 99472)

The trace is here (2.1MB compressed):

  http://thomasrast.ch/download/gc.trace.bz2

For the test I used a clone of another git.git I had around.  I think
the error is from sha1_file.c:2564.  While that doesn't look too
important (see ca11b212 for context), it does raise the question: what
other system calls that we never expect to EINTR can do this on
sufficiently arcane system/FS combinations?


Peff's test ran without any apparent issue for a few minutes.  I also
ran an extended version (at the end) that sets alarms, so as to actually
get interrupted.  That proved more interesting.  I had to fix verify()
and write_in_full() to account for EINTR in read()/write(), as those
seem likely to fail.  I also got link() to fail:

  $ ~/lustre-peff-reproducer 
  unable to create hard link: Interrupted system call
  unable to open index file: No such file or directory

but it took a long time.  Unfortunately, when running it with strace I
managed to panic the host I ran it on:

  $ strace -o ~/peff-reproducer.trace ~/lustre-peff-reproducer 

  Message from syslogd@brutus1 at Jan 21 17:09:43 ...   
 
   kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) ASSERTION( 
ols-ols_state == OLS_NEW ) failed: Impossible state: 4

  Message from syslogd@brutus1 at Jan 21 17:09:43 ...
   kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) LBUG

  Message from syslogd@brutus1 at Jan 21 17:09:43 ...
   kernel:Kernel panic - not syncing: LBUG

Yay for now having to explain this to the cluster team.


I tried finding a standard that limits the syscalls to which EINTR
applies, without too much success.  I'm not sure how far I should trust
my manpages, but while some of them explicitly list EINTR as a possible
error (read, write, etc.) link() does not.  (And the linux manpages
agree with the POSOIX ones for once.)

If somebody finds such a standard, we could of course use it to blame
lustre instead :-)

In the absence of it, wouldn't we in theory have to write a simple
loop-on-EINTR wrapper for *all* syscalls?

Of course there's the added problem that when open(O_CREAT|O_EXCL) fails
with EINTR, it's hard to tell whether a file that may now exist is
indeed yours or some other process's.

--- 8 
#include fcntl.h
#include unistd.h
#include stdlib.h
#include stdio.h
#include string.h
#include sys/time.h
#include signal.h
#include errno.h

struct itimerval itv;

static int randomize(unsigned char *buf, int len)
{
  int i;
  len = rand() % len;
  for (i = 0; i  len; i++)
buf[i] = rand()  0xff;
  return len;
}

static int check_eof(int fd)
{
  int ch;
  int r = read(fd, ch, 1);
  if (r  0) {
perror(read error after expected EOF);
return -1;
  }
  if (r  0) {
fprintf(stderr, extra byte after expected EOF);
return -1;
  }
  return 0;
}

static int verify(int fd, const unsigned char *buf, int len)
{
  while (len) {
char 

Re: GIT get corrupted on lustre

2013-01-21 Thread Maxime Boissonneault

Hi Thomas,
Can you tell me what is the version of the lustre servers and the lustre 
clients ?


Thanks,

Maxime Boissonneault
HPC specialist @ Calcul Québec.

Le 2013-01-21 11:11, Thomas Rast a écrit :

Erik Faye-Lund kusmab...@gmail.com writes:


On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland
eric.chamberl...@giref.ulaval.ca wrote:

Good idea!

I did a strace and here is the output with the error:

http://www.giref.ulaval.ca/~ericc/strace_git_error.txt

Hope it will be insightful!

This trace doesn't seem to contain child-processes, but instead having
their stderr inlined into the log. Try using strace -f instead...

I happen to have access to a lustre FS on the brutus cluster of ETH
Zurich, so I figured I could give it a shot.

What's odd is that while I cannot reproduce the original problem, there
seems to be another issue/bug with utime():

   $ strace -f -o ~/gc.trace git gc
   warning: failed utime() on .git/objects/68/tmp_obj_sCAEVc: Interrupted 
system call
   warning: failed utime() on .git/objects/a6/tmp_obj_3cdB2c: Interrupted 
system call
   warning: failed utime() on .git/objects/69/tmp_obj_lbU3Xc: Interrupted 
system call
   warning: failed utime() on .git/objects/c3/tmp_obj_EU97Wc: Interrupted 
system call
   warning: failed utime() on .git/objects/3e/tmp_obj_tb2j3c: Interrupted 
system call
   warning: failed utime() on .git/objects/15/tmp_obj_e6zMXc: Interrupted 
system call
   warning: failed utime() on .git/objects/54/tmp_obj_ExOJVc: Interrupted 
system call
   warning: failed utime() on .git/objects/e3/tmp_obj_GtPw4c: Interrupted 
system call
   warning: failed utime() on .git/objects/21/tmp_obj_Xex32c: Interrupted 
system call
   warning: failed utime() on .git/objects/1a/tmp_obj_CzwsZc: Interrupted 
system call
   warning: failed utime() on .git/objects/18/tmp_obj_o6fp3c: Interrupted 
system call
   warning: failed utime() on .git/objects/32/tmp_obj_Ih0G4c: Interrupted 
system call
   warning: failed utime() on .git/objects/41/tmp_obj_1RXV1c: Interrupted 
system call
   Counting objects: 137744, done.
   Delta compression using up to 48 threads.
   Compressing objects: 100% (36510/36510), done.
   Writing objects: 100% (137744/137744), done.
   Total 137744 (delta 101591), reused 135512 (delta 99472)

The trace is here (2.1MB compressed):

   http://thomasrast.ch/download/gc.trace.bz2

For the test I used a clone of another git.git I had around.  I think
the error is from sha1_file.c:2564.  While that doesn't look too
important (see ca11b212 for context), it does raise the question: what
other system calls that we never expect to EINTR can do this on
sufficiently arcane system/FS combinations?


Peff's test ran without any apparent issue for a few minutes.  I also
ran an extended version (at the end) that sets alarms, so as to actually
get interrupted.  That proved more interesting.  I had to fix verify()
and write_in_full() to account for EINTR in read()/write(), as those
seem likely to fail.  I also got link() to fail:

   $ ~/lustre-peff-reproducer
   unable to create hard link: Interrupted system call
   unable to open index file: No such file or directory

but it took a long time.  Unfortunately, when running it with strace I
managed to panic the host I ran it on:

   $ strace -o ~/peff-reproducer.trace ~/lustre-peff-reproducer

   Message from syslogd@brutus1 at Jan 21 17:09:43 ...
kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) ASSERTION( 
ols-ols_state == OLS_NEW ) failed: Impossible state: 4

   Message from syslogd@brutus1 at Jan 21 17:09:43 ...
kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) LBUG

   Message from syslogd@brutus1 at Jan 21 17:09:43 ...
kernel:Kernel panic - not syncing: LBUG

Yay for now having to explain this to the cluster team.


I tried finding a standard that limits the syscalls to which EINTR
applies, without too much success.  I'm not sure how far I should trust
my manpages, but while some of them explicitly list EINTR as a possible
error (read, write, etc.) link() does not.  (And the linux manpages
agree with the POSOIX ones for once.)

If somebody finds such a standard, we could of course use it to blame
lustre instead :-)

In the absence of it, wouldn't we in theory have to write a simple
loop-on-EINTR wrapper for *all* syscalls?

Of course there's the added problem that when open(O_CREAT|O_EXCL) fails
with EINTR, it's hard to tell whether a file that may now exist is
indeed yours or some other process's.

--- 8 
#include fcntl.h
#include unistd.h
#include stdlib.h
#include stdio.h
#include string.h
#include sys/time.h
#include signal.h
#include errno.h

struct itimerval itv;

static int randomize(unsigned char *buf, int len)
{
   int i;
   len = rand() % len;
   for (i = 0; i  len; i++)
 buf[i] = rand()  0xff;
   return len;
}

static int check_eof(int fd)
{
   int ch;
   int r = read(fd, ch, 1);
   if (r  0) {
 perror(read error after expected EOF);
 return -1;
   }
   if (r  0) 

Re: GIT get corrupted on lustre

2013-01-21 Thread Thomas Rast
Maxime Boissonneault maxime.boissonnea...@calculquebec.ca writes:

 Hi Thomas,
 Can you tell me what is the version of the lustre servers and the
 lustre clients ?

$ uname -a
Linux brutus4.ethz.ch 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 
2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/fs/lustre/version
lustre: 2.3.0
kernel: patchless_client
build:  2.3.0-RC6--PRISTINE-2.6.32-279.14.1.el6.x86_64

I have no idea what the servers are running, I only have client access.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-21 Thread Eric Chamberland

Hi,

It just happened again.  Now I have the strace -f output gzipped here:

http://www.giref.ulaval.ca/~ericc/strace-f_git_error.txt.gz

thanks,

Eric

On 01/21/2013 08:29 AM, Erik Faye-Lund wrote:

On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland
eric.chamberl...@giref.ulaval.ca wrote:

Good idea!

I did a strace and here is the output with the error:

http://www.giref.ulaval.ca/~ericc/strace_git_error.txt

Hope it will be insightful!


This trace doesn't seem to contain child-processes, but instead having
their stderr inlined into the log. Try using strace -f instead...



--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-21 Thread Eric Chamberland

On 01/21/2013 12:07 PM, Eric Chamberland wrote:

Hi,

It just happened again.  Now I have the strace -f output gzipped here:

http://www.giref.ulaval.ca/~ericc/strace-f_git_error.txt.gz



I added the strace -f output when non error occurs...

http://www.giref.ulaval.ca/~ericc/strace-f_git_no_error.txt.gz

a kdiff3 can show the differences just before the error...

Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-21 Thread Brian J. Murrell
On 13-01-21 11:11 AM, Thomas Rast wrote:
 
 What's odd is that while I cannot reproduce the original problem, there
 seems to be another issue/bug with utime():

I wonder if this is related to http://jira.whamcloud.com/browse/LU-305.
 That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I
saw it on 2.1.1 and added a comment to the above ticket about that.

 In the absence of it, wouldn't we in theory have to write a simple
 loop-on-EINTR wrapper for *all* syscalls?

IIUC, that's what SA_RESTART is all about.

 Of course there's the added problem that when open(O_CREAT|O_EXCL) fails
 with EINTR, it's hard to tell whether a file that may now exist is
 indeed yours or some other process's.

Or whether it's in a half created state such as I hypothesize in
http://jira.whamcloud.com/browse/LU-2276.

b.




signature.asc
Description: OpenPGP digital signature


Re: GIT get corrupted on lustre

2013-01-21 Thread Thomas Rast
Please don't drop the Cc list!

Brian J. Murrell br...@interlinx.bc.ca writes:

 What's odd is that while I cannot reproduce the original problem, there
 seems to be another issue/bug with utime():

 I wonder if this is related to http://jira.whamcloud.com/browse/LU-305.
  That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I
 saw it on 2.1.1 and added a comment to the above ticket about that.

Aha, that's a very interesting bug report.  My observations support
yours: I managed to get EINTR during utime().

 In the absence of it, wouldn't we in theory have to write a simple
 loop-on-EINTR wrapper for *all* syscalls?

 IIUC, that's what SA_RESTART is all about.

Yes, but there's precious little clear language on when SA_RESTART is
supposed to act.  In all cases?

The wording on

  http://www.delorie.com/gnu/docs/glibc/libc_485.html
  http://www.delorie.com/gnu/docs/glibc/libc_498.html

leads me to believe that SA_RESTART is actually used on the glibc side
of things, so that any glibc syscall wrapper not specifically equipped
with the restarting behavior would return EINTR unmodified.  This might
explain why utime() doesn't restart like it should (assuming we work on
the theory that POSIX doesn't allow an EINTR from utime() to begin
with).

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-18 Thread Eric Chamberland

Good idea!

I did a strace and here is the output with the error:

http://www.giref.ulaval.ca/~ericc/strace_git_error.txt

Hope it will be insightful!

Eric


On 01/17/2013 12:17 PM, Pyeron, Jason J CTR (US) wrote:

Sorry, I am in cygwin mode, and I had crossed wires in my head. 
s/ProcessMon/strace/


-Original Message-
From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On
Behalf Of Maxime Boissonneault
Sent: Thursday, January 17, 2013 11:41 AM
To: Pyeron, Jason J CTR (US)
Cc: Eric Chamberland; Philippe Vaucher; git@vger.kernel.org; Sébastien
Boisvert
Subject: Re: GIT get corrupted on lustre

I don't know of any lustre filesystem that is used on Windows. Barely
anybody uses Windows in the HPC industry.
This is a Linux cluster.

Maxime Boissonneault

Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit :

-Original Message-
From: Eric Chamberland
Sent: Thursday, January 17, 2013 11:31 AM

On 01/17/2013 09:23 AM, Philippe Vaucher wrote:

Anyone has a new idea?

Did you try Jeff King's code to confirm his idea?

Philippe


Yes I did, but it was running without any problem

I find that my test case is simple (fresh git clone then git gc

in

a
crontab), I bet anyone who has access to a Lustre filesystem can
reproduce the problem...  The problem is to have such a filesystem

to

do
the tests

Stabbing in the dark, but can you log the details with ProcessMon?

http://technet.microsoft.com/en-us/sysinternals/bb896645


But I am available to do it...

-Jason



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-17 Thread Eric Chamberland

Hi!

I still have the corruption problems

We just compiled a git without threads to try... (by the way, 
--without-pthreads doesn't work, you have to do a --disable-pthreads 
instead).


And to remove the warnings about threads at git gc execution, I did a:

git config --local pack.threads 1

and cloned a repository and started to do:

git gc

once every hour.

Then this night (at 05:35:02 exactly), the same error as usual occurred:

error: index file 
.git/objects/pack/pack-bf0748cee64a1964be0a1061c82aca51c993b825.idx is 
too small

error: refs/heads/master does not point to a valid object!

So now I am convinced that it is not a thread problem

I am kind of discouraged, we like to use git, but in this case we have 
this error which seems unsolvable!


Anyone has a new idea?

Thanks,

Eric


On 01/09/2013 04:20 PM, Eric Chamberland wrote:

Hi Brian,

On 01/08/2013 11:11 AM, Eric Chamberland wrote:

On 12/24/2012 10:11 AM, Brian J. Murrell wrote:

Have you tried adding a -q to the git command line to quiet down git's
feedback messages?





I moved to git 1.8.1 and added the -q to the command git gc but it
occured to return an error, so the -q option is not avoiding the
problem here... :-/

command in crontab:

cd /rap/jsf-051-aa/ericc/tests_git_clones/GIREF  for i in seq 10; do
/software/apps/git/1.8.1/bin/git gc -q || true;done

results:
error: index file
.git/objects/pack/pack-1f09879c88cd71a15dcc891713cf038d249830ad.idx is
too small
error: refs/remotes/origin/BIB_Branche_1_4_x does not point to a valid
object!

and this clone was a clean clone in which only git qc -q has been
run on

I still have a doubt on threads

Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-17 Thread Eric Chamberland

On 01/17/2013 09:23 AM, Philippe Vaucher wrote:

Anyone has a new idea?


Did you try Jeff King's code to confirm his idea?

Philippe



Yes I did, but it was running without any problem

I find that my test case is simple (fresh git clone then git gc in a 
crontab), I bet anyone who has access to a Lustre filesystem can 
reproduce the problem...  The problem is to have such a filesystem to do 
the tests


But I am available to do it...

Thanks,

Eric



--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: GIT get corrupted on lustre

2013-01-17 Thread Pyeron, Jason J CTR (US)
 -Original Message-
 From: Eric Chamberland
 Sent: Thursday, January 17, 2013 11:31 AM
 
 On 01/17/2013 09:23 AM, Philippe Vaucher wrote:
  Anyone has a new idea?
 
  Did you try Jeff King's code to confirm his idea?
 
  Philippe
 
 
 Yes I did, but it was running without any problem
 
 I find that my test case is simple (fresh git clone then git gc in
 a
 crontab), I bet anyone who has access to a Lustre filesystem can
 reproduce the problem...  The problem is to have such a filesystem to
 do
 the tests

Stabbing in the dark, but can you log the details with ProcessMon?

http://technet.microsoft.com/en-us/sysinternals/bb896645

 
 But I am available to do it...

-Jason


smime.p7s
Description: S/MIME cryptographic signature


Re: GIT get corrupted on lustre

2013-01-17 Thread Maxime Boissonneault
I don't know of any lustre filesystem that is used on Windows. Barely 
anybody uses Windows in the HPC industry.

This is a Linux cluster.

Maxime Boissonneault

Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit :

-Original Message-
From: Eric Chamberland
Sent: Thursday, January 17, 2013 11:31 AM

On 01/17/2013 09:23 AM, Philippe Vaucher wrote:

Anyone has a new idea?

Did you try Jeff King's code to confirm his idea?

Philippe


Yes I did, but it was running without any problem

I find that my test case is simple (fresh git clone then git gc in
a
crontab), I bet anyone who has access to a Lustre filesystem can
reproduce the problem...  The problem is to have such a filesystem to
do
the tests

Stabbing in the dark, but can you log the details with ProcessMon?

http://technet.microsoft.com/en-us/sysinternals/bb896645


But I am available to do it...

-Jason



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: GIT get corrupted on lustre

2013-01-17 Thread Pyeron, Jason J CTR (US)
Sorry, I am in cygwin mode, and I had crossed wires in my head. 
s/ProcessMon/strace/

 -Original Message-
 From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On
 Behalf Of Maxime Boissonneault
 Sent: Thursday, January 17, 2013 11:41 AM
 To: Pyeron, Jason J CTR (US)
 Cc: Eric Chamberland; Philippe Vaucher; git@vger.kernel.org; Sébastien
 Boisvert
 Subject: Re: GIT get corrupted on lustre
 
 I don't know of any lustre filesystem that is used on Windows. Barely
 anybody uses Windows in the HPC industry.
 This is a Linux cluster.
 
 Maxime Boissonneault
 
 Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit :
  -Original Message-
  From: Eric Chamberland
  Sent: Thursday, January 17, 2013 11:31 AM
 
  On 01/17/2013 09:23 AM, Philippe Vaucher wrote:
  Anyone has a new idea?
  Did you try Jeff King's code to confirm his idea?
 
  Philippe
 
  Yes I did, but it was running without any problem
 
  I find that my test case is simple (fresh git clone then git gc
 in
  a
  crontab), I bet anyone who has access to a Lustre filesystem can
  reproduce the problem...  The problem is to have such a filesystem
 to
  do
  the tests
  Stabbing in the dark, but can you log the details with ProcessMon?
 
  http://technet.microsoft.com/en-us/sysinternals/bb896645
 
  But I am available to do it...
  -Jason
 
 
 --
 -
 Maxime Boissonneault
 Analyste de calcul - Calcul Québec, Université Laval
 Ph. D. en physique
 
 --
 To unsubscribe from this list: send the line unsubscribe git in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


smime.p7s
Description: S/MIME cryptographic signature


Re: GIT get corrupted on lustre

2013-01-09 Thread Eric Chamberland

Hi Brian,

On 01/08/2013 11:11 AM, Eric Chamberland wrote:

On 12/24/2012 10:11 AM, Brian J. Murrell wrote:

Have you tried adding a -q to the git command line to quiet down git's
feedback messages?





I moved to git 1.8.1 and added the -q to the command git gc but it 
occured to return an error, so the -q option is not avoiding the 
problem here... :-/


command in crontab:

cd /rap/jsf-051-aa/ericc/tests_git_clones/GIREF  for i in seq 10; do 
/software/apps/git/1.8.1/bin/git gc -q || true;done


results:
error: index file 
.git/objects/pack/pack-1f09879c88cd71a15dcc891713cf038d249830ad.idx is 
too small
error: refs/remotes/origin/BIB_Branche_1_4_x does not point to a valid 
object!


and this clone was a clean clone in which only git qc -q has been 
run on


I still have a doubt on threads

Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2013-01-08 Thread Eric Chamberland

On 12/24/2012 10:11 AM, Brian J. Murrell wrote:

Have you tried adding a -q to the git command line to quiet down git's
feedback messages?



Ok, I have modified my crontab to use -q and I will wait to see if the 
problem occurs from now.



I discovered other oddities with using git on Lustre which I described
in this thread:

http://thread.gmane.org/gmane.comp.version-control.git/208886

I found that by simply disabling the feedback (which disables the
copious SIGALRM processing) I could alleviate the issue.

I wonder if your issues are more of the same.

I filed Lustre bug LU-2276 about it at:

http://jira.whamcloud.com/browse/LU-2276


Thank you for these informations.  I see the bug is unresolved!...

Eric


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2012-12-26 Thread Jeff King
On Mon, Dec 24, 2012 at 09:08:46AM -0500, Eric Chamberland wrote:

 Doing a git clone always work fine, but when we git pull or git
 gc or git fsck, often (1/5) the local repository get corrupted.
 for example, I got this error two days ago while doing git gc:
 
 error: index file 
 .git/objects/pack/pack-7b43b1c613a851392aaf4f66916dff2577931576.idx is too 
 small
 error: refs/heads/mail_seekable does not point to a valid object!
 [...]
 We think it could be related to the fact that we are on a *Lustre*
 filesystem, which I think doesn't fully support file locking.

I don't think locking is a problem here. The problem is that you have a
corrupt .idx file (the second error is almost certainly an effect of the
first one; git cannot look in the packfile, and therefore cannot find
the object the ref points to). But we do not ever lock the .idx files.
They are generated in tmpfiles and then atomically moved into place
using a hard link.

So if anything, I would suspect that lustre has trouble with the
write/fsync/close/link sequence. Is it possible that it does not keep
the ordering, and readers might see a linked file that is missing some
data? If you wait (or do some synchronizing operation on the filesystem,
like sync, or an unmount/mount), does the repo later work, or is it
broken forever?

 #1) However, how can we *test* the filesystem (lustre) compatibility
 with git? (Is there a unit test we can run?)

Running make test in git.git would be a good start. You could also try
running the C program I'm including below. It repeatedly runs a
write/close/fsync/link sequence like the one that index-pack runs, and
then verifies the result. If it does not run forever without error, that
would be a sign of the possible ordering problem I mentioned above.

 #2) Is there a way to compile GIT to be compatible with lustre? (ex:
 no threads?)

This isn't a known issue, so I don't know offhand what compile flags
might help. The complete list is at the top of Makefile. You might try
with NO_PTHREADS=Yes, but I kind of doubt that threads are at work here.

 #3) If you *know* your filesystem doesn't allow file locking, how
 would you configure/compile GIT to work on it?

I think locking is a red herring here, as it is not used to create the
.idx files at all (and we don't do flock locking anyway; everything
happens via O_EXCL creation).

-Peff

-- 8 --
#include fcntl.h
#include unistd.h
#include stdlib.h
#include stdio.h
#include string.h

static int randomize(unsigned char *buf, int len)
{
  int i;
  len = rand() % len;
  for (i = 0; i  len; i++)
buf[i] = rand()  0xff;
  return len;
}

static int check_eof(int fd)
{
  int ch;
  int r = read(fd, ch, 1);
  if (r  0) {
perror(read error after expected EOF);
return -1;
  }
  if (r  0) {
fprintf(stderr, extra byte after expected EOF);
return -1;
  }
  return 0;
}

static int verify(int fd, const unsigned char *buf, int len)
{
  while (len) {
char to_check[4096];
int got = read(fd, to_check,
   len  sizeof(to_check) ? len : sizeof(to_check));

if (got  0) {
  perror(unable to read);
  return -1;
}
if (got == 0) {
  fprintf(stderr, premature EOF (%d bytes remaining), len);
  return -1;
}
if (memcmp(buf, to_check, got)) {
  fprintf(stderr, bytes differ);
  return -1;
}

buf += got;
len -= got;
  }

  return check_eof(fd);
}

int write_in_full(int fd, const unsigned char *buf, int len)
{
  while (len) {
int r = write(fd, buf, len);
if (r  0)
  return -1;
buf += r;
len -= r;
  }
  return 0;
}

int move_into_place(const char *old, const char *new)
{
  if (link(old, new)  0) {
perror(unable to create hard link);
return 1;
  }
  unlink(old);
  return 0;
}

int main(void)
{
  while (1) {
static unsigned char junk[1024*1024];
int len = randomize(junk, sizeof(junk));
int fd;

/* clean up from any previous round */
unlink(tmpfile);
unlink(final.idx);

fd = open(tmpfile, O_WRONLY|O_CREAT, 0666);
if (fd  0) {
  perror(unable to open tmpfile);
  return 1;
}
if (write_in_full(fd, junk, len)  0 ||
fsync(fd)  0 ||
close(fd)  0) {
  perror(unable to write);
  return 1;
}

if (move_into_place(tmpfile, final.idx)  0)
  return 1;

fd = open(final.idx, O_RDONLY);
if (fd  0) {
  perror(unable to open index file);
  return 1;
}
if (verify(fd, junk, len)  0)
  return 1;
close(fd);
  }
}
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


GIT get corrupted on lustre

2012-12-24 Thread Eric Chamberland

Hi,

we are using git since may and all is working fine for all of us (almost 
20 people) on our workstations.  However, when we clone our repositories 
to the cluster, only and only there

we are having many problems similiar to this post:

http://thread.gmane.org/gmane.comp.file-systems.lustre.user/12093

Doing a git clone always work fine, but when we git pull or git gc 
or git fsck, often (1/5) the local repository get corrupted.

for example, I got this error two days ago while doing git gc:

error: index file 
.git/objects/pack/pack-7b43b1c613a851392aaf4f66916dff2577931576.idx is too small
error: refs/heads/mail_seekable does not point to a valid object!

also, I got this error 5 days ago:

error: index file 
.git/objects/pack/pack-ef9b5bbff1ebc1af63ef4262ade3e18b439c58af.idx is too small
error: refs/heads/mail_seekable does not point to a valid object!
Removing stale temporary file .git/objects/pack/tmp_pack_lO7aw2

and this one some time ago:

Removing stale temporary file .git/objects/pack/tmp_pack_5CHb2F
Removing stale temporary file .git/objects/pack/tmp_pack_GY159g
Removing stale temporary file .git/objects/pack/tmp_pack_aKkXTS

We are using git 1.8.0.1 on CentOS release 5.8 (Final).

We think it could be related to the fact that we are on a *Lustre* 
filesystem, which I think doesn't fully support file locking.


Questions:

#1) However, how can we *test* the filesystem (lustre) compatibility 
with git? (Is there a unit test we can run?)


#2) Is there a way to compile GIT to be compatible with lustre? (ex: no 
threads?)


#3) If you *know* your filesystem doesn't allow file locking, how would 
you configure/compile GIT to work on it?


#4) Anyone has another idea on how to solve this?

Thanks,

Eric

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2012-12-24 Thread Andreas Schwab
Eric Chamberland eric.chamberl...@giref.ulaval.ca writes:

 #1) However, how can we *test* the filesystem (lustre) compatibility with
 git? (Is there a unit test we can run?)

Have you considered running git's testsuite?

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GIT get corrupted on lustre

2012-12-24 Thread Brian J. Murrell
On 12-12-24 09:08 AM, Eric Chamberland wrote:
 Hi,

Hi,

 Doing a git clone always work fine, but when we git pull or git gc
 or git fsck, often (1/5) the local repository get corrupted.

Have you tried adding a -q to the git command line to quiet down git's
feedback messages?

I discovered other oddities with using git on Lustre which I described
in this thread:

http://thread.gmane.org/gmane.comp.version-control.git/208886

I found that by simply disabling the feedback (which disables the
copious SIGALRM processing) I could alleviate the issue.

I wonder if your issues are more of the same.

I filed Lustre bug LU-2276 about it at:

http://jira.whamcloud.com/browse/LU-2276

Cheers,
b.




signature.asc
Description: OpenPGP digital signature


Re: GIT get corrupted on lustre

2012-12-24 Thread Greg Troxel

  we are using git since may and all is working fine for all of us
  (almost 20 people) on our workstations.  However, when we clone our
  repositories to the cluster, only and only there
  we are having many problems similiar to this post:

What filesystem tests have you run on lustre?  I would run every test
you can find, and lustre should have a robust test suite.  It's really
hard to be certain, but given how many filesystems git is used with,
your experience points to a lustre bug.

I would also suggest using ktrace/ktruss/strace and perhaps poring over
the logs to see if you can spot any bad behavior.



pgp6qmR6soDBX.pgp
Description: PGP signature