Murali,

Yes, it isn't good. We still have hopes that an fsck is the fix, since running pvfs2-fsck in non-fix mode reports a number of problems. We just haven't been able to get the system, and the go-ahead from the users who need it for their research, since many use the ROMIO access method and are not experiencing any problems.

Nothing special here for the kernel config. We're running the kernel from the RHEL4 repo.

Bill

Murali Vilayannur wrote:
Hi Bill,
This is really bad. I wish I had a system to reproduce your setup.
Is there something special in your kernel .config?
(PREEMPT for example)
What distro is this on btw?
thanks,
Murali


On 8/24/07, Bill Wichser <[EMAIL PROTECTED]> wrote:
We have been experiencing frequent crashes in the PVFS2 kernel module
when applications use standard system I/O to write to PVFS2 files. We
are running the Linux 2.6.9-55.0.2 smp kernel, and PVFS2 v2.6.3.
The general protection fault almost always occurs at
pvfs2_devreq_writev+351.

In our build, the invalid reference specifically occurs in the
qhash_del() operation, within the inline qhash_search_and_remove()
function called by pvfs2_devreq_writev(). See excerpts below:

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
static ssize_t pvfs2_devreq_writev(
     struct file *file,
     const struct iovec *iov,
     unsigned long count,
     loff_t * offset)
{
.
.
.
     /* lookup (and remove) the op based on the tag */
     hash_link = qhash_search_and_remove(htable_ops_in_progress, &(tag));
     if (hash_link)
     {
.
.
.
}

/* qhash_search_and_remove()
  *
  * searches for and removes a link in the hash table
  * that matches the given key
  *
  * returns pointer to link on success, NULL on failure (or item
  * not found).  On success, link is removed from hashtable.
  */
static inline struct qhash_head *qhash_search_and_remove(
     struct qhash_table *table,
     void *key)
{
     int index = 0;
     struct qhash_head *tmp_link = NULL;

     /* find the hash value */
     index = table->hash(key, table->table_size);

     /* linear search at index to find match */
     qhash_lock(&table->lock);
     qhash_for_each(tmp_link, &(table->array[index]))
     {
         if (table->compare(key, tmp_link))
         {
             qhash_del(tmp_link);
             qhash_unlock(&table->lock);
             return (tmp_link);
         }
     }
     qhash_unlock(&table->lock);
     return (NULL);
}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We have since run pvfs2-fsck on the file system and found some
corruption. So we're not sure whether the crash is just a second-order
effect of that corruption, or its actual cause.

So we're passing this along to you to see if you've had any similar
reports, or can point us in the right direction to help find the
problem.

The crash file sys and bt info follows. Please let us know if you need
more information.

Thanks,
Bill

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
crash> sys
   SYSTEM MAP: /boot/System.map-2.6.9-55.0.2.ELsmp
DEBUG KERNEL: /home/jsbillin/vmlinux-2.6.9-55.ELsmp (2.6.9-55.ELsmp)
     DUMPFILE: /var/crash/172.18.0.85-2007-08-06-07:47/vmcore
         CPUS: 4
         DATE: Fri Aug 17 11:53:17 2007
       UPTIME: 11 days, 04:08:05
LOAD AVERAGE: 2.49, 2.10, 1.63
        TASKS: 96
     NODENAME: woodhen-085
      RELEASE: 2.6.9-55.0.2.ELsmp
      VERSION: #1 SMP Mon Jun 25 14:12:33 EDT 2007
      MACHINE: x86_64  (2660 Mhz)
       MEMORY: 9 GB
        PANIC: ""
crash> bt
PID: 3454   TASK: 10236b63030       CPU: 0   COMMAND: "pvfs2-client-co"
  #0 [10232fbbc60] netpoll_start_netdump at ffffffffa0249366
  #1 [10232fbbc90] die at ffffffff80111c00
  #2 [10232fbbcb0] do_general_protection at ffffffff801124e5
  #3 [10232fbbcf0] error_exit at ffffffff80110d91
     [exception RIP: pvfs2_devreq_writev+351]
     RIP: ffffffffa0226948  RSP: 0000010232fbbda8  RFLAGS: 00010246
     RAX: 0000000000000000  RBX: 40903a138d84f800  RCX: 0000000000000000
     RDX: 40903a138d84f800  RSI: 00000101aeab1bd8  RDI: 0000010232fbbdc0
     RBP: 0000010006bccd40   R8: 0000000000000000   R9: 0000000000000000
     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
     R13: 000001020557e600  R14: 000001020557e5f0  R15: 0000010232fbbe88
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #4 [10232fbbda0] pvfs2_devreq_writev at ffffffffa022693e
  #5 [10232fbbe00] sock_readv_writev at ffffffff802a91f9
  #6 [10232fbbe60] do_readv_writev at ffffffff8017a45f
  #7 [10232fbbf40] sys_writev at ffffffff8017a631
  #8 [10232fbbf80] system_call at ffffffff8011026a
     RIP: 00000035854bfcdb  RSP: 0000007fbffff228  RFLAGS: 00010202
     RAX: 0000000000000014  RBX: ffffffff8011026a  RCX: 00000035854bf1e9
     RDX: 0000000000000004  RSI: 0000007fbffff120  RDI: 0000000000000005
     RBP: 0000000000000000   R8: 0000000000000001   R9: 0000000000000004
     R10: 0000000000000001  R11: 0000000000000206  R12: 0000000000000005
     R13: 0000007fbffff120  R14: 0000000000000004  R15: 0000000000000000
     ORIG_RAX: 0000000000000014  CS: 0033  SS: 002b
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
