Thanks for posting the logs. It looks like the create_list function within Trove actually generated the EINVAL error, but there aren't enough log messages in that path to tell why.

Any chance you could apply the patch attached to this email and retry this scenario (with verbose logging)? I'm hoping for some extra output after the line that looks like this:

(0x8d4f020) batch_create (prelude sm) state: perm_check (status = 0)
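
A rough sketch of what I have in mind, in case it helps (I'm assuming you
save the attached diff as batch-create-debug.patch at the top of your pvfs2
source tree; the file name is just an example):

    cd pvfs2-2.8.1                        # or wherever your source tree lives
    patch -p0 < batch-create-debug.patch  # the diff uses tree-relative paths
    make && make install                  # rebuild/reinstall on both servers
    # then set "EventLogging verbose" in /etc/pvfs2-fs.conf, restart the
    # servers, and re-run the scenario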

thanks,
-Phil


Asterios Katsifodimos wrote:
Yes, both of them, because both are now metadata servers. When I had one metadata server and one I/O server, the metadata server did not produce the errors until the I/O server came up. From the moment the I/O server comes up, the metadata server goes crazy...

I have uploaded the log files here:
http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

have a look!

thanks!
On Mon, Apr 6, 2009 at 7:00 PM, Phil Carns <[email protected]> wrote:

    Ok.  Could you try "verbose" now as the log level?  It is close to
    the "all" level but should only print information while the server
    is busy.
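
    That is, in the <Defaults> section of /etc/pvfs2-fs.conf on both
    servers (path taken from the config in your original mail), something
    like:

        EventLogging verbose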

    Are both wn140 and wn141 showing the same batch create errors, or
    just one of them?


    thanks,
    -Phil

    Asterios Katsifodimos wrote:

        Hello Phil,

        Thanks for your answer.
        Yes, I delete the storage directory every time I make a new
        configuration, and I run the pvfs2-server -f command before
        starting the daemons.
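
        Concretely, what I run on each node is roughly the following (a
        sketch using the paths from my config below; the exact invocation
        may differ slightly):

            killall pvfs2-server                # stop any running daemons
            rm -rf /pvfs                        # wipe the old storage space
            pvfs2-server -f /etc/pvfs2-fs.conf  # recreate the storage space
            pvfs2-server /etc/pvfs2-fs.conf     # start the daemon again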

        The only things that I get from the servers are the batch_create
        errors, the server startup messages, and the "PVFS2 server got
        signal 15 (server_status_flag: 507903)" error message. Do you want
        me to try another log level?

        Also, this is how the server is configured:
        ***** Displaying PVFS Configuration Information *****
        ------------------------------------------------------
        PVFS2 configured to build karma gui               :  no
        PVFS2 configured to perform coverage analysis     :  no
        PVFS2 configured for aio threaded callbacks       : yes
        PVFS2 configured to use FUSE                      :  no
        PVFS2 configured for the 2.6.x kernel module      :  no
        PVFS2 configured for the 2.4.x kernel module      :  no
        PVFS2 configured for using the mmap-ra-cache      :  no
        PVFS2 will use workaround for redhat 2.4 kernels  :  no
        PVFS2 will use workaround for buggy NPTL          :  no
        PVFS2 server will be built                        : yes

        PVFS2 version string: 2.8.1


        thanks again,
        On Mon, Apr 6, 2009 at 5:21 PM, Phil Carns <[email protected]> wrote:

           Hello,

           I'm not sure what would cause that "Invalid argument" error.

           Could you try the following steps (a rough command sketch follows the list):

           - kill both servers
           - modify your configuration files to set "EventLogging" to "none"
           - delete your old log files (or move them to another directory)
           - start the servers
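
           On each server that would look something like this (a rough
           sketch; I'm assuming the /etc/pvfs2-fs.conf and
           /tmp/pvfs2-server.log paths from your config, and killall as the
           way you normally stop the daemons):

               killall pvfs2-server
               # edit /etc/pvfs2-fs.conf: in <Defaults>, change
               # "EventLogging all" to "EventLogging none"
               mv /tmp/pvfs2-server.log /tmp/pvfs2-server.log.old
               pvfs2-server /etc/pvfs2-fs.conf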

           You can then send us the complete contents of both log files
        and we
           can go from there.  The "all" level is a little hard to interpret
           because it generates a lot of information even when servers
        are idle.

           Also, when you went from one server to two, did you delete
        your old
           storage space (/pvfs) and start over, or are you trying to
        keep that
           data and add servers to it?

           thanks!
           -Phil

           Asterios Katsifodimos wrote:

               Hello all,

                I have been trying to install PVFS 2.8.1 on Ubuntu Server,
                CentOS 4, and Scientific Linux 4. It compiles, and I can run
                it in a "single host" configuration without any problems.

                However, when I add more nodes to the configuration (always
                using the pvfs2-genconfig defaults), I have the following
                problem:

               *On the metadata node I get these messages:*
               [E 04/02 20:16] batch_create request got: Invalid argument
               [E 04/02 20:16] batch_create request got: Invalid argument
               [E 04/02 20:16] batch_create request got: Invalid argument
               [E 04/02 20:16] batch_create request got: Invalid argument


               *In the IO nodes I get:*
               [r...@wn140 ~]# tail -50 /tmp/pvfs2-server.log
               [D 04/02 23:53] BMI_testcontext completing:
        18446744072456767880
               [D 04/02 23:53] [SM Entering]: (0x88f8b00)
               msgpairarray_sm:complete (status: 1)
               [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37
        index: 0
               base-frm: 1
               [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00
               status_user_tag 1 msgarray_count 1
               [D 04/02 23:53]   msgpairarray: 1 operations remain
               [D 04/02 23:53] [SM Exiting]: (0x88f8b00)
               msgpairarray_sm:complete (error code: -1073742006), (action:
               DEFERRED)
               [D 04/02 23:53] [SM Entering]: (0x88f8b00)
               msgpairarray_sm:complete (status: 0)
               [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37
        index: 0
               base-frm: 1
               [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00
               status_user_tag 0 msgarray_count 1
               [D 04/02 23:53]   msgpairarray: all operations complete
               [D 04/02 23:53] [SM Exiting]: (0x88f8b00)
               msgpairarray_sm:complete (error code: 190), (action:
        COMPLETE)
               [D 04/02 23:53] [SM Entering]: (0x88f8b00)
               msgpairarray_sm:completion_fn (status: 0)
               [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37
        index: 0
               base-frm: 1
               [D 04/02 23:53] (0x88f8b00) msgpairarray state: completion_fn
               [E 04/02 23:53] Warning: msgpair failed to tcp://wn141:3334,
               will retry: Connection refused
               [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair 0
               failed, retry 1
               [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair
        retrying
               after delay.
               [D 04/02 23:53] [SM Exiting]: (0x88f8b00)
               msgpairarray_sm:completion_fn (error code: 191), (action:
        COMPLETE)
               [D 04/02 23:53] [SM Entering]: (0x88f8b00)
               msgpairarray_sm:post_retry (status: 0)
               [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37
        index: 0
               base-frm: 1
               [D 04/02 23:53] msgpairarray_post_retry: sm 0x88f8b00,
        wait 2000 ms
               [D 04/02 23:53] [SM Exiting]: (0x88f8b00)
               msgpairarray_sm:post_retry (error code: 0), (action:
        DEFERRED)
               [D 04/02 23:53] [SM Entering]: (0x89476c0)
               perf_update_sm:do_work (status: 0)
               [P 04/02 23:53] Start times (hr:min:sec):  23:53:11.330
                23:53:10.310  23:53:09.287  23:53:08.268  23:53:07.245
                23:53:06.225
               [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.026
                00:00:01.020  00:00:01.023  00:00:01.019  00:00:01.023
                00:00:01.020
                [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
               [D 04/02 23:53] [SM Exiting]: (0x89476c0)
        perf_update_sm:do_work
               (error code: 0), (action: DEFERRED)
               [D 04/02 23:53] [SM Entering]: (0x8948810)
        job_timer_sm:do_work
               (status: 0)
               [D 04/02 23:53] [SM Exiting]: (0x8948810)
        job_timer_sm:do_work
               (error code: 0), (action: DEFERRED)
               [D 04/02 23:53] [SM Entering]: (0x89476c0)
               perf_update_sm:do_work (status: 0)
               [P 04/02 23:53] Start times (hr:min:sec):  23:53:12.356
                23:53:11.330  23:53:10.310  23:53:09.287  23:53:08.268
                23:53:07.245
               [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.020
                00:00:01.026  00:00:01.020  00:00:01.023  00:00:01.019
                00:00:01.023
                [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
               [D 04/02 23:53] [SM Exiting]: (0x89476c0)
        perf_update_sm:do_work
               (error code: 0), (action: DEFERRED)
               [D 04/02 23:53] [SM Entering]: (0x8948810)
        job_timer_sm:do_work
               (status: 0)
               [D 04/02 23:53] [SM Exiting]: (0x8948810)
        job_timer_sm:do_work
               (error code: 0), (action: DEFERRED)


                The metadata node keeps asking the I/O nodes for something
                that they cannot provide correctly, so it complains. As a
                result, neither the I/O nodes nor the metadata node works.

                I have installed these services many times. I have tested
                this with Berkeley DB 4.2 and 4.3 on Red Hat systems
                (CentOS, Scientific Linux) and on Ubuntu Server.

                I have also tried PVFS version 2.6.3 and I get the same
                problem.

               *My config files look like:*
               [r...@wn140 ~]# more /etc/pvfs2-fs.conf
               <Defaults>
                  UnexpectedRequests 50
                  EventLogging all
                  EnableTracing no
                  LogStamp datetime
                  BMIModules bmi_tcp
                  FlowModules flowproto_multiqueue
                  PerfUpdateInterval 1000
                  ServerJobBMITimeoutSecs 30
                  ServerJobFlowTimeoutSecs 30
                  ClientJobBMITimeoutSecs 300
                  ClientJobFlowTimeoutSecs 300
                  ClientRetryLimit 5
                  ClientRetryDelayMilliSecs 2000
                  PrecreateBatchSize 512
                  PrecreateLowThreshold 256

                  StorageSpace /pvfs
                  LogFile /tmp/pvfs2-server.log
               </Defaults>

               <Aliases>
                  Alias wn140 tcp://wn140:3334
                  Alias wn141 tcp://wn141:3334
               </Aliases>

               <Filesystem>
                  Name pvfs2-fs
                  ID 320870944
                  RootHandle 1048576
                  FileStuffing yes
                  <MetaHandleRanges>
                      Range wn140 3-2305843009213693953
                      Range wn141 2305843009213693954-4611686018427387904
                  </MetaHandleRanges>
                  <DataHandleRanges>
                      Range wn140 4611686018427387905-6917529027641081855
                      Range wn141 6917529027641081856-9223372036854775806
                  </DataHandleRanges>
                  <StorageHints>
                      TroveSyncMeta yes
                      TroveSyncData no
                      TroveMethod alt-aio
                  </StorageHints>
               </Filesystem>


                My setup consists of two nodes that are both I/O and
                metadata nodes. I have also tried a 4-node setup with 2 I/O
                and 2 metadata nodes, with the same result.

               Any suggestions?

               thank you in advance,
               --
               Asterios Katsifodimos
               High Performance Computing systems Lab
               Department of Computer Science, University of Cyprus
               http://www.asteriosk.gr <http://www.asteriosk.gr/>








? log.txt
? doc/citeseer.bib
? doc/foo.patch
? doc/google-scholar.bib
? examples/heartbeat/.cib.xml.example.swp
? src/apps/admin/boom.conf
? src/apps/admin/foo.conf
? src/apps/admin/pvfs2-cp-threadtest.c
? src/client/sysint/.mgmt-setparam-list.sm.swp
? src/io/bmi/bmi_tcp.tgz
? src/io/bmi/bmi_mx/log.txt
? src/io/bmi/bmi_tcp/bmi-tcp.c.hacked
? src/io/bmi/bmi_tcp/bmi-tcp.c.hacked2
? src/io/bmi/bmi_tcp/foo.patch
? src/io/bmi/bmi_tcp/log.txt
? src/io/bmi/bmi_tcp/socket-collection-epoll.c.hacked
? src/io/bmi/bmi_tcp/socket-collection-epoll.c.pipe
? src/io/bmi/bmi_tcp/socket-collection-epoll.h.hacked
? src/io/bmi/bmi_tcp/socket-collection-epoll.h.pipe
? src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c.backup
? src/kernel/linux-2.6/597.patch
? src/kernel/linux-2.6/foo.patch
? src/kernel/linux-2.6/log.txt
? src/kernel/linux-2.6/out.txt
? src/proto/.pvfs2-req-proto.h.swp
? test/automated/mpi-vfs-tests.d/fsx-mpi
? test/automated/sysint-tests.d/out.txt
? test/automated/vfs-tests.d/fsx-bin
? test/client/mpi-io/mpi-io-test
? test/client/mpi-io/test.out
Index: src/io/trove/trove-dbpf/dbpf-dspace.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/io/trove/trove-dbpf/dbpf-dspace.c,v
retrieving revision 1.163
diff -a -u -p -r1.163 dbpf-dspace.c
--- src/io/trove/trove-dbpf/dbpf-dspace.c	30 Jan 2009 15:41:08 -0000	1.163
+++ src/io/trove/trove-dbpf/dbpf-dspace.c	6 Apr 2009 17:36:55 -0000
@@ -314,11 +314,14 @@ static int dbpf_dspace_create_list(TROVE
         &op_p);
     if(ret < 0)
     {
+        gossip_err("Error: dbpf_op_init_queued_or_immediate() failure in create_list.\n");
         return ret;
     }
 
     if (!extent_array || (extent_array->extent_count < 1))
     {
+        gossip_err("Error: bad extent array in create_list: %p, %d.\n",
+            extent_array, (extent_array?extent_array->extent_count:0));
         return -TROVE_EINVAL;
     }
 
@@ -338,6 +341,7 @@ static int dbpf_dspace_create_list(TROVE
 
     if (op_p->u.d_create_list.extent_array.extent_array == NULL)
     {
+        gossip_err("Error: enomem in create_list.\n");
         return -TROVE_ENOMEM;
     }
 
@@ -391,6 +395,7 @@ static int dbpf_dspace_create_list_op_sv
             new_handle);
         if(ret < 0)
         {
+            gossip_err("Error: dbpf_dspace_create_store_handle in create_list.\n");
             /* release any handles we grabbed so far */
             for(j=0; j<=i; j++)
             {
Index: src/server/batch-create.sm
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/server/batch-create.sm,v
retrieving revision 1.3
diff -a -u -p -r1.3 batch-create.sm
--- src/server/batch-create.sm	20 Nov 2008 01:17:10 -0000	1.3
+++ src/server/batch-create.sm	6 Apr 2009 17:36:55 -0000
@@ -133,6 +133,12 @@ static int batch_create_cleanup(
                 llu(s_op->resp.u.batch_create.handle_array[i]));
         }
     }
+    else
+    {
+        gossip_debug(
+            GOSSIP_SERVER_DEBUG, "job_trove_dspace_create_list failed: %d\n",
+            s_op->resp.status);
+    }
 
     if(s_op->resp.u.batch_create.handle_array)
     {
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
