Hi Asterios,

Whew - I'm glad we got that figured out. I apologize for the cryptic error message. We should at least print something more helpful in that case, but I will check the documentation too.

-Phil

Asterios Katsifodimos wrote:
Hello Phil,

Yes, they differ!
 Name pvfs2-fs
        ID 947057450
        RootHandle 1048576

 Name pvfs2-fs
        ID 1529723372
        RootHandle 1048576


I was running pvfs2-genconfig with cssh on all of the machines...
The correct way is to create the same file once and copy it to all the nodes.

Thanks for the pointer, the errors were really misleading...

However, could we state in the documentation that the file
has to be created once and then distributed to the nodes?
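For the documentation, the intended workflow could be sketched roughly like this (the hostnames and paths are just my example; adjust to your cluster):

```shell
# Run pvfs2-genconfig ONCE, on a single node -- it generates a random
# filesystem ID, so running it separately on each node (e.g. via cssh)
# produces config files whose IDs do not match.
pvfs2-genconfig /etc/pvfs2-fs.conf

# Then push that one file, unchanged, to every server node.
for host in wn140 wn141; do
    scp /etc/pvfs2-fs.conf root@"$host":/etc/pvfs2-fs.conf
done
```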

Thanks a lot for your quick help!

best regards,
Asterios

On Mon, Apr 6, 2009 at 10:31 PM, Phil Carns <[email protected]> wrote:

    I'm running out of places to add log messages to in the code :)

    I see a possible cause that I missed before, but we should be able
    to check this one without a patch.  Can you do a "diff" of the two
    configuration files and see if they are different in any way?  In
    particular do the ID values match?
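    With both files copied to one machine, the check might look like this
    (the file names are just placeholders):

    ```shell
    # Compare the two servers' config files; any difference is suspect.
    diff fs.conf.wn140 fs.conf.wn141

    # In particular, the filesystem ID line must be identical everywhere.
    grep -w ID fs.conf.wn140 fs.conf.wn141
    ```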


    thanks,
    -Phil

    Asterios Katsifodimos wrote:

        No, the systems are identical :)

        [r...@wn140 ~]# hostname
        wn140.grid.ucy.ac.cy
        [r...@wn140 ~]# uname -a
        Linux wn140.grid.ucy.ac.cy 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan
        14 19:07:47 CST 2009 i686 athlon i386 GNU/Linux

        [r...@wn140 ~]# cat /etc/redhat-release
        Scientific Linux SL release 4.7 (Beryllium)
        [r...@wn140 pvfs-2.8.1]# more /proc/cpuinfo
        processor       : 0
        vendor_id       : AuthenticAMD
        cpu family      : 15
        model           : 65
        model name      : Dual-Core AMD Opteron(tm) Processor 2214
        stepping        : 2
        cpu MHz         : 2200.000
        cache size      : 1024 KB
        physical id     : 0
        siblings        : 2
        core id         : 0
        cpu cores       : 2
        fdiv_bug        : no
        hlt_bug         : no
        f00f_bug        : no
        coma_bug        : no
        fpu             : yes
        fpu_exception   : yes
        cpuid level     : 1
        wp              : yes
        flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
        mtrr pge mca cmov
        pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext
        fxsr_opt rdtscp l



        [r...@wn141 ~]# hostname
        wn141.grid.ucy.ac.cy
        [r...@wn141 ~]# uname -a
        Linux wn141.grid.ucy.ac.cy 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan
        14 19:07:47 CST 2009 i686 athlon i386 GNU/Linux

        [r...@wn141 ~]# cat /etc/redhat-release
        Scientific Linux SL release 4.7 (Beryllium)
        [r...@wn141 pvfs-2.8.1]# more /proc/cpuinfo
        processor       : 0
        vendor_id       : AuthenticAMD
        cpu family      : 15
        model           : 65
        model name      : Dual-Core AMD Opteron(tm) Processor 2214
        stepping        : 2
        cpu MHz         : 2200.000
        cache size      : 1024 KB
        physical id     : 0
        siblings        : 2
        core id         : 0
        cpu cores       : 2
        fdiv_bug        : no
        hlt_bug         : no
        f00f_bug        : no
        coma_bug        : no
        fpu             : yes
        fpu_exception   : yes
        cpuid level     : 1
        wp              : yes
        flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
        mtrr pge mca cmov
        pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext
        fxsr_opt rdtscp lm


        Patch applied, logs updated!
        http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
        http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

        thanks,
        Asterios Katsifodimos
        High Performance Computing systems Lab
        Department of Computer Science, University of Cyprus
        http://grid.ucy.ac.cy


        On Mon, Apr 6, 2009 at 10:03 PM, Phil Carns <[email protected]> wrote:

           That didn't show what I expected at all.  It must have hit a
           safety check on the request parameters.  Could you try adding
           in the attached patch as well?

           What kind of systems are these?  Are the two servers different
           architectures by any chance?


           thanks,
           -Phil

           Asterios Katsifodimos wrote:

               Thanks!
               I have applied the patch.

               I have replaced the old logs with the new ones. Just use the
               previous links.
               http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
               http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

               thanks a lot for your help,
               On Mon, Apr 6, 2009 at 8:41 PM, Phil Carns
        <[email protected] <mailto:[email protected]>
               <mailto:[email protected] <mailto:[email protected]>>
        <mailto:[email protected] <mailto:[email protected]>
               <mailto:[email protected] <mailto:[email protected]>>>>
        wrote:

                   Thanks for posting the logs.  It looks like the
                   create_list function within Trove actually generated
                   the EINVAL error, but there aren't enough log messages
                   in that path to know why.

                   Any chance you could apply the patch attached to this
                   email and retry this scenario (with verbose logging)?
                   I'm hoping for some extra output after the line that
                   looks like this:

                   (0x8d4f020) batch_create (prelude sm) state: perm_check
                   (status = 0)


                  thanks,
                  -Phil


                  Asterios Katsifodimos wrote:

                       Yes, both of them, because now both are metadata
                       servers.  When I had one metadata server and one
                       I/O server, the metadata server was not producing
                       the errors until the I/O server came up.  From the
                       moment the I/O server comes up, the metadata server
                       goes crazy...

                       I have uploaded the log files here:
                       http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
                       http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

                      have a look!

                      thanks!
                       On Mon, Apr 6, 2009 at 7:00 PM, Phil Carns <[email protected]> wrote:

                          Ok.  Could you try "verbose" now as the log
                          level?  It is close to the "all" level but
                          should only print information while the server
                          is busy.

                          Are both wn140 and wn141 showing the same batch
                          create errors, or just one of them?


                         thanks,
                         -Phil

                         Asterios Katsifodimos wrote:

                              Hello Phil,

                              Thanks for your answer.
                              Yes, I delete the storage dir every time I
                              make a new configuration, and I run the
                              pvfs2-server -f command before starting the
                              daemons.

                              The only thing that I get from the servers
                              is the batch_create error, the "starting
                              server" message, and the "PVFS2 server got
                              signal 15 (server_status_flag: 507903" error
                              message.  Do you want me to try another log
                              level?

                              Also, this is how the server is configured:
                              ***** Displaying PVFS Configuration Information *****
                              ------------------------------------------------------
                              PVFS2 configured to build karma gui               :  no
                              PVFS2 configured to perform coverage analysis     :  no
                              PVFS2 configured for aio threaded callbacks       : yes
                              PVFS2 configured to use FUSE                      :  no
                              PVFS2 configured for the 2.6.x kernel module      :  no
                              PVFS2 configured for the 2.4.x kernel module      :  no
                              PVFS2 configured for using the mmap-ra-cache      :  no
                              PVFS2 will use workaround for redhat 2.4 kernels  :  no
                              PVFS2 will use workaround for buggy NPTL          :  no
                              PVFS2 server will be built                        : yes

                              PVFS2 version string: 2.8.1


                             thanks again,
                              On Mon, Apr 6, 2009 at 5:21 PM, Phil Carns <[email protected]> wrote:

                                Hello,

                                 I'm not sure what would cause that
                                 "Invalid argument" error.

                                 Could you try the following steps:

                                 - kill both servers
                                 - modify your configuration files to set "EventLogging" to "none"
                                 - delete your old log files (or move them to another directory)
                                 - start the servers
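                                 On a Linux box the steps above might look
                                 roughly like this (the server binary path
                                 is my assumption; adjust to your install):

                                 ```shell
                                 # kill the running server daemon
                                 killall pvfs2-server
                                 # switch EventLogging to "none" in the config file
                                 sed -i 's/EventLogging .*/EventLogging none/' /etc/pvfs2-fs.conf
                                 # move the old log aside
                                 mv /tmp/pvfs2-server.log /tmp/pvfs2-server.log.old
                                 # start the server again (path is an assumption)
                                 /usr/sbin/pvfs2-server /etc/pvfs2-fs.conf
                                 ```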

                                 You can then send us the complete
                                 contents of both log files and we can go
                                 from there.  The "all" level is a little
                                 hard to interpret because it generates a
                                 lot of information even when servers are
                                 idle.

                                 Also, when you went from one server to
                                 two, did you delete your old storage
                                 space (/pvfs) and start over, or are you
                                 trying to keep that data and add servers
                                 to it?

                                thanks!
                                -Phil

                                Asterios Katsifodimos wrote:

                                     Hello all,

                                     I have been trying to install PVFS
                                     2.8.1 on Ubuntu Server, CentOS 4, and
                                     Scientific Linux 4.  I can compile it
                                     and run it in a "single host"
                                     configuration without any problems.

                                     However, when I add more nodes to the
                                     configuration (always using the
                                     pvfs2-genconfig defaults) I have the
                                     following problem:

                                     *On the metadata node I get these messages:*
                                     [E 04/02 20:16] batch_create request got: Invalid argument
                                     [E 04/02 20:16] batch_create request got: Invalid argument
                                     [E 04/02 20:16] batch_create request got: Invalid argument
                                     [E 04/02 20:16] batch_create request got: Invalid argument


                                    *In the IO nodes I get:*
                                     [r...@wn140 ~]# tail -50 /tmp/pvfs2-server.log
                                     [D 04/02 23:53] BMI_testcontext completing: 18446744072456767880
                                     [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:complete (status: 1)
                                     [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                                     [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00 status_user_tag 1 msgarray_count 1
                                     [D 04/02 23:53]   msgpairarray: 1 operations remain
                                     [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:complete (error code: -1073742006), (action: DEFERRED)
                                     [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:complete (status: 0)
                                     [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                                     [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00 status_user_tag 0 msgarray_count 1
                                     [D 04/02 23:53]   msgpairarray: all operations complete
                                     [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:complete (error code: 190), (action: COMPLETE)
                                     [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:completion_fn (status: 0)
                                     [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                                     [D 04/02 23:53] (0x88f8b00) msgpairarray state: completion_fn
                                     [E 04/02 23:53] Warning: msgpair failed to tcp://wn141:3334, will retry: Connection refused
                                     [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair 0 failed, retry 1
                                     [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair retrying after delay.
                                     [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:completion_fn (error code: 191), (action: COMPLETE)
                                     [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:post_retry (status: 0)
                                     [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                                     [D 04/02 23:53] msgpairarray_post_retry: sm 0x88f8b00, wait 2000 ms
                                     [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:post_retry (error code: 0), (action: DEFERRED)
                                     [D 04/02 23:53] [SM Entering]: (0x89476c0) perf_update_sm:do_work (status: 0)
                                     [P 04/02 23:53] Start times (hr:min:sec):  23:53:11.330  23:53:10.310  23:53:09.287  23:53:08.268  23:53:07.245  23:53:06.225
                                     [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.026  00:00:01.020  00:00:01.023  00:00:01.019  00:00:01.023  00:00:01.020
                                     [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                                     [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                                     [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                                     [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
                                     [D 04/02 23:53] [SM Exiting]: (0x89476c0) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
                                     [D 04/02 23:53] [SM Entering]: (0x8948810) job_timer_sm:do_work (status: 0)
                                     [D 04/02 23:53] [SM Exiting]: (0x8948810) job_timer_sm:do_work (error code: 0), (action: DEFERRED)
                                     [D 04/02 23:53] [SM Entering]: (0x89476c0) perf_update_sm:do_work (status: 0)
                                     [P 04/02 23:53] Start times (hr:min:sec):  23:53:12.356  23:53:11.330  23:53:10.310  23:53:09.287  23:53:08.268  23:53:07.245
                                     [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.020  00:00:01.026  00:00:01.020  00:00:01.023  00:00:01.019  00:00:01.023
                                     [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                                     [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                                     [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                                     [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                                     [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
                                     [D 04/02 23:53] [SM Exiting]: (0x89476c0) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
                                     [D 04/02 23:53] [SM Entering]: (0x8948810) job_timer_sm:do_work (status: 0)
                                     [D 04/02 23:53] [SM Exiting]: (0x8948810) job_timer_sm:do_work (error code: 0), (action: DEFERRED)


                                     The metadata node keeps asking the
                                     I/O nodes for something that they
                                     cannot provide correctly, so it
                                     complains.  This keeps both the I/O
                                     nodes and the metadata node from
                                     working.

                                     I have installed these services many
                                     times.  I have tested this using
                                     Berkeley DB 4.2 and 4.3 on Red Hat
                                     systems (CentOS, Scientific Linux)
                                     and on Ubuntu Server.

                                     I have also tried PVFS version 2.6.3
                                     and I get the same problem.

                                    *My config files look like:*
                                    [r...@wn140 ~]# more /etc/pvfs2-fs.conf
                                    <Defaults>
                                       UnexpectedRequests 50
                                       EventLogging all
                                       EnableTracing no
                                       LogStamp datetime
                                       BMIModules bmi_tcp
                                       FlowModules flowproto_multiqueue
                                       PerfUpdateInterval 1000
                                       ServerJobBMITimeoutSecs 30
                                       ServerJobFlowTimeoutSecs 30
                                       ClientJobBMITimeoutSecs 300
                                       ClientJobFlowTimeoutSecs 300
                                       ClientRetryLimit 5
                                       ClientRetryDelayMilliSecs 2000
                                       PrecreateBatchSize 512
                                       PrecreateLowThreshold 256

                                       StorageSpace /pvfs
                                       LogFile /tmp/pvfs2-server.log
                                    </Defaults>

                                    <Aliases>
                                       Alias wn140 tcp://wn140:3334
                                       Alias wn141 tcp://wn141:3334
                                    </Aliases>

                                    <Filesystem>
                                       Name pvfs2-fs
                                       ID 320870944
                                       RootHandle 1048576
                                       FileStuffing yes
                                        <MetaHandleRanges>
                                            Range wn140 3-2305843009213693953
                                            Range wn141 2305843009213693954-4611686018427387904
                                        </MetaHandleRanges>
                                        <DataHandleRanges>
                                            Range wn140 4611686018427387905-6917529027641081855
                                            Range wn141 6917529027641081856-9223372036854775806
                                        </DataHandleRanges>
                                       <StorageHints>
                                           TroveSyncMeta yes
                                           TroveSyncData no
                                           TroveMethod alt-aio
                                       </StorageHints>
                                    </Filesystem>


                                     My setup is made of two nodes that
                                     are both I/O and metadata nodes.  I
                                     have also tried a 4-node setup with
                                     2 I/O and 2 metadata nodes, with the
                                     same result.

                                    Any suggestions?

                                    thank you in advance,
                                    --
                                    Asterios Katsifodimos
                                    High Performance Computing systems Lab
                                    Department of Computer Science,
        University
               of Cyprus
                                     http://www.asteriosk.gr


------------------------------------------------------------------------

                                     _______________________________________________
                                     Pvfs2-users mailing list
                                     [email protected]
                                     http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users