Re: [A51] Error

Christoffer Jerkeby Wed, 04 Nov 2009 04:45:31 -0800

Hey Sascha, I ran the NVIDIA memtest on one of the cards and found it
to be faulty. I returned that card, but the other one is now doing a good
190-260 chains/s. When I get the returned card back I will work on some
speed optimizations for both (following the advice on this list).


Thanks for your help and patience.

Regards Kugg
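
For anyone watching their own runs: below is a minimal Python sketch for spotting the slow state from the progress output. It assumes the "chains done, current rate" line format shown in the logs quoted in this thread, and it takes the 100 chains/s threshold from Sascha's remark about a downclocked GTX 260. The function names, the warm-up handling and the use of a simple mean are illustrative choices, not part of a51table.

```python
import re
from statistics import mean

# Matches progress lines as they appear in this thread, e.g.
# "6633 chains done, current rate 108.78 chains/sec (interval: 00:01:00)"
RATE_LINE = re.compile(r"(\d+) chains done, current rate ([\d.]+) chains/sec")

# Assumed threshold: rates below this suggest a downclocked GPU (GTX 260 only).
SLOW_RATE = 100.0


def parse_rates(log_text):
    """Return (chains_done, rate) pairs for every progress line in log_text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in RATE_LINE.finditer(log_text)]


def looks_throttled(log_text, warmup=1, threshold=SLOW_RATE):
    """True if the mean rate, ignoring the first `warmup` samples, is below threshold."""
    rates = [rate for _, rate in parse_rates(log_text)][warmup:]
    return bool(rates) and mean(rates) < threshold
```

The first sample is skipped by default because the first interval in the quoted logs includes start-up time and reports only 1-3 chains/s. A low mean is only a hint; the actual clock still has to be checked with nvidia-settings or nvclock.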

On Tue, Oct 20, 2009 at 9:16 PM, Sascha Krissler <[email protected]> wrote:
> A rate below 100 chains per second suggests that your GPU is running at
> 399 MHz. My 260 is slowed down to this clock speed after a launch failure
> in the kernel, and it takes a reboot to restore the original 576 MHz
> clock speed. Is this the case for you? You can see the clock speed with
> the nvidia-settings tool or with nvclock. The error never happened on my
> box, which is a GTX260 (single card), and I have already computed 3
> months' worth of chains.
> Hope your investigation brings more insight.
>
>> This is probably going to sound weird, but "mostly no".
>> So, I can run it like this ten times and get the same error over and
>> over again. Eventually I will get a message saying no CUDA devices
>> are available, which requires me to reboot.
>>
>> Sascha, I will try to think of a way to help you reproduce it.
>>
>> I don't mind waiting for the new version. But I suspect this bug is in
>> the cuda environment itself since I get the same error with the same
>> frequency of problems when running for instance gpu_md5_bruteforce.
>>
>> I will play around with this some more during the coming days to see
>> if I can come up with some more informative output.
>>
>> Thanks Kugg
>>
>> On Tue, Oct 20, 2009 at 9:12 PM, Sascha Krissler <[email protected]> wrote:
>> > A way to reproduce it would be cool. Otherwise I could
>> > just ignore the error and recover and restart the CUDA runtime context.
>> > Since the new version changes quite a lot of things, I hope
>> > the error does not occur there.
>> > Do you have to reboot after those kinds of errors, or can you just restart
>> > the application?
>> >
>> >
>> >> Hey, sorry for my late reply. I suspect that this problem persists in
>> >> other CUDA applications.
>> >> However, here is the information:
>> >> 1. Linux version
>> >> Linux kakmonstret 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36
>> >> UTC 2009 x86_64 GNU/Linux
>> >>
>> >> 2. CPU info
>> >> vendor_id       : AuthenticAMD
>> >> model name      : AMD Athlon(tm) II X2 240 Processor
>> >> cpu MHz         : 2809.543
>> >>
>> >> 3. GPU and driver info
>> >> Device 0: "GeForce GTX 260"
>> >>   CUDA Driver Version:                           2.30
>> >>   CUDA Runtime Version:                          2.30
>> >>   CUDA Capability Major revision number:         1
>> >>   CUDA Capability Minor revision number:         3
>> >>   Total amount of global memory:                 938803200 bytes
>> >>   Clock rate:                                    1.46 GHz
>> >>
>> >> Device 1: "GeForce GTX 260"
>> >>   CUDA Driver Version:                           2.30
>> >>   CUDA Runtime Version:                          2.30
>> >>   CUDA Capability Major revision number:         1
>> >>   CUDA Capability Minor revision number:         3
>> >>   Total amount of global memory:                 939261952 bytes
>> >>   Clock rate:                                    1.46 GHz
>> >>
>> >> I'll try the newer versions from
>> >> http://www.nvidia.com/object/cuda_get.html
>> >> and will report back if it works.
>> >>
>> >> Regards Kugg
>> >>
>> >> On Mon, Oct 5, 2009 at 6:20 PM, Sascha Krissler <[email protected]> wrote:
>> >> > I do not have a solution for this problem or a good guess as to what
>> >> > the problem is, so I ask you to wait for the next release. If the
>> >> > problem remains, I will take a look at cuda-gdb and see whether it is
>> >> > usable, or write a kernel that generates more debugging information.
>> >> > cuda-gdb should be able to print information about the error, so if
>> >> > you want to invest time, you can try it out. It should be able to at
>> >> > least print the source file line number of the instruction that was
>> >> > responsible for the error in the case of the failed
>> >> > cudaThreadSynchronize. The error in the memcpy and the
>> >> > no_device_found are a different story, as no code is executed on the
>> >> > GPU in that case.
>> >> > Maybe your drivers are too old. Also, if you are on a 32-bit system
>> >> > you have to compile with -malign-double, as enabled by default in the
>> >> > Makefile.local.dist.
>> >> > Maybe you can post your NVIDIA driver version, CPU arch and Linux
>> >> > version.
>> >> >
>> >> >> Trying again gave me a similar error:
>> >> >> $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> --implementation sharedmem --algorithm A51 --device
>> >> >> cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> Initialize implementation sharedmem...
>> >> >> 106 chains done, current rate 1.77 chains/sec (interval: 00:01:00)
>> >> >> 6633 chains done, current rate 108.78 chains/sec (interval: 00:01:00)
>> >> >> 10350 chains done, current rate 61.95 chains/sec (interval: 00:01:00)
>> >> >> 14632 chains done, current rate 71.37 chains/sec (interval: 00:01:00)
>> >> >> 19810 chains done, current rate 86.30 chains/sec (interval: 00:01:00)
>> >> >> ../tmto/device/cuda/working_set_methods.hpp(38)[void
>> >> >> tmto::device::cuda::working_set::simple_host<T,
>> >> >> Round>::copyToDevice(int) [with T =
>> >> >> tmto::device::combined_work_item<tmto::algorithm::A51::data_type,
>> >> >> tmto::configuration::state::state<void, void,
>> >> >> tmto::condition::tag::rounds,
>> >> >> tmto::round_function::arguments::selector<tmto::round_function::tag::xor_,
>> >> >> tmto::condition::tag::distinguished_point,
>> >> >> tmto::round_function::generator::tag::sharedmem<tmto::round_function::gen
>> >> >>
>> >> >> Trying one more time I got
>> >> >> $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> --implementation sharedmem --algorithm A51 --device
>> >> >> cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
>> >> >> Initialize implementation sharedmem...
>> >> >> ../tmto/round_function/generator/sharedmem_methods.hpp(12)[void
>> >> >> tmto::round_function::generator::host_part<tmto::round_function::generator::tag::sharedmem<Real>
>> >> >> >::copyToDevice() const [with Real =
>> >> >> tmto::round_function::generator::tag::lfsr]]: cuda error: no
>> >> >> CUDA-capable device is available
>> >> >>
>> >> >> I'm running on two GeForce GTX 260s.
>> >> >>
>> >> >> Regards Kugg
>> >> >>
>> >> >> On 10/4/09, Christoffer Jerkeby <[email protected]> wrote:
>> >> >> > Hi, I got the same error. I was using the configuration generated from
>> >> >> > http://reflextor.com/cgi-bin/a51/a51id.cgi .
>> >> >> >
>> >> >> > $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> > xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> > --implementation sharedmem --algorithm A51 --device
>> >> >> > cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> > file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> > --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> >
>> >> >> > Initialize implementation sharedmem...
>> >> >> > 148 chains done, current rate 2.47 chains/sec (interval: 00:01:00)
>> >> >> > 6639 chains done, current rate 108.18 chains/sec (interval: 00:01:00)
>> >> >> > 10356 chains done, current rate 61.95 chains/sec (interval: 00:01:00)
>> >> >> > 14655 chains done, current rate 71.65 chains/sec (interval: 00:01:00)
>> >> >> > 19769 chains done, current rate 85.23 chains/sec (interval: 00:01:00)
>> >> >> > 24015 chains done, current rate 70.77 chains/sec (interval: 00:01:00)
>> >> >> > 28610 chains done, current rate 76.58 chains/sec (interval: 00:01:00)
>> >> >> > ../tmto/device/cuda/host_side_methods.hpp(76)[void
>> >> >> > tmto::device::cuda::cudaSynchronize()]: cuda error: unspecified launch failure
>> >> >> >
>> >> >> > Regards Kugg
>> >> >> >
>> >> >> > On 10/2/09, Sascha Krissler <[email protected]> wrote:
>> >> >> >> Gotta love those specific CUDA error codes.
>> >> >> >> Does it happen more than just once?
>> >> >> >> Did you use any form of signaling through the fifo, like changing the
>> >> >> >> number of operations?
>> >> >> >> (If it happens more frequently) does it always happen on the same card?
>> >> >> >> At which positions (chains done)?
>> >> >> >>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> After some time (around 2 hours) I get this error:
>> >> >> >>>
>> >> >> >>> 1334412 chains done, current rate 141.42 chains/sec (interval: 00:01:00)
>> >> >> >>> ../tmto/device/cuda/host_side_methods.hpp(76)[void
>> >> >> >>> tmto::device::cuda::cudaSynchronize()]: cuda error: unspecified launch failure
>> >> >> >>>
>> >> >> >>> This happens in only 1 process; the other processes on this machine are
>> >> >> >>> still running.
>> >> >> >>>
>> >
_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
