Hey Sascha, I ran the NVIDIA memtest on one of the cards and found it to be faulty. I returned that card, but the other one is now doing a good 190-260 chains/s. When I get the replacement back I will work on optimizing the speed of both (following the advice on this list).
Thanks for your help and patience.

Regards Kugg

On Tue, Oct 20, 2009 at 9:16 PM, Sascha Krissler <[email protected]> wrote:
> A rate below 100 chains per second suggests that your GPU is running at
> 399 MHz.
> My 260 is slowed down to this clock speed after a launch failure in the kernel,
> which requires
> a reboot to restore the original 576 MHz clock speed. is this the case? you
> can see the
> clock speed with the nvidia-settings tool or with nvclock. The error never
> happened on my box,
> which is a GTX260 (single card) and i have already computed 3 months' worth of
> chains.
> Hope your investigation brings more insight
>
>> This is probably going to sound weird but "mostly no".
>> So, I can run like this ten times and get the same error over and
>> over again. Eventually I will get a message saying no cuda devices
>> available, which requires me to reboot.
>>
>> Sascha, I will try to think of a way I can help you reproduce it.
>>
>> I don't mind waiting for the new version. But I suspect this bug is in
>> the cuda environment itself, since I get the same error with the same
>> frequency of problems when running for instance gpu_md5_bruteforce.
>>
>> I will play around with this some more during the coming days to see
>> if I can come up with some more informative output.
>>
>> Thanks Kugg
>>
>> On Tue, Oct 20, 2009 at 9:12 PM, Sascha Krissler <[email protected]>
>> wrote:
>> > A way to reproduce would be cool. Otherwise i could
>> > just ignore the error and recover+restart the cuda runtime context.
>> > Since the new version changes quite a lot of things, i hope
>> > the error does not occur there.
>> > Do you have to reboot after those kinds of errors or can you just restart
>> > the application?
>> >
>> >
>> >> Hey, sorry for my late reply. I suspect that this problem persists in
>> >> other cuda applications.
>> >> However, here is the information:
>> >> 1. Linux version
>> >> Linux kakmonstret 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36
>> >> UTC 2009 x86_64 GNU/Linux
>> >>
>> >> 2. CPU info
>> >> vendor_id : AuthenticAMD
>> >> model name : AMD Athlon(tm) II X2 240 Processor
>> >> cpu MHz : 2809.543
>> >>
>> >> 3. GPU and driver info
>> >> Device 0: "GeForce GTX 260"
>> >> CUDA Driver Version: 2.30
>> >> CUDA Runtime Version: 2.30
>> >> CUDA Capability Major revision number: 1
>> >> CUDA Capability Minor revision number: 3
>> >> Total amount of global memory: 938803200 bytes
>> >> Clock rate: 1.46 GHz
>> >>
>> >> Device 1: "GeForce GTX 260"
>> >> CUDA Driver Version: 2.30
>> >> CUDA Runtime Version: 2.30
>> >> CUDA Capability Major revision number: 1
>> >> CUDA Capability Minor revision number: 3
>> >> Total amount of global memory: 939261952 bytes
>> >> Clock rate: 1.46 GHz
>> >>
>> >> I'll try with the newer versions from:
>> >> http://www.nvidia.com/object/cuda_get.html
>> >> and will report back if it works.
>> >>
>> >> Regards Kugg
>> >>
>> >> On Mon, Oct 5, 2009 at 6:20 PM, Sascha Krissler <[email protected]>
>> >> wrote:
>> >> > i do not have a solution for this problem or a good guess what the
>> >> > problem
>> >> > is, so i ask you to wait for the next release and if the problem
>> >> > remains i will
>> >> > take a look at cuda-gdb and see whether it is usable or write a kernel
>> >> > that generates
>> >> > more debugging information.
>> >> > cuda-gdb should be able to print information about the error, so if you
>> >> > want to invest
>> >> > time, you can try it out. it should be able to at least print the
>> >> > source file line number
>> >> > of the instruction that was responsible for the error in the case of
>> >> > the failed cudaThreadSynchronize.
>> >> > the error in the memcpy and the no_device_found are a different story
>> >> > as no code is
>> >> > executed on the GPU in that case.
>> >> > maybe your drivers are too old. also if you are on a 32bit system you
>> >> > have to compile
>> >> > with -malign-double as enabled by default in the Makefile.local.dist.
>> >> > Maybe you can post nvidia driver version, cpu arch and linux version.
>> >> >
>> >> >> Trying again gave me a similar error:
>> >> >> $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> --implementation sharedmem --algorithm A51 --device
>> >> >> cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> Initialize implementation sharedmem...
>> >> >> 106 chains done, current rate 1.77 chains/sec (interval: 00:01:00)
>> >> >> 6633 chains done, current rate 108.78 chains/sec (interval: 00:01:00)
>> >> >> 10350 chains done, current rate 61.95 chains/sec (interval: 00:01:00)
>> >> >> 14632 chains done, current rate 71.37 chains/sec (interval: 00:01:00)
>> >> >> 19810 chains done, current rate 86.30 chains/sec (interval: 00:01:00)
>> >> >> ../tmto/device/cuda/working_set_methods.hpp(38)[void
>> >> >> tmto::device::cuda::working_set::simple_host<T,
>> >> >> Round>::copyToDevice(int) [with T =
>> >> >> tmto::device::combined_work_item<tmto::algorithm::A51::data_type,
>> >> >> tmto::configuration::state::state<void, void,
>> >> >> tmto::condition::tag::rounds,
>> >> >> tmto::round_function::arguments::selector<tmto::round_function::tag::xor_,
>> >> >> tmto::condition::tag::distinguished_point,
>> >> >> tmto::round_function::generator::tag::sharedmem<tmto::round_function::gen
>> >> >>
>> >> >> Trying one more time I got
>> >> >> $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> --implementation sharedmem --algorithm A51 --device
>> >> >> cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> NVIDIA: could not open the device file /dev/nvidia0 (Input/output
>> >> >> error).
>> >> >> Initialize implementation sharedmem...
>> >> >> ../tmto/round_function/generator/sharedmem_methods.hpp(12)[void
>> >> >> tmto::round_function::generator::host_part<tmto::round_function::generator::tag::sharedmem<Real>
>> >> >> >::copyToDevice() const [with Real =
>> >> >> tmto::round_function::generator::tag::lfsr]]: cuda error: no
>> >> >> CUDA-capable device is available
>> >> >>
>> >> >> I'm running on two GeForce GTX 260s
>> >> >>
>> >> >> Regards Kugg
>> >> >>
>> >> >> On 10/4/09, Christoffer Jerkeby <[email protected]> wrote:
>> >> >> > Hi, I got the same error. I was using the configuration generated from
>> >> >> > http://reflextor.com/cgi-bin/a51/a51id.cgi .
>> >> >> >
>> >> >> > $ ./a51table --condition rounds:rounds=32 --roundfunc
>> >> >> > xor:condition=distinguished_point::bits=15:generator=lfsr::tablesize=32::advance=139584
>> >> >> > --implementation sharedmem --algorithm A51 --device
>> >> >> > cuda:operations=512 --work random:prefix=11,0 --consume
>> >> >> > file:prefix=data:append --logger normal generate --chains 380000000
>> >> >> > --chainlength 3000000 --intermediate filter:runlength=512
>> >> >> >
>> >> >> > Initialize implementation sharedmem...
>> >> >> > 148 chains done, current rate 2.47 chains/sec (interval: 00:01:00)
>> >> >> > 6639 chains done, current rate 108.18 chains/sec (interval: 00:01:00)
>> >> >> > 10356 chains done, current rate 61.95 chains/sec (interval: 00:01:00)
>> >> >> > 14655 chains done, current rate 71.65 chains/sec (interval: 00:01:00)
>> >> >> > 19769 chains done, current rate 85.23 chains/sec (interval: 00:01:00)
>> >> >> > 24015 chains done, current rate 70.77 chains/sec (interval: 00:01:00)
>> >> >> > 28610 chains done, current rate 76.58 chains/sec (interval: 00:01:00)
>> >> >> > ../tmto/device/cuda/host_side_methods.hpp(76)[void
>> >> >> > tmto::device::cuda::cudaSynchronize()]: cuda error: unspecified
>> >> >> > launch
>> >> >> > failure
>> >> >> >
>> >> >> > Regards Kugg
>> >> >> >
>> >> >> > On 10/2/09, Sascha Krissler <[email protected]> wrote:
>> >> >> >> gotta love those specific cuda error codes.
>> >> >> >> does it happen more than just once?
>> >> >> >> did you use any form of signaling through the fifo, like change
>> >> >> >> number of
>> >> >> >> operations?
>> >> >> >> (if it happens more frequently) does it always happen on the same
>> >> >> >> card?
>> >> >> >> at which positions? (chains done).
>> >> >> >>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> after some time (around 2 hours) i get this error:
>> >> >> >>>
>> >> >> >>> 1334412 chains done, current rate 141.42 chains/sec (interval:
>> >> >> >>> 00:01:00)
>> >> >> >>> ../tmto/device/cuda/host_side_methods.hpp(76)[void
>> >> >> >>> tmto::device::cuda::cudaSynchronize()]: cuda error: unspecified
>> >> >> >>> launch
>> >> >> >>> failure
>> >> >> >>>
>> >> >> >>> this happens only on 1 process, other processes on this machine are
>> >> >> >>> still running..
>> >> >> >>>
>> >
>> > _______________________________________________
>> > A51 mailing list
>> > [email protected]
>> > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
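
P.S. On the rule of thumb quoted above (a rate below 100 chains/sec hints that the GPU has dropped to its reduced clock): here is a minimal sketch, in Python, of scanning a saved a51table log for such intervals. The function name slow_intervals, the default threshold of 100.0, and the idea of saving the output to a text file are assumptions made for illustration and are not part of a51table. The regular expression is based only on the progress lines shown in this thread. A low rate is only a hint, so the actual clock should still be checked with nvidia-settings or nvclock.

```python
import re

# Matches progress lines as they appear in this thread, e.g.
#   "6633 chains done, current rate 108.78 chains/sec (interval: 00:01:00)"
RATE_LINE = re.compile(r"(\d+) chains done, current rate ([\d.]+) chains/sec")


def slow_intervals(log_text, threshold=100.0):
    """Return (chains_done, rate) pairs whose rate is below threshold.

    threshold is a hypothetical default taken from the "< 100 chains/sec"
    remark in the thread; adjust it for other cards.
    """
    slow = []
    for match in RATE_LINE.finditer(log_text):
        chains_done = int(match.group(1))
        rate = float(match.group(2))
        if rate < threshold:
            slow.append((chains_done, rate))
    return slow


# Sample taken from the log output quoted in the thread.
sample = """\
106 chains done, current rate 1.77 chains/sec (interval: 00:01:00)
6633 chains done, current rate 108.78 chains/sec (interval: 00:01:00)
10350 chains done, current rate 61.95 chains/sec (interval: 00:01:00)
"""

print(slow_intervals(sample))  # [(106, 1.77), (10350, 61.95)]
```

The very first interval is usually slow because of start-up, so it may make sense to skip it before drawing conclusions.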
