Hi,

I recently added an option to experiment.perl to first copy all big model files to local disk before running the decoder. To use it, set the parameter cache-model = "/scratch/disk/path" in the [GENERAL] section. This works well in our GridEngine setup.

-phi
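In the EMS configuration file, that setting would look something like this (a minimal sketch; the scratch path is just the placeholder from the message above):

    [GENERAL]
    # copy large model files to node-local disk before running the decoder;
    # the path below is a placeholder for your cluster's local scratch space
    cache-model = "/scratch/disk/path"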
On Tue, Apr 12, 2016 at 9:03 AM, Ondrej Bojar <[email protected]> wrote:

Hi,

back to your question on getting the files onto local disks where the tuning jobs will run: this was never easy with the current implementation, but in fact, with multithreaded Moses, the benefit of parallelizing across nodes is vanishing.

So I'd pass some queue parameters to force the job to land on one of a very few nodes that already have the files there.

Also, we have all our temps cross-mounted, so what I sometimes do is let the job run anywhere but take the data from the local temp of another fixed machine. Yes, this wastes network bandwidth, but it relieves the flooded (or incapable) main file server.

Cheers, O.

----- Original Message -----
From: "Jorg Tiedemann" <[email protected]>
To: "Kenneth Heafield" <[email protected]>
Cc: [email protected]
Sent: Tuesday, 12 April, 2016 14:45:57
Subject: Re: [Moses-support] loading time for large LMs

Well, this is on a shared login node and maybe not very representative of other nodes in the cluster. I can see if I can get a more representative figure, but it's quite busy on our cluster right now.

All the best,
Jörg

Jörg Tiedemann
[email protected]

On 12 Apr 2016, at 14:54, Kenneth Heafield <[email protected]> wrote:

Hi,

Why is your system using 7 GB of swap out of 9 GB? Moses is only taking 147 GB out of 252 GB physical. I smell other processes taking up RAM, possibly those 5 stopped and 1 zombie.

Kenneth

On 04/12/2016 12:45 PM, Jorg Tiedemann wrote:

> Did you remove all "lazyken" arguments from moses.ini?

Yes, I did.

> Is the network filesystem Lustre? If so, mmap will perform terribly and you should use load=read or (better) load=parallel_read since reading from Lustre is CPU-bound.

Yes, I think so. Interesting with the parallel_read option. Can this hurt for some setups, or could I use it as my standard?

> Does the cluster management software/job scheduler/sysadmin impose a resident memory limit?

I don't really know. I don't think so, but I need to find out.

> Can you copy-paste `top' when it's running slow and the stderr at that time?
Here is the top of my `top' output when running on my test node:

top - 14:39:03 up 50 days, 5:47, 0 users, load average: 1.97, 2.09, 3.85
Tasks: 814 total, 3 running, 805 sleeping, 5 stopped, 1 zombie
Cpu(s): 6.9%us, 6.2%sy, 0.0%ni, 86.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264493500k total, 263614188k used, 879312k free, 68680k buffers
Swap: 9775548k total, 7198920k used, 2576628k free, 69531796k cached

  PID USER     PR NI VIRT RES  SHR S %CPU  %MEM TIME+    COMMAND
42528 tiedeman 20  0  147g 147g 800 R 100.0 58.4 31:25.01 moses

stderr doesn't say anything new besides the message from the start of feature function loading:

FeatureFunction: LM0 start: 16 end: 16
line=KENLM load=parallel_read name=LM1 factor=0 path=/homeappl/home/tiedeman/research/SMT/wmt16/fi-en/data/monolingual/cc.tok.3.en.trie.kenlm order=3

I am trying with /tmp/ now as well (it takes time to shuffle the big files around, though).

Jörg

On 04/12/2016 08:26 AM, Jorg Tiedemann wrote:

No, it's definitely not waiting for input; the same setup works for smaller models.

I have the models on a work partition on our cluster. This is probably not good enough, and I will try to move the data to local tmp on the individual nodes before executing. Hopefully this helps. How would you do this if you want to distribute tuning?

Thanks!
Jörg

On 12 Apr 2016, at 09:34, Ondrej Bojar <[email protected]> wrote:

Random suggestion: isn't it waiting for stdin for some strange reason? ;-)

O.

On April 12, 2016 8:20:46 AM CEST, Hieu Hoang <[email protected]> wrote:

I assume that it's on local disk rather than a network drive.

Are you sure it's still in the loading stage, and that it's loading KenLM rather than the phrase table or lexicalized reordering model, etc.?

If there's a way to make the model files available for download, or to give me access to your machine, I might be able to debug it.

Hieu Hoang
http://www.hoang.co.uk/hieu

On 12 Apr 2016 08:41, "Jorg Tiedemann" <[email protected]> wrote:

Unfortunately, load=read didn't help. It's been loading for 7 hours now with no sign of starting to decode. The disk is not terribly slow; cat worked without problem. I don't know what to do, but I think I have to give up for now. Am I the only one who is experiencing such slow loading times?

Thanks again for your help!
Jörg

On 10 Apr 2016, at 22:27, Kenneth Heafield <[email protected]> wrote:

With load=read:

Acts like normal RAM as part of the Moses process.

Supports huge pages via transparent huge pages, so it's slightly faster.

Before loading, cat file >/dev/null will just put things into cache that were going to be read more or less like cat anyway.

After loading, cat file >/dev/null will hurt, since there's the potential to load the file into RAM twice and swap out bits of Moses.

Memory is shared between threads, just not with the disk cache (ok, maybe, but only if they get huge pages support to work well) or other processes that independently read the file.

With load=populate:

Loads upfront and maps the file into the process; the kernel seems to evict it first.

Before loading, cat file >/dev/null might help, but in theory MAP_POPULATE should be doing much the same thing.

After loading, or during slow loading, cat file >/dev/null can help because it forces the data back into RAM. This is particularly useful if the Moses process came under memory pressure after loading, which can include heavy disk activity even if RAM isn't full.

Memory is shared with all other processes that mmap.

With load=lazy:

Maps the file into the process with lazy loading (i.e. mmap without MAP_POPULATE). Not recommended for decoding, but useful if you've got a 6 TB file and want to send it a few thousand queries.

cat will definitely help here at any time.

Memory is shared with all other processes that mmap.

On 04/10/2016 06:50 PM, Jorg Tiedemann wrote:

Thanks for the quick reply. I will try the load option.

Quick question: you said that the memory will not be shared across processes with that option. Does that mean that it will load the LM for each thread? That would mean a lot in my setup.

By the way, I also did the cat >/dev/null thing, but I didn't have the impression that it changed a lot. Does it really help, and how much would you usually gain? Thanks again!

Jörg

On 10 Apr 2016, at 12:55, Kenneth Heafield <[email protected]> wrote:

Hi,

I'm assuming you have enough RAM to fit everything. The kernel seems to preferentially evict mmapped pages as memory usage approaches full (it doesn't have to be full). To work around this, use

load=read

in your moses.ini line for the models. REMOVE any "lazyken" argument, which is deprecated and might override the load= argument.

The effect of load=read is to malloc (ok, an anonymous mmap, which is how malloc is implemented anyway) at a 1 GB aligned address (to optimize for huge pages) and read() the file into that memory. It will no longer share across processes, but the memory will have the same swappiness as the rest of the Moses process.

Lazy loading will only make things worse here.

Kenneth
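Concretely, the KENLM feature line in moses.ini would then look roughly like the one quoted earlier in this thread, with load=read in place of the deprecated lazyken argument (a sketch: the [feature] section header is the usual location, and the path is the one from Jörg's setup):

    [feature]
    KENLM name=LM1 factor=0 order=3 load=read path=/homeappl/home/tiedeman/research/SMT/wmt16/fi-en/data/monolingual/cc.tok.3.en.trie.kenlm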
On 04/10/2016 07:29 AM, Jorg Tiedemann wrote:

Hi,

I have a large language model built from the Common Crawl data set, and it takes forever to load when running Moses. My model is a trigram KenLM binarized with quantization, trie structures, and pointer compression (-a 22 -q 8 -b 8). The model is about 140 GB, and it takes hours to load (I'm still waiting). I run on a machine with 256 GB RAM.

I also tried lazy loading without success. Is this normal, or am I doing something wrong?

Thanks for your help!

Jörg
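For reference, a model with those settings would typically be produced by KenLM's build_binary roughly as follows (a sketch: the ARPA input filename is an assumption; only the trie structure, the -a/-q/-b values, and the output name come from the thread):

    # quantize probabilities (-q 8) and backoffs (-b 8), compress trie pointers (-a 22);
    # the .arpa input name is hypothetical
    build_binary -a 22 -q 8 -b 8 trie cc.tok.3.en.arpa cc.tok.3.en.trie.kenlm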
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
