Hi,

I recently added an option to experiment.perl to first copy all big model files to local disk before running the decoder. To use it, set the parameter cache-model = "/scratch/disk/path" in the [GENERAL] section. This works well in our GridEngine setup.

-phi
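In the EMS configuration file, that setting would look something like this (a minimal sketch; the scratch path is just the placeholder from the message above):

    [GENERAL]
    # copy large model files to node-local disk before running the decoder;
    # the path below is a placeholder for your cluster's local scratch space
    cache-model = "/scratch/disk/path"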
On Tue, Apr 12, 2016 at 9:03 AM, Ondrej Bojar <[email protected]> wrote:

Hi,

back to your question on getting the files onto local disks where the tuning jobs will run: this was never easy with the current implementation, but in fact, with multithreaded Moses, the benefit of parallelizing across nodes is vanishing.

So I'd pass some queue parameters to force the job to land on one of a very few nodes that already have the files there.

Also, we have all our temps cross-mounted, so what I sometimes do is let the job run anywhere but take the data from the local temp of another fixed machine. Yes, this wastes network bandwidth, but it relieves the flooded (or incapable) main file server.

Cheers, O.

----- Original Message -----
From: "Jorg Tiedemann" <[email protected]>
To: "Kenneth Heafield" <[email protected]>
Cc: [email protected]
Sent: Tuesday, 12 April, 2016 14:45:57
Subject: Re: [Moses-support] loading time for large LMs

Well, this is on a shared login node and maybe not very representative of other nodes in the cluster. I can see if I can get a more representative figure, but it's quite busy on our cluster right now.

All the best,
Jörg

Jörg Tiedemann
[email protected]

On 12 Apr 2016, at 14:54, Kenneth Heafield <[email protected]> wrote:

Hi,

Why is your system using 7 GB of swap out of 9 GB? Moses is only taking 147 GB out of 252 GB physical. I smell other processes taking up RAM, possibly those 5 stopped and 1 zombie.

Kenneth

On 04/12/2016 12:45 PM, Jorg Tiedemann wrote:

> Did you remove all "lazyken" arguments from moses.ini?

Yes, I did.

> Is the network filesystem Lustre? If so, mmap will perform terribly and you should use load=read or (better) load=parallel_read since reading from Lustre is CPU-bound.

Yes, I think so. Interesting with the parallel_read option. Can this hurt for some setups, or could I use it as my standard?

> Does the cluster management software/job scheduler/sysadmin impose a resident memory limit?

I don't really know. I don't think so, but I need to find out.

> Can you copy-paste `top' when it's running slow and the stderr at that time?
Here is the top of my `top' output when running on my test node:

top - 14:39:03 up 50 days, 5:47, 0 users, load average: 1.97, 2.09, 3.85
Tasks: 814 total, 3 running, 805 sleeping, 5 stopped, 1 zombie
Cpu(s): 6.9%us, 6.2%sy, 0.0%ni, 86.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264493500k total, 263614188k used, 879312k free, 68680k buffers
Swap: 9775548k total, 7198920k used, 2576628k free, 69531796k cached

  PID USER     PR NI VIRT RES  SHR S %CPU  %MEM TIME+    COMMAND
42528 tiedeman 20  0  147g 147g 800 R 100.0 58.4 31:25.01 moses

stderr doesn't say anything new besides the message from the start of feature function loading:

FeatureFunction: LM0 start: 16 end: 16
line=KENLM load=parallel_read name=LM1 factor=0 path=/homeappl/home/tiedeman/research/SMT/wmt16/fi-en/data/monolingual/cc.tok.3.en.trie.kenlm order=3

I am trying with /tmp/ now as well (it takes time to shuffle the big files around, though).

Jörg

On 04/12/2016 08:26 AM, Jorg Tiedemann wrote:

No, it's definitely not waiting for input; the same setup works for smaller models.

I have the models on a work partition on our cluster. This is probably not good enough, and I will try to move the data to local tmp on the individual nodes before executing. Hopefully this helps. How would you do this if you want to distribute tuning?

Thanks!
Jörg

On 12 Apr 2016, at 09:34, Ondrej Bojar <[email protected]> wrote:

Random suggestion: isn't it waiting for stdin for some strange reason? ;-)

O.

On April 12, 2016 8:20:46 AM CEST, Hieu Hoang <[email protected]> wrote:

I assume that it's on local disk rather than a network drive.

Are you sure it's still in the loading stage, and that it's loading KenLM rather than the phrase table or lexicalized reordering model, etc.?

If there's a way to make the model files available for download, or to give me access to your machine, I might be able to debug it.

Hieu Hoang
http://www.hoang.co.uk/hieu

On 12 Apr 2016 08:41, "Jorg Tiedemann" <[email protected]> wrote:

Unfortunately, load=read didn't help. It's been loading for 7 hours now with no sign of starting to decode. The disk is not terribly slow; cat worked without problem. I don't know what to do, but I think I have to give up for now. Am I the only one who is experiencing such slow loading times?

Thanks again for your help!
Jörg

On 10 Apr 2016, at 22:27, Kenneth Heafield <[email protected]> wrote:

With load=read:

Acts like normal RAM as part of the Moses process.

Supports huge pages via transparent huge pages, so it's slightly faster.

Before loading, cat file >/dev/null will just put things into cache that were going to be read more or less like cat anyway.

After loading, cat file >/dev/null will hurt, since there's the potential to load the file into RAM twice and swap out bits of Moses.

Memory is shared between threads, just not with the disk cache (ok, maybe, but only if they get huge pages support to work well) or other processes that independently read the file.

With load=populate:

Loads upfront and maps the file into the process; the kernel seems to evict it first.

Before loading, cat file >/dev/null might help, but in theory MAP_POPULATE should be doing much the same thing.

After loading, or during slow loading, cat file >/dev/null can help because it forces the data back into RAM. This is particularly useful if the Moses process came under memory pressure after loading, which can include heavy disk activity even if RAM isn't full.

Memory is shared with all other processes that mmap.

With load=lazy:

Maps the file into the process with lazy loading (i.e. mmap without MAP_POPULATE). Not recommended for decoding, but useful if you've got a 6 TB file and want to send it a few thousand queries.

cat will definitely help here at any time.

Memory is shared with all other processes that mmap.

On 04/10/2016 06:50 PM, Jorg Tiedemann wrote:

Thanks for the quick reply. I will try the load option.

Quick question: you said that the memory will not be shared across processes with that option. Does that mean that it will load the LM for each thread? That would mean a lot in my setup.

By the way, I also did the cat >/dev/null thing, but I didn't have the impression that it changed a lot. Does it really help, and how much would you usually gain? Thanks again!

Jörg

On 10 Apr 2016, at 12:55, Kenneth Heafield <[email protected]> wrote:

Hi,

I'm assuming you have enough RAM to fit everything. The kernel seems to preferentially evict mmapped pages as memory usage approaches full (it doesn't have to be full). To work around this, use

load=read

in your moses.ini line for the models. REMOVE any "lazyken" argument, which is deprecated and might override the load= argument.

The effect of load=read is to malloc (ok, an anonymous mmap, which is how malloc is implemented anyway) at a 1 GB aligned address (to optimize for huge pages) and read() the file into that memory. It will no longer share across processes, but the memory will have the same swappiness as the rest of the Moses process.

Lazy loading will only make things worse here.

Kenneth
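Concretely, the KENLM feature line in moses.ini would then look roughly like the one quoted earlier in this thread, with load=read in place of the deprecated lazyken argument (a sketch: the [feature] section header is the usual location, and the path is the one from Jörg's setup):

    [feature]
    KENLM name=LM1 factor=0 order=3 load=read path=/homeappl/home/tiedeman/research/SMT/wmt16/fi-en/data/monolingual/cc.tok.3.en.trie.kenlm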
On 04/10/2016 07:29 AM, Jorg Tiedemann wrote:

Hi,

I have a large language model built from the Common Crawl data set, and it takes forever to load when running Moses. My model is a trigram KenLM binarized with quantization, trie structures, and pointer compression (-a 22 -q 8 -b 8). The model is about 140 GB, and it takes hours to load (I'm still waiting). I run on a machine with 256 GB RAM.

I also tried lazy loading without success. Is this normal, or am I doing something wrong?

Thanks for your help!

Jörg
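For reference, a model with those settings would typically be produced by KenLM's build_binary roughly as follows (a sketch: the ARPA input filename is an assumption; only the trie structure, the -a/-q/-b values, and the output name come from the thread):

    # quantize probabilities (-q 8) and backoffs (-b 8), compress trie pointers (-a 22);
    # the .arpa input name is hypothetical
    build_binary -a 22 -q 8 -b 8 trie cc.tok.3.en.arpa cc.tok.3.en.trie.kenlm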
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
