Thanks for your answer, John. I'll use a PriorityQueue to get the top N words, and if I want to cut off by a frequency threshold, I'll build the dictionary with a Dict.
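A minimal sketch of the two pruning strategies, assuming a current Julia (1.x) and a corpus that is already tokenized. The helper names `count_words`, `prune_by_threshold`, and `top_n_words` are made up for illustration. To stay free of package dependencies, the top-N step uses Base's `partialsort!` (the current name of the `select!` function mentioned further down the thread) rather than a PriorityQueue.

```julia
# Count how often each word occurs (hypothetical helper, not from any package).
function count_words(tokens)
    counts = Dict{String,Int}()
    for t in tokens
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

# Strategy 1: keep only the words seen at least `min_count` times.
prune_by_threshold(counts, min_count) = filter(kv -> last(kv) >= min_count, counts)

# Strategy 2: keep the `n` most frequent words.
# Ties are broken alphabetically so the result is deterministic.
function top_n_words(counts, n)
    entries = collect(counts)                 # Vector of word => count pairs
    k = min(n, length(entries))
    k == 0 && return String[]
    # Partial sort: only the first k positions end up in sorted order.
    best = partialsort!(entries, 1:k; by = kv -> (-last(kv), first(kv)))
    return [first(kv) for kv in best]
end

tokens = ["a", "b", "a", "c", "b", "a", "d"]
counts = count_words(tokens)
top2   = top_n_words(counts, 2)               # ["a", "b"]
pruned = prune_by_threshold(counts, 2)        # keeps "a" and "b" with their counts
```

A priority queue becomes the better fit when the counts arrive as a stream and the full table should not be held in memory; for a Dict that already fits in memory, a partial sort over its entries is usually enough.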
---
Michiaki

On Tue Dec 16 2014 at 22:53:34 John Myles White <[email protected]> wrote:

> If you want to retain N words, perhaps a priority queue would be useful?
>
> http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue
>
> I'd be cautious about drawing many coding lessons from the TextAnalysis
> package, which has never been optimized for performance.
>
> -- John
>
> On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA <[email protected]> wrote:
>
> Thanks for Pontus's kind explanation. He answered what I wanted to know.
> I want to know the standard way to create a dictionary (that is, a set of
> words for ASR or NLP).
>
> When creating a dictionary for speech recognition or other NLP tasks, we
> often control the size of the vocabulary. There are two ways to limit it:
> one is to cut off words below a threshold frequency, as Pontus showed, and
> the other is to pick the top N most frequent words (n-gram toolkits such
> as IRSTLM support this, and it is a popular way to control the memory
> required). If I want to pick frequent words, I think I'll use DataFrame.
>
> On Tue Dec 16 2014 at 15:31:00 Todd Leo <[email protected]> wrote:
>
>> Could you give me any clue to help me locate the issue? I'm willing to
>> make a PR, but I am unable to find the related issue.
>>
>> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
>>
>>> There is not, but if I recall, there may be an open issue about this
>>> functionality.
>>>
>>> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo <[email protected]> wrote:
>>>
>>>> Is there a partial-sort equivalent of sortperm! ? Presumably
>>>> selectperm! ?
>>>>
>>>> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>>>>>
>>>>> We have a select function as part of Base, which can do O(n) selection
>>>>> of the top n:
>>>>>
>>>>> julia> v = randn(10^7);
>>>>>
>>>>> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
>>>>> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>>>>>
>>>>> julia> let w = copy(v); @time select!(w,1:1000); end;
>>>>> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>>>>>
>>>>> So for large arrays, this is substantially faster.
>>>>>
>>>>> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller <[email protected]> wrote:
>>>>>
>>>>>> This can be done in O(N). Avoid sorting, as it will be O(N log N).
>>>>>>
>>>>>> Here's one of many questions on how:
>>>>>> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
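
For readers on a current Julia (1.x): the `select!` function used in the timing above is now called `partialsort!`, and the index-returning variant asked about in the thread is available in Base as `partialsortperm` (with an in-place `partialsortperm!`). A small sketch, using made-up data:

```julia
v = [0.3, -1.2, 2.5, 0.0, -0.7]

# Indices of the two smallest values, in ascending order of value.
idx = partialsortperm(v, 1:2)

# The two smallest values themselves; copy first because partialsort! reorders its input.
smallest = partialsort!(copy(v), 1:2)

# Indices of the two largest values.
idx_largest = partialsortperm(v, 1:2; rev = true)
```

Indexing the original array with `idx` gives the same values as `smallest`, which is what makes the permutation form handy when the values are counts and the positions identify the words.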
