Thanks to Pontus for the kind explanation. He answered what I wanted to know: the standard way to create a dictionary (that is, the set of words used for ASR or NLP).
When creating a dictionary for speech recognition or other NLP tasks, we often need to control the size of the vocabulary. There are two ways to limit it: one is to drop words whose frequency falls below a threshold, as Pontus showed, and the other is to keep only the top N most frequent words (n-gram toolkits such as IRSTLM support this, and it is a popular way to control the required memory size). If I want to pick the most frequent words, I think I'll use DataFrame.

On Tue Dec 16 2014 at 15:31:00 Todd Leo <[email protected]> wrote:

> Could you provide any clue to help me locate the issue? I'm willing to
> make a PR but I am unable to find the related issue.
>
> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
>
>> There is not, but if I recall, there may be an open issue about this
>> functionality.
>>
>> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo <[email protected]> wrote:
>>
>>> Is there a partial sort equivalent to sortperm! ? Presumably
>>> selectperm! ?
>>>
>>> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>>>>
>>>> We have a select function as part of Base, which can do O(n) selection
>>>> of the top n:
>>>>
>>>> julia> v = randn(10^7);
>>>>
>>>> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
>>>> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>>>>
>>>> julia> let w = copy(v); @time select!(w,1:1000); end;
>>>> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>>>>
>>>> So for large arrays, this is substantially faster.
>>>>
>>>> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller <[email protected]> wrote:
>>>>
>>>>> This can be done in O(N). Avoid sorting as it will be O(NlogN)
>>>>>
>>>>> Here's one of many Q on how
>>>>> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
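
For reference, the two ways of limiting vocabulary size mentioned above (a frequency threshold and a top-N cut) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the helper names count_words, vocab_by_threshold and vocab_top_n are made up for illustration, the input is assumed to be an already tokenized list of words, and it targets current Julia releases, where the select! shown in the quoted thread is named partialsort! and the index-returning variant is partialsortperm.

```julia
# Count how often each word occurs in a list of tokens.
function count_words(tokens)
    counts = Dict{String,Int}()
    for t in tokens
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

# Way 1: keep every word whose count is at least `min_count`
# (hypothetical helper; result is sorted alphabetically).
function vocab_by_threshold(counts, min_count)
    return sort!([w for (w, c) in counts if c >= min_count])
end

# Way 2: keep the `n` most frequent words without fully sorting
# the word list (hypothetical helper).
function vocab_top_n(counts, n)
    words = collect(keys(counts))
    freqs = [counts[w] for w in words]
    k = min(n, length(words))
    k == 0 && return String[]
    # With rev=true, partialsortperm returns the indices of the k
    # largest counts, ordered from most to least frequent.
    idx = partialsortperm(freqs, 1:k; rev=true)
    return words[idx]
end

tokens = ["a", "b", "a", "c", "b", "a", "d"]
counts = count_words(tokens)

println(vocab_by_threshold(counts, 2))
println(vocab_top_n(counts, 2))
```

The top-N version only partially orders the counts, which is the same idea as the select! timing in the quoted thread. Note that ties at the cut-off are broken arbitrarily here, so a real dictionary builder may want an explicit tie-breaking rule.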
