If you want to retain N words, perhaps a priority queue would be useful?

http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue

I'd be cautious about drawing many coding lessons from the TextAnalysis 
package, which has never been optimized for performance.
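
For the top-N case discussed in the quoted thread below, here is a minimal sketch that uses only Base. It assumes a recent Julia, where the select! function mentioned in the thread is named partialsort!. The helper name topwords is hypothetical and is not part of Base or TextAnalysis.

```julia
# Hypothetical helper: count word frequencies, then keep the n most frequent.
function topwords(words, n::Integer)
    counts = Dict{String,Int}()
    for w in words
        counts[w] = get(counts, w, 0) + 1
    end
    pairs = collect(counts)          # Vector of word => count pairs
    k = min(n, length(pairs))
    k == 0 && return pairs[1:0]      # nothing to select
    # Only the first k positions are put in order (by count, descending);
    # the rest of the vector is left unsorted, which avoids a full sort.
    return partialsort!(pairs, 1:k; by = last, rev = true)
end

top = topwords(split("a b a c b a"), 2)
```

The order among words with equal counts is not guaranteed, so ties at the cutoff may be broken arbitrarily.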

 -- John

On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA <[email protected]> wrote:

> Thanks to Pontus for the kind explanation. He answered what I wanted to know.
> I want to know the standard way to create a dictionary (that is, a set of 
> words for ASR or NLP).
> 
> To create a dictionary for speech recognition or other NLP tasks, we often 
> control the size of the vocabulary. There are two ways to limit it: one is 
> to drop words below a frequency threshold, as Pontus showed, and the other 
> is to keep the top N most frequent words (n-gram toolkits such as IRSTLM 
> support this, and it is a popular way to control memory use). If I want to 
> pick the most frequent words, I think I'll use a DataFrame.
> 
> On Tue Dec 16 2014 at 15:31:00 Todd Leo <[email protected]> wrote:
> Could you give me any pointers to help locate the issue? I'm willing to 
> make a PR, but I have been unable to find the related issue.
> 
> 
> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
> There is not, but if I recall, there may be an open issue about this 
> functionality.
> 
> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo <[email protected]> wrote:
> Is there a partial-sort equivalent to sortperm!? Presumably selectperm!?
> 
> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
> We have a select function as part of Base, which can do O(n) selection of the 
> top n:
> 
> julia> v = randn(10^7);
> 
> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
> elapsed time: 0.882989281 seconds (8168 bytes allocated)
> 
> julia> let w = copy(v); @time select!(w,1:1000); end;
> elapsed time: 0.054981192 seconds (8192 bytes allocated)
> 
> So for large arrays, this is substantially faster.
> 
> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller <[email protected]> wrote:
> This can be done in O(N). Avoid sorting, as it will be O(N log N).
> 
> Here's one of many questions on how to do it:
> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
> 
> 
