Well I think your picture demonstrates why its hard to beat? all the 'huge numbers' are in the root arcs which are easily cached (and cached for latin-1 by default).
So its bad in that it bloats the FST, but ultimately bloats it where it doesn't hurt. Of course if someone wants to use this for say, CJK, then we need to add an option to cache more root arcs (just like Kuromoji does). If it costs you 1 or 2MB ram to have a faster suggester... i think thats an ok tradeoff. but yeah, if there is a more efficient way thats also not bloated, i'm interested! On Fri, Mar 30, 2012 at 2:23 PM, Dawid Weiss <[email protected]> wrote: > Clear, I didn't think of that and had positive weights in my mind > only. Thanks for explanation. > > Dawid > > On Fri, Mar 30, 2012 at 8:17 PM, Robert Muir <[email protected]> wrote: >> I think its listed in the documentation (at least the jira issue!). >> Now that shortestPaths() is generified to any outputs, please feel >> free to find a more compact/faster output!!!!!!! >> >> Basically I experimented with tons of ideas (but the simplest ugliest >> hack was most successful it seemed?) >> >> we take the cost, and encode it as Integer.MAX_VALUE - cost to turn it >> into a "weight". >> >> This is bad because its huge vints, but other approaches i tried were >> sketchier, requiring encoding of negatives and such (see the recent >> JIRA issue about that which i tripped in the process!) >> >> I think i wrote 5 or 6 outputs impls trying to make this cleaner >> before just deciding to commit the simple approach... >> >> On Fri, Mar 30, 2012 at 2:13 PM, Dawid Weiss >> <[email protected]> wrote: >>> I'm doing this on trunk: >>> >>> >>> WFSTCompletionLookup cl = new WFSTCompletionLookup(); >>> TermFreq [] input = new TermFreq [] { >>> new TermFreq("cat", 0), >>> new TermFreq("chat", 2), >>> new TermFreq("fat", 3), >>> new TermFreq("feat", 1), >>> new TermFreq("sea", 0), >>> new TermFreq("seat", 3), >>> new TermFreq("swat", 0), >>> new TermFreq("sweat", 3), >>> }; >>> cl.build(new TermFreqArrayIterator(input)); >>> >>> StringWriter sw = new StringWriter(); >>> Field f = cl.getClass().getDeclaredField("fst"); >>> f.setAccessible(true); >>> FST<?> fst = (FST<?>) f.get(cl); >>> Util.toDot(fst, sw, false, true); >>> >>> for (TermFreq tf : input) { >>> List<LookupResult> lookup = >>> cl.lookup(tf.term.utf8ToString(), true, 1); >>> System.out.println(lookup.get(0)); >>> } >>> >>> System.out.println(sw.toString()); >>> >>> The output is attached. How come the first arcs have such high outputs? >>> >>> Dawid >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >> >> >> >> -- >> lucidimagination.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > -- lucidimagination.com --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
