[ 
https://issues.apache.org/jira/browse/LUCENE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196042#comment-13196042
 ] 

Michael McCandless commented on LUCENE-3725:
--------------------------------------------

Thanks Dawid.

bq. Yes, this is pretty much the top-n nodes reordering that I did, albeit 
without any optimization of how many n to take (the hardcoded magic constants 
should probably be justified somehow? Or replaced by a default in FST 
somewhere?).

bq. Deciding how many nodes to reorder is I think hard – I failed to provide 
any sensible fast heuristic for that and I simply do a simulated annealing to 
find a local optimum.

Yeah the algo is simplistic now... and it does force caller to pick the params 
(though minInCountDeref=3 and maxDerefNodes=Inf are probably pretty good)... we 
can and should make it more sophisticated over time.  We have at least one 
spare bit to still use in the arc flags :)

bq. One thing I was wondering is why you decided to integrate the packer with 
the fst – wouldn't it be cleaner to separate packing from construction? 
Granted, this requires a double pass over the fst nodes and more intermediate 
memory but it wouldn't add any more complexity to the builder which is already, 
ehm, a bit complex . You can compare this design in Morfologik:

Well... it's tricky.  First, I decided to change node targets to ords so that 
pack could use an array to (relatively!) efficiently hold node data, eg 
inCount, newAddress, etc.  That required making the first pass FST "modal" 
(deref'd or not).  If we didn't do this then the packer would have to use even 
less RAM efficient data structures (eg Map<Int,X>) I think?

Second, the format written by the packer is tightly coupled with the FST 
reading, ie there are sizable differences when reading packed vs unpacked FST.

But I agree it'd be cleaner if we could move packing out (eg Util.pack), and 
more strongly decouple packing from FST format...

One thing I'd really like to somehow do is create different classes for FST 
writing vs reading, and different classes for each format.  We now have 
write-ord, write-non-ord, read-packed, read-unpacked (and even the two readers 
have 3 different modes depending on INPUT_TYPE).


                
> Add optional packing to FST building
> ------------------------------------
>
>                 Key: LUCENE-3725
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3725
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3725.patch, LUCENE-3725.patch, LUCENE-3725.patch, 
> Perf.java
>
>
> The FSTs produced by Builder can be further shrunk if you are willing
> to spend highish transient RAM to do so... our Builder today tries
> hard not to use much RAM (and has options to tweak down the RAM usage,
> in exchange for somewhat lager FST), even when building immense FSTs.
> But for apps that can afford highish transient RAM to get a smaller
> net FST, I think we should offer packing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to