[
https://issues.apache.org/jira/browse/LUCENE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196042#comment-13196042
]
Michael McCandless commented on LUCENE-3725:
--------------------------------------------
Thanks Dawid.
bq. Yes, this is pretty much the top-n nodes reordering that I did, albeit
without any optimization of how many n to take (the hardcoded magic constants
should probably be justified somehow? Or replaced by a default in FST
somewhere?).
bq. Deciding how many nodes to reorder is I think hard – I failed to provide
any sensible fast heuristic for that and I simply do a simulated annealing to
find a local optimum.
Yeah the algo is simplistic now... and it does force caller to pick the params
(though minInCountDeref=3 and maxDerefNodes=Inf are probably pretty good)... we
can and should make it more sophisticated over time. We have at least one
spare bit to still use in the arc flags :)
bq. One thing I was wondering is why you decided to integrate the packer with
the fst – wouldn't it be cleaner to separate packing from construction?
Granted, this requires a double pass over the fst nodes and more intermediate
memory but it wouldn't add any more complexity to the builder which is already,
ehm, a bit complex . You can compare this design in Morfologik:
Well... it's tricky. First, I decided to change node targets to ords so that
pack could use an array to (relatively!) efficiently hold node data, eg
inCount, newAddress, etc. That required making the first pass FST "modal"
(deref'd or not). If we didn't do this then the packer would have to use even
less RAM efficient data structures (eg Map<Int,X>) I think?
Second, the format written by the packer is tightly coupled with the FST
reading, ie there are sizable differences when reading packed vs unpacked FST.
But I agree it'd be cleaner if we could move packing out (eg Util.pack), and
more strongly decouple packing from FST format...
One thing I'd really like to somehow do is create different classes for FST
writing vs reading, and different classes for each format. We now have
write-ord, write-non-ord, read-packed, read-unpacked (and even the two readers
have 3 different modes depending on INPUT_TYPE).
> Add optional packing to FST building
> ------------------------------------
>
> Key: LUCENE-3725
> URL: https://issues.apache.org/jira/browse/LUCENE-3725
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3725.patch, LUCENE-3725.patch, LUCENE-3725.patch,
> Perf.java
>
>
> The FSTs produced by Builder can be further shrunk if you are willing
> to spend highish transient RAM to do so... our Builder today tries
> hard not to use much RAM (and has options to tweak down the RAM usage,
> in exchange for somewhat lager FST), even when building immense FSTs.
> But for apps that can afford highish transient RAM to get a smaller
> net FST, I think we should offer packing.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]