Lubos Kosco <[email protected]> writes:

> Great job guys, looks awesome :)
> such optimizations would be needed for other analyzers as well ...
> e.g. in one of Doug's pictures there were other analyzers taking up
> a lot of memory too ...
Right, the way the analyzers currently work is sub-optimal in several
ways when it comes to memory usage:

1) They're optimized to minimize reading from disk. But because we do
   two passes per file (one to add it to the Lucene index and one to
   write its xref to disk), the entire file needs to be kept in memory
   to avoid reading it again in the second pass.

2) Each analyzer instance preserves its read buffer for reuse. Because
   of (1), the read buffer will grow to fit the largest file the
   analyzer has seen. So once a big file has been processed, the read
   buffer grows and never shrinks again.

3) The analyzers are cached in thread locals, so the total size of the
   read buffers may grow up to (number of indexer threads * number of
   analyzer classes * size of the largest file).

Some possible experiments we could run to see how they affect both
memory consumption and indexer performance:

a) Read the file from disk again when generating the xref. Then the
   analysis phase doesn't need to make the read buffer fit the entire
   file, so we avoid these enormous read buffers.

b) Remove the caching of analyzer instances in thread locals, and
   instead create a fresh instance whenever one is needed. Then the
   unnecessarily big read buffers can be garbage collected.

-- 
Knut Anders

_______________________________________________
opengrok-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opengrok-dev
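[Editor's note: the retention pattern described in points (1)-(3) can be
sketched in a few lines of Java. This is a minimal illustration under
stated assumptions, not OpenGrok's actual code; `SketchAnalyzer`, `load`,
and `bufferSize` are invented names for the purpose of the example.]

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: an analyzer that keeps one reusable read buffer,
// which grows to fit the largest file it has seen and never shrinks.
class SketchAnalyzer {
    private byte[] buffer = new byte[8 * 1024]; // initial read buffer

    // Load a whole file into the reusable buffer. Pass 1 (indexing) keeps
    // the contents around so pass 2 (xref writing) need not re-read disk.
    void load(byte[] fileContents) {
        if (buffer.length < fileContents.length) {
            buffer = new byte[fileContents.length]; // grows, never shrinks
        }
        System.arraycopy(fileContents, 0, buffer, 0, fileContents.length);
    }

    int bufferSize() {
        return buffer.length;
    }
}

public class Demo {
    // Thread-local caching as described in (3): each indexer thread holds
    // its own analyzer instance for the lifetime of the thread, so the
    // grown buffers are never garbage collected while the thread lives.
    private static final ThreadLocal<SketchAnalyzer> CACHED =
            ThreadLocal.withInitial(SketchAnalyzer::new);

    public static void main(String[] args) {
        SketchAnalyzer a = CACHED.get();
        a.load("small file".getBytes(StandardCharsets.UTF_8));
        System.out.println(a.bufferSize()); // still the initial 8 KiB

        a.load(new byte[50 * 1024 * 1024]); // one 50 MiB file...
        a.load("tiny".getBytes(StandardCharsets.UTF_8));
        System.out.println(a.bufferSize()); // ...and the buffer stays 50 MiB
    }
}
```

Experiment (b) amounts to replacing the `ThreadLocal` cache with
`new SketchAnalyzer()` at each use site, which lets the oversized buffer
become unreachable (and collectible) as soon as the file is processed.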
