Enrico Triolo wrote:
Using a profiler (specifically, the NetBeans profiler), I found that a new LanguageIdentifier instance is created for each submitted URL and never released. With the memory inspector tool I can see as many instances of LanguageIdentifier and NGramProfile$NGramEntry as there are fetched pages, each of them occupying about 180 KB. Forcing garbage collection doesn't release much memory.
Yes, this looks like a bug. A single instance of LanguageIdentifier per task should be cached in the job "context" (i.e. Configuration instance), to avoid too many instantiations.
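A minimal sketch of that kind of per-configuration caching, in plain Java so it runs without Hadoop on the classpath. Conf, Identifier and IdentifierCache are hypothetical stand-ins (for Configuration and LanguageIdentifier respectively), not the actual Nutch classes:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for Hadoop's Configuration (assumption, not the real class).
class Conf {}

// Hypothetical stand-in for an expensive object such as LanguageIdentifier.
class Identifier {
    static int created = 0;              // counts constructions, to make caching visible
    Identifier(Conf conf) { created++; } // the real object would load its n-gram profiles here
}

class IdentifierCache {
    // Weak keys: an entry can be collected once its Conf is no longer referenced elsewhere.
    private static final Map<Conf, Identifier> CACHE = new WeakHashMap<>();

    // Returns the one Identifier associated with this Conf, building it on first use.
    static synchronized Identifier get(Conf conf) {
        return CACHE.computeIfAbsent(conf, Identifier::new);
    }
}
```

With this shape, repeated calls to IdentifierCache.get(conf) with the same conf return the same object, so the cost is paid once per task rather than once per fetched page. Note that the weak key only helps if the cached value does not itself hold a strong reference to the Conf.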
Since I was still getting some strange results with the profiler, I added a println message in the getInstance method to monitor singleton creation. It turns out that the singleton is re-instantiated each time! I can't really understand why this is happening; maybe it is something related to Hadoop internals?
I remember a similar situation I had, where instance variables were not initialized after the object was created with Class.newInstance(). VM bug? Not sure... I didn't track it down at the time; I simply moved the variable initialization to setConf(), which solved my problem.
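Roughly what I mean by moving the initialization, as a sketch only. Configurable here is a simplified, hypothetical interface that takes a Map instead of a Hadoop Configuration, and LangPlugin, PluginFactory and the "lang.default" key are made up for illustration. The factory uses getDeclaredConstructor().newInstance(), since Class.newInstance() is deprecated in recent JDKs:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical version of a Configurable-style interface.
interface Configurable {
    void setConf(Map<String, String> conf);
}

class LangPlugin implements Configurable {
    private Map<String, String> conf;
    private Map<String, Integer> profiles;  // deliberately NOT initialized at declaration

    public LangPlugin() {}                  // no-arg constructor, required for reflective creation

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
        // All state is built here, so it exists no matter how the object was created.
        this.profiles = new HashMap<>();
        this.profiles.put(conf.getOrDefault("lang.default", "en"), 0);
    }

    boolean isReady() { return profiles != null; }
}

class PluginFactory {
    // Creates the object reflectively, then configures it before handing it out.
    static <T extends Configurable> T create(Class<T> type, Map<String, String> conf) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            instance.setConf(conf);
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}
```

The point is only that nothing depends on field initializers or constructor side effects: whatever creates the object, it is usable as soon as setConf() has been called.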
-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
