On 6/8/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
> ----- Original Message -----
> From: "Doğacan Güney" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Friday, June 08, 2007 8:27 PM
> Subject: Re: Loading mechanism of plugin classes and singleton objects
> [...]
>>> This is strange, because, as you can see below, the strings that
>>> make up the keys and values of conf appear unchanged. Perhaps we
>>> should override the equals() method in
>>> org.apache.hadoop.conf.Configuration (invoked by CACHE.get(),
>>> according to the specs of the java.util.Map interface), so that the
>>> hashCode()s of the keys get ignored, and conf1.equals(conf2) returns
>>> true if and only if:
>>>
>>> 1. conf1.size() == conf2.size(),
>>>
>>> 2. for each key k1 of conf1 there is a key k2 in conf2 such that:
>>> 2.1 k1.equals(k2)
>>> 2.2 conf1.get(k1).equals(conf2.get(k2))
>>
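For concreteness, a minimal sketch of the deep comparison proposed
above, treating each Configuration as a Map<String, String> snapshot of
its properties (how that snapshot is obtained from
org.apache.hadoop.conf.Configuration is an assumption left open here):

    import java.util.Map;

    // Sketch only: each Configuration is treated as a snapshot of its
    // properties; obtaining that snapshot is assumed, not shown.
    static boolean deepEquals(Map<String, String> c1, Map<String, String> c2) {
        if (c1.size() != c2.size()) return false;           // condition 1
        for (Map.Entry<String, String> e : c1.entrySet()) { // condition 2
            String v2 = c2.get(e.getKey());                 // 2.1: key present in c2
            if (v2 == null || !v2.equals(e.getValue()))     // 2.2: values equal
                return false;
        }
        return true;
    }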
>> This has been suggested before and I have to say I don't like this
>> one, because it means that each call to PluginRepository.get(conf)
>> will end up comparing all key/value pairs, which, IMO, is excessive
>> (because, if I am not mistaken, we don't need this when running Nutch
>> in a distributed environment). Unfortunately, it may be the only way
>> to fix this leak.
> Well, after all how often can this overhead ever happen? I logged just
> a few calls (admittedly, on short runs) but even thousands of
> occurrences would only consume a negligible amount of CPU time.
There are a couple of places where the overhead would cost us right
now. For example, the current ParseSegment code around line 76:

    parseResult = new ParseUtil(getConf()).parse(content);

This is called for every record, so we would end up doing a lot of
comparisons. Obviously we can fix this (a sketch follows below), but I
don't like the idea of a get method causing this kind of overhead.
Anyway, that is just my personal preference, so if people feel strongly
about it, "deep comparison" is fine with me.
> A more serious problem is that an implementation of equals() that may
> return true even when the two hashCodes differ violates the
> specification of Object.hashCode():
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#hashCode()
> "If two objects are equal according to the equals(Object) method, then
> calling the hashCode method on each of the two objects must produce
> the same integer result."
We can just update Configuration.hashCode() to calculate the hash by
summing the hashCodes of all key/value pairs. That should satisfy the
contract, shouldn't it?
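A sketch of such a hashCode(), again treating the Configuration as a
map of its properties; summing the entry hashes mirrors the
java.util.Map.hashCode() contract, so deep-equal configurations would
hash alike regardless of the insertion order of their properties:

    import java.util.Map;

    // Sketch: sum of the entry hashes, exactly as java.util.Map.hashCode()
    // specifies (each entry hash is key.hashCode() ^ value.hashCode()).
    static int deepHashCode(Map<String, String> props) {
        int h = 0;
        for (Map.Entry<String, String> e : props.entrySet())
            h += e.getKey().hashCode() ^ e.getValue().hashCode();
        return h;
    }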
>> This is probably not a good idea, but here it goes: perhaps we can
>> change the LocalJobRunner.Job.run method. First, it clones the
>> JobConf object (clonedConf). It then runs a MapTask with the original
>> JobConf. Upon completion, it copies everything from clonedConf back
>> to the original JobConf. This way, the original JobConf's hashCode
>> won't change, so there should be no leak.
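A rough sketch of that flow (runMapTask() and copyProperties() are
placeholder names for illustration, not actual Hadoop APIs):

    // Sketch of the suggested flow inside LocalJobRunner.Job.run.
    JobConf clonedConf = new JobConf(job);  // 1. snapshot the original
    runMapTask(job);                        // 2. run with the original (it may get mutated)
    copyProperties(clonedConf, job);        // 3. restore, so job's hashCode ends up unchanged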
> What if we simply rework the PluginRepository.get() method, first
> checking whether a key "deeply" equal to conf (i.e., with the same
> key/value pairs but possibly a different hashCode) is already in
> CACHE, and in that case, if the hashCode differs, using that key
> instead of conf as the parameter of the get() method? (All within a
> synchronized block, of course.) Probably the cleanest way to do that
> is to encapsulate this behavior in a "hashCodeObliviousGet()" method
> of a subclass of Hashtable, and use that subclass for CACHE. That way
> we remain "specs-ly correct", because we won't override common
> methods (get() or equals()) with implementations that don't comply
> with the specs.
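A sketch of what that subclass could look like (Configuration and
PluginRepository stand in for the real Nutch/Hadoop types, and
deepEquals() is the deep comparison sketched earlier, adapted to take
two Configurations):

    import java.util.Hashtable;
    import java.util.Map;

    // Sketch of the proposed CACHE subclass: it adds a method rather
    // than overriding get() or equals(), so no spec is violated.
    class ConfigurationCache extends Hashtable<Configuration, PluginRepository> {
        public synchronized PluginRepository hashCodeObliviousGet(Configuration conf) {
            PluginRepository cached = get(conf);  // fast path: hashCode matches
            if (cached != null) return cached;
            // slow path: linear scan for a key with the same key/value
            // pairs but a different hashCode
            for (Map.Entry<Configuration, PluginRepository> e : entrySet()) {
                if (deepEquals(e.getKey(), conf)) return e.getValue();
            }
            return null;
        }
    }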
> Also: do you agree with me that we should avoid caches based on weak
> references (like the current WeakHashMap) if we want to be protected
> against multiple instances of plugins? I can't see a compelling reason
> for a garbage-collected CACHE, considering that the number of items
> stored in it can't grow that large (how many different configurations
> can users ever create?).
I think so too. When a map task ends and another begins, there will be
no strong references to the configuration object of the previous map
task, so it may be garbage-collected. Nicolas Lichtmaier has a patch
for this that changes the WeakHashMap into a form of LRU map.
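For illustration, one common way to build such an LRU map (a generic
sketch, not Nicolas Lichtmaier's actual patch) is a
java.util.LinkedHashMap in access order:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Generic sketch of "a form of LRU map": a LinkedHashMap iterated
    // in access order that evicts its eldest (least recently used)
    // entry once a fixed capacity is exceeded.
    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true);  // true = access order, not insertion order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }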
BTW, this problem has been discussed before (most recently at
http://www.nabble.com/Plugins-initialized-all-the-time). There is even
an open issue for this: NUTCH-356. I would suggest that we move our
discussion there so that we can all work on this together and fix this
once and for all. I will update the issue with the most recent
discussions.
> Enzo
--
Doğacan Güney