----- Original Message -----
From: "Doğacan Güney" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, June 08, 2007 8:27 PM
Subject: Re: Loading mechnism of plugin classes and singleton objects
[...]
This is strange, because, as you can see below, the strings that
make up the keys and values of conf appear unchanged. Perhaps we should
override the equals() method in org.apache.hadoop.conf.Configuration
(invoked by CACHE.get(), according to the specs of the java.util.Map
interface), so that the hashCode()s of the keys get ignored and
conf1.equals(conf2) returns true if and only if:
1. conf1.size() == conf2.size(), and
2. for each key k1 of conf1 there is a key k2 in conf2 such that:
2.1 k1.equals(k2)
2.2 conf1.get(k1).equals(conf2.get(k2))
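For illustration, here is roughly what that comparison rule would look like,
sketched against plain String-to-String maps rather than the real
Configuration class (deepEquals is just an illustrative name, not an
existing Nutch or Hadoop method):

import java.util.Map;

public class ConfEquality {

  // Hypothetical helper: returns true iff the two configurations hold
  // exactly the same key/value pairs, regardless of the containing
  // objects' hashCodes.
  static boolean deepEquals(Map<String, String> conf1,
                            Map<String, String> conf2) {
    if (conf1.size() != conf2.size()) {
      return false;                              // rule 1
    }
    for (Map.Entry<String, String> e : conf1.entrySet()) {
      String other = conf2.get(e.getKey());      // rule 2.1: k2 with k1.equals(k2)
      if (other == null || !other.equals(e.getValue())) {
        return false;                            // rule 2.2: values must match too
      }
    }
    return true;
  }
}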
This has been suggested before, and I have to say I don't like this
one, because it means that each call to PluginRepository.get(conf)
will end up comparing all key/value pairs, which, IMO, is excessive
(because, if I am not mistaken, we don't need this when running Nutch
in a distributed environment). Unfortunately, this may be the only way
to fix this leak.
Well, after all, how often would this overhead actually be incurred? I
logged just a few calls (admittedly, on short runs), but even thousands of
occurrences would only consume a negligible amount of CPU time.
A more serious problem is that an implementation of equals() that can return
true even when the two hashCodes differ violates the contract of
Object.hashCode():
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#hashCode()
"If two objects are equal according to the equals(Object) method, then
calling the hashCode method on each of the two objects must produce the same
integer result."
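To make the consequence concrete, here is a small made-up example: a key
type whose equals() can return true while the hashCodes differ, which is
enough to make a hash-based lookup miss an entry that is "equal" to the
probe:

import java.util.Hashtable;

public class BrokenEqualsDemo {

  // Deliberately broken key type, invented for this example: equals()
  // ignores the field that hashCode() is computed from, which is exactly
  // the kind of contract violation quoted above.
  static class Key {
    final String name;  // compared by equals()
    final int salt;     // used by hashCode() only
    Key(String name, int salt) { this.name = name; this.salt = salt; }
    public boolean equals(Object o) {
      return o instanceof Key && ((Key) o).name.equals(name);
    }
    public int hashCode() { return salt; }
  }

  public static void main(String[] args) {
    Hashtable<Key, String> cache = new Hashtable<Key, String>();
    cache.put(new Key("conf", 1), "repository");
    // equals() considers the two keys equal, but their hashCodes differ,
    // so the table probes the wrong bucket and the lookup misses.
    System.out.println(cache.get(new Key("conf", 2)));  // prints null
  }
}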
This is probably not a good idea, but here it goes: perhaps we can
change the LocalJobRunner.Job.run() method. First, it clones the JobConf
object (clonedConf). It then runs a MapTask with the original
JobConf. Upon completion, it copies everything from clonedConf back into
the original JobConf. This way, the original JobConf's hashCode won't
change, so there should be no leak.
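If I understand the idea correctly, the save/restore dance would look
roughly like this (sketched with a plain map standing in for JobConf, and
runMapTask() as a placeholder, since I haven't checked the exact API):

import java.util.HashMap;
import java.util.Map;

public class SaveRestoreSketch {

  // Placeholder for whatever the task actually does; the only point is
  // that running it mutates the configuration it was given.
  static void runMapTask(Map<String, String> conf) {
    conf.put("some.property", "set by the task");
  }

  static void runJob(Map<String, String> conf) {
    // 1. Snapshot the configuration before the task can touch it.
    Map<String, String> clonedConf = new HashMap<String, String>(conf);
    // 2. Run the task against the original conf, as today.
    runMapTask(conf);
    // 3. Copy the snapshot back so the original conf (and its hashCode)
    //    ends up unchanged and cache lookups keyed on it keep working.
    conf.clear();
    conf.putAll(clonedConf);
  }
}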
What if we simply rework the PluginRepository.get() method, checking first
whether a key "deeply" equal to conf (i.e., with the same key/value pairs
but possibly a different hashCode) is already in CACHE, and in that case,
if the hashCode is different, using that key as the parameter of the get()
method in place of conf? (All of this within a synchronized block, of
course.) Probably the cleanest way to do that is to encapsulate this
behavior in a "hashCodeObliviousGet()" method of a subclass of Hashtable,
and to use that subclass for CACHE. That way we stay "specs-ly correct",
because we won't override common methods (get() or equals()) with
implementations that are not compliant with the specs.
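Something along these lines, as a rough sketch only (the real keys would be
Configuration objects, represented here by plain String maps, and the class
name is just made up):

import java.util.Hashtable;
import java.util.Map;

// Rough sketch of the proposed CACHE type.
class HashCodeObliviousCache<V> extends Hashtable<Map<String, String>, V> {

  public synchronized V hashCodeObliviousGet(Map<String, String> conf) {
    V value = super.get(conf);   // fast path: hashCode still matches
    if (value != null) {
      return value;
    }
    // Slow path: scan for a key holding the same key/value pairs,
    // ignoring the hashCode it was stored under.
    for (Map.Entry<Map<String, String>, V> entry : entrySet()) {
      if (entry.getKey().equals(conf)) {
        return entry.getValue();
      }
    }
    return null;
  }
}

PluginRepository.get(conf) would then call hashCodeObliviousGet(conf) first,
and create (and put()) a new repository only when that returns null.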
Also: do you agree with me that we should avoid caches based on weak
references (like the current WeakHashMap) if we want to be protected against
multiple instances of plugins? I can't see a compelling reason for a
garbage-collected CACHE, considering that the number of items stored in it
can't grow that large (how many different configurations can users ever
create?)
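For what it's worth, the failure mode I have in mind is easy to show in
isolation (System.gc() is only a hint, so this is illustrative rather than
deterministic):

import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo {
  public static void main(String[] args) throws InterruptedException {
    Map<Object, String> cache = new WeakHashMap<Object, String>();
    cache.put(new Object(), "plugin repository");  // key only weakly reachable
    System.gc();                                   // only a hint to the collector
    Thread.sleep(100);
    // Once the key has been collected, the entry silently disappears, and
    // the next PluginRepository.get() would build a second instance.
    System.out.println(cache.size());              // very likely prints 0
  }
}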
Enzo