Eli Reisman created GIRAPH-226:
----------------------------------
Summary: Proposal for per-Mapper caching of all Writable values
using existing maven imports
Key: GIRAPH-226
URL: https://issues.apache.org/jira/browse/GIRAPH-226
Project: Giraph
Issue Type: New Feature
Reporter: Eli Reisman
Priority: Minor
We already import the Guava library into our Maven repo (see GIRAPH-225). I have
written a _proposed_ caching system using its very efficient
com.google.common.cache library. It would exist as a static singleton
per-Mapper (per JVM) and would be usable by all vertices in a given
partition/JVM environment by inclusion of an instance field in
BasicVertex<I,V,E,M> or perhaps as part of a Context etc. (something global to
each JVM would do.) Through a simple API one could manipulate and "create/get"
all Writable instances used by that JVM without duplicating objects all the
time. The net effect would be similar to the recent improvement to
NullWritable, but would cover everything. Please see the patch; it does not
attempt to inject this cache into its new home yet, just places the files in
"lib/" for your review and comments.
Experiments to come will reveal whether this is a desperately needed
improvement or just a detail as far as Giraph scale-out is concerned, but if it
is, here it is. One caveat (I would be happy to make the minimal changes to
existing example code/tests and our web instructions) is that the API for using
Writables would change slightly. All mutation and creation/acquisition of
Writable instances would be via cache.getWritable(), which is overloaded to
easily accept all Java types that map to Writables without any work for the
user. In fact, using this API would eliminate the need to use the "new"
operator with Writables in any way. Best of all, should a new user author an
application without using the cache, it would be bloated (as now) memory-wise
but would not break in the least. There is a better explanation in the code
comments for GiraphWritableCache, the main file.
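
To illustrate the kind of API described above, here is a minimal sketch. It
is not the GiraphWritableCache from the patch. To stay dependency-free it
makes several assumptions: the class name WritableCacheSketch is hypothetical,
the nested Writable, IntWritable and Text types are immutable stand-ins for
the Hadoop classes of the same names, and a plain ConcurrentHashMap is swapped
in for the Guava com.google.common.cache library that the proposal uses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a per-JVM Writable cache; not the patch's code. */
final class WritableCacheSketch {

  /** Stand-in for org.apache.hadoop.io.Writable (serialization omitted). */
  interface Writable { }

  /** Immutable stand-in for org.apache.hadoop.io.IntWritable. */
  static final class IntWritable implements Writable {
    private final int value;
    IntWritable(int value) { this.value = value; }
    int get() { return value; }
  }

  /** Immutable stand-in for org.apache.hadoop.io.Text. */
  static final class Text implements Writable {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
  }

  // Static maps: one shared copy per JVM, i.e. per Mapper.
  // Assumption: ConcurrentHashMap replaces the Guava cache to avoid imports.
  private static final ConcurrentMap<Integer, IntWritable> INTS =
      new ConcurrentHashMap<>();
  private static final ConcurrentMap<String, Text> TEXTS =
      new ConcurrentHashMap<>();

  private WritableCacheSketch() { }

  /** Overload for int: returns the shared instance for this value. */
  static IntWritable getWritable(int value) {
    return INTS.computeIfAbsent(value, IntWritable::new);
  }

  /** Overload for String: returns the shared instance for this value. */
  static Text getWritable(String value) {
    return TEXTS.computeIfAbsent(value, Text::new);
  }

  public static void main(String[] args) {
    IntWritable a = getWritable(42);
    IntWritable b = getWritable(42);
    // Both calls yield the same object, so no duplicate is allocated.
    System.out.println(a == b); // prints "true"
  }
}
```

In this sketch a vertex never calls "new" on a Writable; to "mutate" a value
it simply asks the cache for the instance that represents the new value.
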
One could easily upgrade this to take advantage of generics by using a
Configuration object to init this cache, and borrowing its <I,V,E,M> class
object for Writable instantiation, but this would require more overhead within
the cache itself, and, it turns out, doesn't save much code because you still
have to concretely implement the cache loading methods with concrete type
params. Although the main object contains one sub-cache for each
Java-to-Writable mapping we use in Giraph/Hadoop, they are instantiated lazily
and in most vertex implementations would never be instantiated for more than 1
or 2 of the possible Writables.
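
The lazy instantiation of sub-caches mentioned above could look roughly like
the following. This is a sketch under stated assumptions, not the code in the
patch: LazySubCacheSketch, IntBox, DoubleBox, lookup() and subCacheCount() are
hypothetical names, the box classes stand in for IntWritable and
DoubleWritable, the cache is an ordinary instance here rather than a static
singleton, and ConcurrentHashMap again replaces the Guava cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: sub-caches are created only when first needed. */
final class LazySubCacheSketch {

  /** Stand-in for IntWritable. */
  static final class IntBox {
    final int value;
    IntBox(int value) { this.value = value; }
  }

  /** Stand-in for DoubleWritable. */
  static final class DoubleBox {
    final double value;
    DoubleBox(double value) { this.value = value; }
  }

  // One sub-cache per Java type; empty until a type is first requested.
  private final Map<Class<?>, Map<Object, Object>> subCaches =
      new ConcurrentHashMap<>();

  // Creates the sub-cache for javaType on first use, then loads or reuses
  // the wrapper object for the given key.
  private <K, W> W lookup(Class<K> javaType, K key, Function<K, W> loader) {
    Map<Object, Object> sub =
        subCaches.computeIfAbsent(javaType, t -> new ConcurrentHashMap<>());
    @SuppressWarnings("unchecked")
    W cached = (W) sub.computeIfAbsent(key, k -> loader.apply(key));
    return cached;
  }

  // The loaders are still written per concrete type, as the text notes.
  IntBox getWritable(int value) {
    return lookup(Integer.class, value, IntBox::new);
  }

  DoubleBox getWritable(double value) {
    return lookup(Double.class, value, DoubleBox::new);
  }

  /** Number of sub-caches that have actually been instantiated. */
  int subCacheCount() {
    return subCaches.size();
  }

  public static void main(String[] args) {
    LazySubCacheSketch cache = new LazySubCacheSketch();
    System.out.println(cache.subCacheCount()); // prints "0"
    cache.getWritable(7);
    cache.getWritable(7);
    System.out.println(cache.subCacheCount()); // prints "1"
  }
}
```

A vertex implementation that only ever uses int values would, in this sketch,
never pay for the double sub-cache.
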
ArrayWritable is not supported yet; I will be posting a separate JIRA about
this. It turns out, ArrayWritable does not play nice with GiraphJob.run() no
matter how you subclass or manipulate it, and twice now vertex implementations
of mine have had to store final values in Text or some other unfortunate format
to express tuples. This would make Multigraphs (as is being discussed currently
in another JIRA by Allessandro) impossible unless fixed. Thanks to Sean Choi
for pointing me toward this (I think larger) problem. More to follow.
Anyway, a quick morning grep reveals no code in Giraph is using ArrayWritables
yet anyhow, so for now this doesn't affect the cache. Please look at this code,
read the comments about use, and please tell me what you think. No biggie to me
if we don't use it, but again... here it is if we want it. I look forward to
hearing from you.
For the record, I think it would live as a static field in BasicVertex<I,V,E,M>
or GiraphJob, etc.