Eli Reisman created GIRAPH-226:
----------------------------------

             Summary: Proposal for per-Mapper caching of all Writable values 
using existing maven imports
                 Key: GIRAPH-226
                 URL: https://issues.apache.org/jira/browse/GIRAPH-226
             Project: Giraph
          Issue Type: New Feature
            Reporter: Eli Reisman
            Priority: Minor


We already import the Guava library into our Maven repo (see GIRAPH-225) I have 
written a _proposed_ caching system using their very efficient 
com.google.common.cache library. It would exist as a static singleton 
per-Mapper (per JVM) and would be usable by all vertices in a given 
partition/JVM environment by inclusion of an instance field in 
BasicVertex<I,V,E,M> or perhaps as part of a Context etc. (something global to 
each JVM would do.) Through a simple API one could manipulate and "create/get" 
all Writable instances used by that JVM without duplicating object all the 
time. The net effect would be similar to the recent improvement to 
NullWritable, but would cover everything. Please see the patch, it does not 
attempt to inject this cache into its new home yet, just places the files in 
"lib/" for your review and comments.

Experiments to come will reveal whether this is a desperately needed 
improvement or just a detail as far as Giraph scale-out is concerned, but if it 
is, here it is. One caveat (I would be happy to make the minimal changes to 
existing example code/tests and our web instructions) is that the API for using 
Writables would change slightly. All mutation and creation/aquisition of 
Writable instances would be via the cache.getWritable(), which is overridden to 
easily accept all Java types that map to Writables without any work for the 
user. In fact, using this API would eliminate the need to use the "new" 
operator with Writables in any way. Best of all, should a new user author an 
application without using the cache, it would be bloated (as now) memory-wise 
but would not break in the least. There is a better explanation in the code 
comments for GiraphWritableCache, the main file.

One could easily upgrade this to take advantage of generics by using a 
Configuration object to init this cache, and borrowing its <I,V,E,M> class 
object for Writable instantiation, but this would require more overhead within 
the cache itself, and doesn't save much code it turns out because you still 
have to concretely implement the cache loading methods with concrete type 
params. Although the main object contains one sub-cache for each 
Java-to-Writable mapping we use in Giraph/Hadoop, they are instantiated lazily 
and in most vertex implementations would never be instantiated for more than 1 
or 2 of the possible Writables.

ArrayWritable is not supported yet, I will be posting a separate JIRA about 
this. It turns out, ArrayWritable does not play nice with GiraphJob.run() no 
matter how you subclass or manipulate it, and twice now vertex implementations 
of mine have had to store final values in Text or some other unfortunate format 
to express tuples. This would make Multigraphs (as is being discussed currently 
in another Jira by Allessandro) impossible unless fixed. Thanks to Sean Choi 
for pointing me toward this (I think larger) problem. More to follow.

Anyway, a quick morning grep reveals no code in Giraph is using ArrayWritables 
yet anyhow, so for now this doesn't affect the cache. Please look at this code, 
read the comments about use, and please tell me what you think. NO biggie to be 
if we don't use it, but again...here it is if we want it. I look forward to 
hearing from you. 

For the record, I think it would live as a static field in BasicVertex<I,V,E,M> 
or GiraphJob, etc.




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to