[
https://issues.apache.org/jira/browse/GIRAPH-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404665#comment-13404665
]
Eli Reisman commented on GIRAPH-226:
------------------------------------
That sounds great to me, and this is by no means ready to use just a
conversation starter as it seemed easy to put together. I am wondering if an
application (a subclassed Vertex) could keep one of these around as a field and
use it, and if they did not do something to pollute the cache (like using it
one minute and using get/set the next) if it would damaged plumbing on the
lower levels of the run, or if this could be transparent to the code that does
the IO?
Anyway, if it could be used selectively per-application at the author's
discretion, and without hurting anything going on under the hood, then great,
maybe having it around as an added option would be good. If it turns out it
would be trouble, I can pull it entirely. Some runs I have done suggest it
would not, but I am not convinced of that and if its dangerous, then its a bad
idea I agree.
I did think the API should be more type specific in the function names and
expected primitives so I made a quick switch from overridden getWritable
methods where someone's double could accidentally end up a float etc. into
things like getFloatWritable() etc so it would be harder to mess this up. One
good thing is the cache throws exceptions if the types coming in are not whats
expected, so the error messages would be easy for someone to pick up their
mistake right where it happened.
> Proposal for per-Mapper caching of all Writable values using existing maven
> imports
> -----------------------------------------------------------------------------------
>
> Key: GIRAPH-226
> URL: https://issues.apache.org/jira/browse/GIRAPH-226
> Project: Giraph
> Issue Type: New Feature
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Attachments: GIRAPH-226-2.patch
>
>
> We already import the Guava library into our Maven repo (see GIRAPH-225) I
> have written a _proposed_ caching system using their very efficient
> com.google.common.cache library. It would exist as a static singleton
> per-Mapper (per JVM) and would be usable by all vertices in a given
> partition/JVM environment by inclusion of an instance field in
> BasicVertex<I,V,E,M> or perhaps as part of a Context etc. (something global
> to each JVM would do.) Through a simple API one could manipulate and
> "create/get" all Writable instances used by that JVM without duplicating
> object all the time. The net effect would be similar to the recent
> improvement to NullWritable, but would cover everything. Please see the
> patch, it does not attempt to inject this cache into its new home yet, just
> places the files in "lib/" for your review and comments.
> Experiments to come will reveal whether this is a desperately needed
> improvement or just a detail as far as Giraph scale-out is concerned, but if
> it is, here it is. One caveat (I would be happy to make the minimal changes
> to existing example code/tests and our web instructions) is that the API for
> using Writables would change slightly. All mutation and creation/aquisition
> of Writable instances would be via the cache.getWritable(), which is
> overridden to easily accept all Java types that map to Writables without any
> work for the user. In fact, using this API would eliminate the need to use
> the "new" operator with Writables in any way. Best of all, should a new user
> author an application without using the cache, it would be bloated (as now)
> memory-wise but would not break in the least. There is a better explanation
> in the code comments for GiraphWritableCache, the main file.
> One could easily upgrade this to take advantage of generics by using a
> Configuration object to init this cache, and borrowing its <I,V,E,M> class
> object for Writable instantiation, but this would require more overhead
> within the cache itself, and doesn't save much code it turns out because you
> still have to concretely implement the cache loading methods with concrete
> type params. Although the main object contains one sub-cache for each
> Java-to-Writable mapping we use in Giraph/Hadoop, they are instantiated
> lazily and in most vertex implementations would never be instantiated for
> more than 1 or 2 of the possible Writables.
> ArrayWritable is not supported yet, I will be posting a separate JIRA about
> this. It turns out, ArrayWritable does not play nice with GiraphJob.run() no
> matter how you subclass or manipulate it, and twice now vertex
> implementations of mine have had to store final values in Text or some other
> unfortunate format to express tuples. This would make Multigraphs (as is
> being discussed currently in another Jira by Allessandro) impossible unless
> fixed. Thanks to Sean Choi for pointing me toward this (I think larger)
> problem. More to follow.
> Anyway, a quick morning grep reveals no code in Giraph is using
> ArrayWritables yet anyhow, so for now this doesn't affect the cache. Please
> look at this code, read the comments about use, and please tell me what you
> think. NO biggie to be if we don't use it, but again...here it is if we want
> it. I look forward to hearing from you.
> For the record, I think it would live as a static field in
> BasicVertex<I,V,E,M> or GiraphJob, etc.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira