[
https://issues.apache.org/jira/browse/GIRAPH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nitay Joffe updated GIRAPH-684:
-------------------------------
Description:
While working on GIRAPH-683 I realized something: The Python code the user has
to write is fairly cumbersome, because they can't just say setValue(4), they
have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.
The problem is that we have a tight coupling between user types and their
serialization, so the "everything must be Writable" spreads throughout the
codebase.
I think we need to change e.g. Vertex<I extends WritableComparable, V extends
Writable, E extends Writable> to just Vertex<I, V, E>.
For each type we store a SerDe that knows how to serialize/deserialize that
type. If the user passes us a Writable then we use a WritableSerDe. This means
no changes are required to existing code.
Note that the SerDe interface does not allow for using a type like Long
directly. This is by design since immutable types don't work with Giraph.
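A minimal sketch of what such a SerDe could look like, assuming it deserializes
into an existing, reusable object. The names SerDe, readInto, and MutableLong
are hypothetical and are not taken from the patch. Writable is redeclared
locally as a stand-in for org.apache.hadoop.io.Writable so the example compiles
on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SerDeSketch {
  /** Stand-in for org.apache.hadoop.io.Writable so the sketch compiles alone. */
  interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical SerDe: reading fills an existing object instead of returning a new one. */
  interface SerDe<T> {
    void write(T value, DataOutput out) throws IOException;
    void readInto(T reusable, DataInput in) throws IOException;
  }

  /** Delegates to the Writable's own methods, so existing user types work unchanged. */
  static final class WritableSerDe<T extends Writable> implements SerDe<T> {
    @Override public void write(T value, DataOutput out) throws IOException {
      value.write(out);
    }
    @Override public void readInto(T reusable, DataInput in) throws IOException {
      reusable.readFields(in);
    }
  }

  /** Minimal mutable value, shaped like Hadoop's LongWritable. */
  static final class MutableLong implements Writable {
    long value;
    MutableLong(long value) { this.value = value; }
    @Override public void write(DataOutput out) throws IOException {
      out.writeLong(value);
    }
    @Override public void readFields(DataInput in) throws IOException {
      value = in.readLong();
    }
  }

  /** Serializes one value, then deserializes it into a reused instance. */
  static long roundTrip(long input) {
    SerDe<MutableLong> serDe = new WritableSerDe<>();
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serDe.write(new MutableLong(input), new DataOutputStream(bytes));
      MutableLong reusable = new MutableLong(0L);
      serDe.readInto(reusable, new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return reusable.value;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Under this assumed shape, java.lang.Long could not be supported: readInto has
nothing to overwrite on an immutable object.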
The I,V,E,M parameters, in order to get serialized, would need to adhere to one
of the following:
1) Be a type we know how to serialize, e.g. LongWritable.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but
we check if it is and if so we use their code. This makes everything backwards
compatible.
3) The user has registered their own serializer. This lets them serialize
completely new types, for example a fastutil map, without having to subclass
that type to make it Writable.
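The three cases above could be resolved roughly as follows. This is only a
sketch: SerDeRegistrySketch, register, resolve, and readInto are hypothetical
names, the lookup order (registered map first, then the Writable check) is an
assumption, and StringBuilder stands in for a third-party mutable type such as
a fastutil map. Writable is again a local stand-in for the Hadoop interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class SerDeRegistrySketch {
  /** Stand-in for org.apache.hadoop.io.Writable. */
  interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical SerDe contract; not taken from the actual patch. */
  interface SerDe<T> {
    void write(T value, DataOutput out) throws IOException;
    void readInto(T reusable, DataInput in) throws IOException;
  }

  /** Case 2: any Writable is handled by delegating to its own methods. */
  private static final SerDe<Writable> WRITABLE_SERDE = new SerDe<Writable>() {
    @Override public void write(Writable value, DataOutput out) throws IOException {
      value.write(out);
    }
    @Override public void readInto(Writable reusable, DataInput in) throws IOException {
      reusable.readFields(in);
    }
  };

  /** Case 3 example: a mutable JDK type that is not Writable. */
  static final SerDe<StringBuilder> STRING_BUILDER_SERDE = new SerDe<StringBuilder>() {
    @Override public void write(StringBuilder value, DataOutput out) throws IOException {
      out.writeUTF(value.toString());
    }
    @Override public void readInto(StringBuilder reusable, DataInput in) throws IOException {
      reusable.setLength(0);
      reusable.append(in.readUTF());
    }
  };

  /** Cases 1 and 3: built-in and user-registered SerDes share one map. */
  private final Map<Class<?>, SerDe<?>> registered = new HashMap<>();

  <T> void register(Class<T> type, SerDe<T> serDe) {
    registered.put(type, serDe);
  }

  /** Looks up an explicit registration first, then falls back to the Writable check. */
  @SuppressWarnings("unchecked")
  <T> SerDe<T> resolve(Class<T> type) {
    SerDe<?> found = registered.get(type);
    if (found != null) {
      return (SerDe<T>) found;
    }
    if (Writable.class.isAssignableFrom(type)) {
      return (SerDe<T>) (SerDe<?>) WRITABLE_SERDE;
    }
    throw new IllegalArgumentException("No SerDe for " + type.getName());
  }

  /** Registers the StringBuilder SerDe, then round-trips text through it. */
  static String roundTrip(String text) {
    SerDeRegistrySketch registry = new SerDeRegistrySketch();
    registry.register(StringBuilder.class, STRING_BUILDER_SERDE);
    SerDe<StringBuilder> serDe = registry.resolve(StringBuilder.class);
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serDe.write(new StringBuilder(text), new DataOutputStream(bytes));
      StringBuilder reusable = new StringBuilder("stale");
      serDe.readInto(reusable, new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return reusable.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the Writable bound is checked at lookup time rather than on the generic
parameter, a type that is neither registered nor Writable fails only when
resolve is called, not at compile time.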
With this improved API in place, all computation code (and user code in
general) would be much cleaner and simpler. It will also make things like
Jython much more intuitive.
I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10
workers. The difference is insignificant: 311 seconds total time with the
change vs 319 without. The new version is actually faster, but I think that is
mostly just run-to-run variance.
Here is the code: https://reviews.apache.org/r/13306/
was:
While working on GIRAPH-683 I realized something: The Python code the user has
to write is fairly cumbersome, because they can't just say setValue(4), they
have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.
The problem is that we have a tight coupling between user types and their
serialization, so the "everything must be Writable" spreads throughout the
codebase.
I think we need to change e.g. Vertex<I extends WritableComparable, V extends
Writable, E extends Writable> to just Vertex<I extends Comparable, V, E>.
We keep a Map<Class, Serializer> that tells us how to serialize classes. This
map can be initialized with things we know how to serialize, e.g. Long, Double,
and String.
So then the I,V,E,M parameters, in order to get serialized, would need to
adhere to one of the following:
1) Be a type we know how to serialize, e.g. Long.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but
we check if it is and if so we use their code. This makes everything backwards
compatible.
3) The user has registered his own serializer. This lets them serialize
completely new types, for example a fastutil map, without having to subclass
that type to make it Writable.
With this improved API in place, all computation code (and user code in
general) would be much cleaner and simpler. It will also make things like
Jython much more intuitive.
> Improve Writable API
> --------------------
>
> Key: GIRAPH-684
> URL: https://issues.apache.org/jira/browse/GIRAPH-684
> Project: Giraph
> Issue Type: Bug
> Reporter: Nitay Joffe
> Assignee: Nitay Joffe