Github user dalaro commented on the issue:

    https://github.com/apache/incubator-tinkerpop/pull/325
  
    Thanks for the comments and for running the tests.
    
    @okram:
    
    1. I also dislike HadoopPools and would prefer to remove it, but I don't 
know how to do it correctly, even after considering the problem for a while.  
Here are my open questions and observations.
    
       HadoopPools apparently exists to help Vertex/ObjectWritable implement 
Serializable.  They must implement Serializable because they keep showing up in 
`spark.closure.serializer`'s workload, and only JavaSerializer can be used as 
`spark.closure.serializer`.  A natural question here: is it reasonable and 
normal for graph types to appear in these closures in the first place?
    
       Spark has this nifty `spark.kryo.registrator` mechanism to support 
user-registered types with Kryo, and its docs talk at length about setting 
`spark.serializer`.  The docs read as though that were all a user would ever 
need to customize to handle custom data.  `spark.closure.serializer` is barely 
mentioned; I've only seen it in the manual's exhaustive table of config options 
(http://spark.apache.org/docs/latest/configuration.html), and even there the 
docs merely note that JavaSerializer is its only supported value.  There's some 
old list traffic saying KryoSerializer is unsupported because, back when it was 
originally tested, it blew up on pretty much any Java closure, though that 
might have changed since.
    
       I wonder if this indicates some kind of usage antipattern.  Could there 
be an alternate approach to traversal/vertexprogram evaluation in 
SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark 
closures?  That would make the use-case for HadoopPools disappear.  Then again, 
it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`.  
I am full of uncertainty on this point.  I don't have deep enough Spark 
knowledge to judge whether what we're doing now is reasonable.
    
       As long as Object/VertexWritable keep popping up in closures created by 
SparkGraphComputer & friends, we need some way to serialize them and everything 
they refer to, which can apparently involve a lot of graph entities.  We could 
theoretically implement Serializable "manually" (without Kryo/Gryo), but that 
seems like a massive amount of work, potentially inefficient, definitely 
redundant, and adds a huge new bug surface.  I'm against that approach, but 
that's about the only thing I know for certain here.
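       To make the constraint concrete, here's a stdlib-only sketch (the 
`PlainVertex`/`WritableVertex` names are made-up stand-ins, not TinkerPop 
classes) of why a Java-serialized closure fails the moment it captures a 
non-Serializable object -- roughly the position Vertex/ObjectWritable would be 
in under `spark.closure.serializer` without working Serializable support:

       ```java
       import java.io.ByteArrayOutputStream;
       import java.io.IOException;
       import java.io.ObjectOutputStream;
       import java.io.Serializable;
       import java.util.function.Function;

       class ClosureSerializationDemo {

           // Models a Spark closure: a lambda whose target type is Serializable.
           interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

           static class PlainVertex {}                              // not Serializable
           static class WritableVertex implements Serializable {}   // Serializable wrapper

           static boolean javaSerializes(Object o) {
               try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
                   out.writeObject(o);
                   return true;
               } catch (IOException e) {
                   return false;  // e.g. NotSerializableException on captured graph data
               }
           }

           public static void main(String[] args) {
               PlainVertex plain = new PlainVertex();
               WritableVertex writable = new WritableVertex();

               // Each lambda captures the local variable it references, so the
               // captured object rides along in the serialized closure.
               SerializableFunction<Integer, Object> capturesPlain = x -> plain;
               SerializableFunction<Integer, Object> capturesWritable = x -> writable;

               System.out.println(javaSerializes(capturesPlain));     // false
               System.out.println(javaSerializes(capturesWritable));  // true
           }
       }
       ```

       So as long as graph objects end up captured in closures, they (and 
everything reachable from them) must survive plain Java serialization.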
    
    2. In the long term, I think users should set
    
       ```
       spark.serializer=KryoSerializer
       spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
       ```
       and TinkerPop should stop registering, or otherwise caring about, Spark 
and scala-runtime types in its serialization layer.  This PR deliberately 
leaves GryoSerializer untouched, though; that migration doesn't have to be 
forced.
    
      I'm already using the `spark.kryo.registrator` setting in my app, where I 
have an impl that registers all TinkerPop types plus all of my app's types.  I 
can refactor that by extracting a TinkerPopRegistrator that covers the TP types 
(but not my app's).  I probably should have done that at the start.  As it 
stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" 
kit, but the very last assembly steps involving a `spark.kryo.registrator` impl 
are left as an exercise to the user.  I'll try to introduce a 
TinkerPopRegistrator to address that, though I may reconsider if supporting 
custom type registration turns out to be far more complicated than I'm imagining.
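      For concreteness, here's a rough sketch of what such a registrator might 
look like, assuming Spark's `org.apache.spark.serializer.KryoRegistrator` 
interface.  The body is purely illustrative; the real registration set (and any 
explicit registration IDs) would have to mirror GryoMapper's to preserve format 
compatibility:

       ```java
       import com.esotericsoftware.kryo.Kryo;
       import org.apache.spark.serializer.KryoRegistrator;

       // Illustrative sketch only -- not the real registration set.  The actual
       // classes and IDs would need to come from GryoMapper to keep the wire
       // format unchanged.
       public class TinkerPopRegistrator implements KryoRegistrator {
           @Override
           public void registerClasses(final Kryo kryo) {
               // e.g. kryo.register(someTinkerPopClass, someSerializer, someId);
               // ... one registration per TinkerPop type ...
           }
       }
       ```

       Users would then set `spark.serializer` to Spark's own KryoSerializer 
and point `spark.kryo.registrator` at that class.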
    
      I think Gryo more generally is useful and can stay around indefinitely 
(say, for reading/writing data to the filesystem), but GryoSerializer should 
eventually stop being the recommended `spark.serializer`.  If serving as a 
Spark serializer is all GryoSerializer per se exists to do, then it could 
eventually be removed, with any TinkerPop-specific class registrations it 
performs moved to TinkerPopRegistrator.
    
    3.  This PR should maintain full format compatibility.  I deliberately 
tried to keep all of GryoMapper and GryoSerializer's class registrations, class 
IDs, and associated serializers (including StarGraphSerializer) functionally 
unchanged.  If you see somewhere that I unintentionally changed the format, 
that's a bug.
    
    @spmallette:
    
    re. package organization: definitely makes sense; I'll move kryoshim under 
io.gryo.  Please let me know if I bungled the package hierarchy anywhere else.
    
    re. path forward: I think 1 and 2 above touch on both HadoopPools and 
GryoSerializer.

