Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325

Thanks for the comments and for running the tests.

@okram:

1. I also dislike HadoopPools and would prefer to remove it, but I don't know how to do that correctly, even after considering the problem for a while. Here are my open questions and observations.

HadoopPools apparently exists to help Vertex/ObjectWritable implement Serializable. They must implement Serializable because they keep showing up in `spark.closure.serializer`'s workload, and only JavaSerializer can be used as `spark.closure.serializer`.

A natural question here: is it reasonable and normal for graph types to appear in these closures in the first place? Spark has the nifty `spark.kryo.registrator` mechanism to support user-registered types with Kryo, and its docs talk at length about setting `spark.serializer`. The docs read as though that were all the user would need to customize to deal with custom data. `spark.closure.serializer` seems to be barely mentioned; I've only seen it in the manual's exhaustive table of config options (http://spark.apache.org/docs/latest/configuration.html), and even there the docs merely note that JavaSerializer is its only supported value. There's some old list traffic saying KryoSerializer is unsupported because, back when it was originally tested, it blew up on pretty much any Java closure, though that might have changed since.

I wonder if this indicates some kind of usage antipattern. Could there be an alternate approach to traversal/vertexprogram evaluation in SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark closures? That would make the use case for HadoopPools disappear. Then again, it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`. I am full of uncertainty on this point; I don't have deep enough Spark knowledge to judge whether what we're doing now is reasonable.
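To make the constraint concrete, here is a minimal stdlib-only sketch (not from the PR; `PlainVertex` is a hypothetical stand-in, not a real TinkerPop class) showing why anything captured by a closure must be Serializable under Java serialization, which is what JavaSerializer delegates to:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureCaptureDemo {

    // A functional interface that is also Serializable, which is roughly the
    // position closures are in when Spark serializes them with JavaSerializer.
    interface SerFn<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical stand-in for a graph type that does NOT implement
    // Serializable (illustrative only, not a TinkerPop class).
    static class PlainVertex {
        final long id;
        PlainVertex(final long id) { this.id = id; }
    }

    // Attempt Java serialization; report success or failure.
    static boolean trySerialize(final Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (IOException e) {
            return false; // NotSerializableException is an IOException
        }
    }

    // A closure capturing only serializable state round-trips fine.
    public static boolean capturingSerializable() {
        final long offset = 7L;
        final SerFn<Long, Long> fn = x -> x + offset;
        return trySerialize(fn);
    }

    // Capturing a non-Serializable object makes serialization fail, which is
    // why types dragged into closures end up needing to implement Serializable.
    public static boolean capturingNonSerializable() {
        final PlainVertex v = new PlainVertex(42L);
        final SerFn<Long, Long> fn = x -> x + v.id;
        return trySerialize(fn);
    }

    public static void main(final String[] args) {
        System.out.println("capturing serializable state: " + capturingSerializable());
        System.out.println("capturing non-serializable state: " + capturingNonSerializable());
    }
}
```

An innocuous-looking `rdd.map(x -> whatever(x))` puts the lambda (and everything it captures) in exactly the position of `SerFn` here, which is how graph types end up in `spark.closure.serializer`'s workload.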
As long as Object/VertexWritable keep popping up in closures created by SparkGraphComputer & friends, we need some way to serialize them and everything they refer to, which can apparently involve a lot of graph entities. We could theoretically implement Serializable "manually" (without Kryo/Gryo), but that seems like a massive amount of work: potentially inefficient, definitely redundant, and a huge new bug surface. I'm against that approach, but that's about the only thing I know for certain here.

2. In the long term, I think users should set

```
spark.serializer=KryoSerializer
spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
```

and TinkerPop should stop registering/caring about Spark and scala-runtime types in its serialization layer. This PR tries to keep GryoSerializer untouched, though; that migration doesn't have to be forced.

I'm already using the `spark.kryo.registrator` setting in my app, where I have an impl that registers all TinkerPop types plus all of my app's types. I can refactor that by extracting a TinkerPopRegistrator that covers the TP types (but not my app's). I probably should have done that at the start. As it stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" kit, but the very last assembly steps involving a `spark.kryo.registrator` impl are left as an exercise for the user. I'll try to introduce a TinkerPopRegistrator to address that, though I may reconsider if supporting custom type registration turns out to be far more complicated than I'm imagining.

I think Gryo more generally is useful and can stay around indefinitely (say, for reading/writing data to the filesystem), but GryoSerializer should eventually stop being the recommended `spark.serializer`. If that is all GryoSerializer per se exists to do, then it could be removed and any TinkerPop-specific class registrations it performs moved to TinkerPopRegistrator.

3. This PR should maintain full format compatibility.
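For the record, here is a rough sketch of what such a registrator might look like, assuming Spark's `KryoRegistrator` interface and Kryo on the classpath. The class names, serializer, and the explicit ID are illustrative placeholders only, not GryoMapper's actual registry:

```java
import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

// Illustrative sketch only: a real TinkerPopRegistrator would mirror
// GryoMapper's full registry of classes, IDs, and custom serializers.
public class TinkerPopRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(final Kryo kryo) {
        // Plain registration: Kryo assigns the next available ID.
        kryo.register(org.apache.tinkerpop.gremlin.structure.util.star.StarGraph.class);

        // Registering with an explicit ID pins the wire format, which is
        // what keeping class IDs unchanged (point 3) is about, e.g.:
        // kryo.register(SomeGraphType.class, new SomeGraphTypeSerializer(), 1024);
    }
}
```

Users would then point `spark.kryo.registrator` at this class alongside `spark.serializer=org.apache.spark.serializer.KryoSerializer`, and could subclass or wrap it to add their own application types.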
I deliberately tried to keep all of GryoMapper's and GryoSerializer's class registrations, class IDs, and associated serializers (including StarGraphSerializer) functionally unchanged. If you see somewhere that I unintentionally changed the format, that's a bug.

@spmallette:

re. package organization: definitely makes sense; I'll move kryoshim under io.gryo. Please let me know if I bungled the package hierarchy anywhere else.

re. path forward: I think points 1 and 2 above touch on both HadoopPools and GryoSerializer.