[GitHub] tinkerpop issue #616: TINKERPOP-1680 Add string performance options to StarG...
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/616 I force-pushed this PR's branch. The new basis is 4f7b8595bbcd691649e63e49deb9ed0adcaa81e7 (May 28). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] tinkerpop issue #616: TINKERPOP-1680 Add string performance options to StarG...
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/616 @spmallette maybe -- basis of the PR branch is b755788c6fe7b902a3e81f3c2f11eb5a8b02c4be (May 11). I'll rebase on latest tp32 and force-push the PR branch.
[GitHub] tinkerpop pull request #616: TINKERPOP-1680 Add string performance options t...
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/616

TINKERPOP-1680 Add string performance options to StarGraph

By default, StarGraph invokes intern() on vertex label and property key strings. It also toString()s vertex and edge IDs for comparisons and existence checks. This commit introduces a new StarGraph builder with a pair of boolean options controlling these behaviors.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1680

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/616.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #616

commit 0155665162668f68713bc4fe5a95678a6905c72b
Author: Dan LaRocque
Date: 2017-04-04T23:41:29Z

TINKERPOP-1680 Add string performance options to StarGraph

By default, StarGraph invokes intern() on vertex label and property key strings. It also toString()s vertex and edge IDs for comparisons and existence checks. This commit introduces a new StarGraph builder with a pair of boolean options controlling these behaviors.
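For context, the space/time tradeoff behind StarGraph's intern() default can be illustrated with plain JDK strings. This sketch uses no TinkerPop types, and the builder option names in the actual PR may differ; it only shows why interning repeated labels/keys saves memory and enables reference comparison, at the cost of an intern-table lookup per insertion:

```java
// Demonstrates the behavior StarGraph relies on by default: intern()
// canonicalizes equal strings into one shared instance, so repeated vertex
// labels and property keys cost one object instead of many.
public class InternSketch {
    public static void main(String[] args) {
        // Two distinct String objects with equal contents, as a deserializer
        // materializing many star vertices would produce.
        String label1 = new String("person");
        String label2 = new String("person");

        // Distinct instances: equal by value, not by reference.
        boolean sameInstance = (label1 == label2);            // false
        boolean sameValue = label1.equals(label2);            // true

        // After interning, both resolve to the same canonical instance,
        // so a == reference check suffices.
        boolean internedSame = (label1.intern() == label2.intern());

        System.out.println("interned-same=" + (internedSame && sameValue && !sameInstance));
    }
}
```

Disabling interning (as the PR's new builder option allows) trades that deduplication away to skip the intern-table lookup on hot insertion paths.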
[GitHub] tinkerpop pull request #497: Limit JVM system props passed around in Spark j...
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/497

Limit JVM system props passed around in Spark jobs

Prior to this commit, SGC (SparkGraphComputer) indiscriminately set the entire job config as JVM system properties on Spark executors. It didn't account for the fact that some config values (e.g. spark.job.description) could have spaces. A value with a space wouldn't get quoted. This led to Spark workers failing to start, because part of the unquoted value would be erroneously interpreted as the JVM main class.

This commit makes SGC only pass two config settings as system props:

* gremlin.io.registry
* gremlin.io.kryoShimService

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1389-jvm-arg-fix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/497.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #497

commit 69877a20ceb9156ecf8ed7062ba3b64f56ab9134
Author: Dan LaRocque
Date: 2016-11-17T16:33:58Z

Limit JVM system props passed around in Spark jobs

Prior to this commit, SGC indiscriminately set the entire job config as JVM system properties on Spark executors. It didn't account for the fact that some config values (e.g. spark.job.description) could have spaces. A value with a space wouldn't get quoted. This led to Spark workers failing to start, because part of the unquoted value would be erroneously interpreted as the JVM main class.

This commit makes SGC only pass two config settings as system props:

* gremlin.io.registry
* gremlin.io.kryoShimService
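The failure mode described above can be reproduced in miniature without Spark: when a property value containing a space is interpolated into a java command line without quoting, whitespace tokenization shifts every later token's meaning, and the word after the break lands where the main class is expected. A hypothetical sketch (the description value and class name are made up):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the unquoted-system-property bug: a config value with a space,
// pasted into a command line, is split into extra tokens.
public class UnquotedPropSketch {
    public static void main(String[] args) {
        // Hypothetical value of spark.job.description containing a space.
        String desc = "count components";
        String cmd = "java -Dspark.job.description=" + desc + " com.example.Main";

        // A launcher splitting naively on whitespace (as a shell does for an
        // unquoted value) sees four tokens instead of three.
        List<String> tokens = Arrays.asList(cmd.split(" "));
        System.out.println(tokens.size());

        // The word after the break -- "components" -- is the first
        // non-option token, so the JVM would try to run it as the main class.
        System.out.println(tokens.get(2));
    }
}
```

Whitelisting the two space-free gremlin.io.* keys, as the PR does, sidesteps the quoting problem entirely.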
[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/462 I made a version of this change for TINKERPOP-1389 as #466. That one's optional, but I thought merging this into TINKERPOP-1389 might help ensure this fix does not get lost in the eventual possibly-conflict-y and confusing merge of TINKERPOP-1389 to master.
[GitHub] tinkerpop pull request #466: TINKERPOP-1525 Avoid starting VP worker iterati...
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/466

TINKERPOP-1525 Avoid starting VP worker iterations that never end (Spark 2.0 version)

This is exactly like #462, except that it tracks a switch between Spark 1.6 and 2.0 away from functions that manipulate iterables to those that manipulate iterators. Assuming #462 eventually gets into master, and assuming that TINKERPOP-1389 eventually merges with master, the second merge will conflict. It still seems marginally safer to make this change in parallel on TINKERPOP-1389 and master/tp32 than just the latter, since the conflict will look more like "oh i better keep one of these two almost-identical edge-case checks" than "oh the Spark 1.x branch had some silly edge case check that I can just delete for 2.0".

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1525-for-TINKERPOP-1389

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/466.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #466

commit 2a8f741190beebd7b7e6a9ff7922afb9b6807fa5
Author: Dan LaRocque
Date: 2016-10-26T00:37:17Z

Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way that an empty RDD partition would cause it to invoke VertexProgram.workerIterationStart without ever invoking VertexProgram.workerIterationEnd. This seems like a contract violation. I have at least one VP that relies on workerIterationStart|End to allocate and release resources. Failing to invoke End like this causes a leak in that VP, as it would for any VP that uses that resource management pattern.

(cherry picked from commit 36e1159a80f539b8bd4a884e5c1cf304ec52c4f9; this is the same change, except it tracks a switch between Spark 1.6 and 2.0 away from functions that manipulate iterables to those that manipulate iterators)
[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/462 I force-pushed this branch with a fix (`emptyIterator -> emptyList`).
[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/462 Marko's right, I wrote and tested this against TINKERPOP-1389 (before the latest rebase), then cherry-picked against current master without retesting. Sorry about that. He's also right about the Iterator/Iterable change. I'll change it sometime today, force-push the branch, and comment.
[GitHub] tinkerpop pull request #462: TINKERPOP-1525 Avoid starting VP worker iterati...
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/462

TINKERPOP-1525 Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way that an empty RDD partition would cause it to invoke VertexProgram.workerIterationStart without ever invoking VertexProgram.workerIterationEnd. This seems like a contract violation. I have at least one VP that relies on workerIterationStart|End to allocate and release resources. Failing to invoke End like this causes a leak in that VP, as it would for any VP that uses that resource management pattern.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1525-tp32

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/462.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #462

commit 1fc6ec5d7c20d6163c61f84198e3e68457e23eae
Author: Dan LaRocque
Date: 2016-10-21T21:04:30Z

Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way that an empty RDD partition would cause it to invoke VertexProgram.workerIterationStart without ever invoking VertexProgram.workerIterationEnd. This seems like a contract violation. I have at least one VP that relies on workerIterationStart|End to allocate and release resources. Failing to invoke End like this causes a leak in that VP, as it would for any VP that uses that resource management pattern.
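The contract at issue, that every workerIterationStart must be balanced by a workerIterationEnd even when a partition is empty, can be sketched with a plain-Java stand-in for a vertex program. The VertexProgram interface and Spark types are elided and the names below are illustrative; note also that the actual PR avoids invoking Start at all on empty partitions rather than using try/finally, so this sketch only illustrates the invariant, not the PR's exact fix:

```java
import java.util.Collections;
import java.util.Iterator;

// Sketch of the start/end pairing bug. A resource-holding "vertex program"
// stand-in counts unbalanced Start calls; a leak shows up as a nonzero
// count after processing an empty partition.
public class WorkerIterationSketch {
    static int openResources = 0;

    static void workerIterationStart() { openResources++; }   // e.g. allocate a pool
    static void workerIterationEnd()   { openResources--; }   // e.g. release it

    // Buggy shape: bailing out on an empty partition skips End.
    static void processBuggy(Iterator<String> partition) {
        workerIterationStart();
        if (!partition.hasNext()) return;   // leak: End never runs
        while (partition.hasNext()) partition.next();
        workerIterationEnd();
    }

    // Balanced shape: try/finally guarantees End for every Start.
    static void processFixed(Iterator<String> partition) {
        workerIterationStart();
        try {
            while (partition.hasNext()) partition.next();
        } finally {
            workerIterationEnd();
        }
    }

    public static void main(String[] args) {
        processBuggy(Collections.emptyIterator());
        System.out.println("after-buggy=" + openResources);   // one leaked resource

        openResources = 0;
        processFixed(Collections.emptyIterator());
        System.out.println("after-fixed=" + openResources);
    }
}
```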
[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/372 I'm closing this not because the conversation is settled and everything's done, but because this PR is superseded by #377. We can keep discussing here if you like. That's probably easier than juggling three PRs.
[GitHub] tinkerpop pull request #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro closed the pull request at: https://github.com/apache/tinkerpop/pull/372
[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/372

I opened two PRs (one for tp31 as Stephen requested, and a new one for master):

* #377 (master)
* #378 (tp31)

As far as I can tell, GraphFilter did not exist in 3.1, and it may not even be possible to induce the buggy traversal behavior described in #372 (which was all based on master / 3.2). The underlying datastructure corruption -- putting a StarOutEdge into the inEdges map -- does exist on 3.1, and @okram's addEdge fix applies cleanly on 3.1, but without applyGraphFilter, I'm not sure it even matters for correctness on 3.1.
[GitHub] tinkerpop pull request #378: TINKERPOP-1397 Fix StarGraph.addEdge
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/378

TINKERPOP-1397 Fix StarGraph.addEdge

See:

* old PR #372
* JIRA issue https://issues.apache.org/jira/browse/TINKERPOP-1397

As far as I can tell, GraphFilter did not exist in 3.1, and it may not even be possible to induce the buggy traversal behavior described in #372 (which was all based on master / 3.2). The underlying datastructure corruption -- putting a StarOutEdge into the `inEdges` map -- does exist on 3.1, and @okram's `addEdge` fix applies cleanly on 3.1, but without `applyGraphFilter`, I'm not sure it even matters for correctness on 3.1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1397-tp31

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/378.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #378

commit 0022b7f6be25eb7d3c778b137beb6e8a7d2784ca
Author: Dan LaRocque
Date: 2016-08-10T22:52:13Z

TINKERPOP-1397 Fix StarGraph.addEdge

For self-loops, StarGraph.addEdge used to put a single StarOutEdge into both its inEdges and outEdges maps, potentially causing problems in applyGraphFilter. This change makes StarGraph.addEdge put the appropriate type of edge (Star(Out/In)Edge) in the associated map. The IDs for each edge instance are kept in agreement. This change is @okram's, who suggested it in PR #372. I merely reviewed it and added a couple of comments.
[GitHub] tinkerpop pull request #377: TINKERPOP-1397 StarGraph filtering with self ed...
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/377

TINKERPOP-1397 StarGraph filtering with self edges

See:

* old PR #372
* JIRA issue https://issues.apache.org/jira/browse/TINKERPOP-1397

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1397

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/377.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #377

commit 0022b7f6be25eb7d3c778b137beb6e8a7d2784ca
Author: Dan LaRocque
Date: 2016-08-10T22:52:13Z

TINKERPOP-1397 Fix StarGraph.addEdge

For self-loops, StarGraph.addEdge used to put a single StarOutEdge into both its inEdges and outEdges maps, potentially causing problems in applyGraphFilter. This change makes StarGraph.addEdge put the appropriate type of edge (Star(Out/In)Edge) in the associated map. The IDs for each edge instance are kept in agreement. This change is @okram's, who suggested it in PR #372. I merely reviewed it and added a couple of comments.

commit 7842c4e7e2e3c2b33fc1bca7a8a80e165080bc59
Author: Dan LaRocque
Date: 2016-08-10T22:59:54Z

TINKERPOP-1397 Add StarGraph bothE filtering test

commit b4aaa69b9c10a61c7ab2f7476c90e4728ef032e8
Author: Dan LaRocque
Date: 2016-08-10T23:24:10Z

Merge branch 'TINKERPOP-1397-tp31' into TINKERPOP-1397
[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/372

I pulled master (@ fc79de813c788ee141175bbf1ea6d0cdefa94a0f) and confirmed that the bug still existed. I didn't apply your proposed change right away. Here's a `gremlin.sh` session that shows the buggy behavior. I'm also pushing a unit test that I developed first, but it's easier to see what's going wrong in this `gremlin.sh` session than in the source code for a unit test.

```
$ `find gremlin-console/target -name gremlin.sh`

         \,,,/
         (o o)
     -oOOo-(3)-oOOo-
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> sg = org.apache.tinkerpop.gremlin.structure.util.star.StarGraph.open()
==>stargraph[starOf:null]
gremlin> v = sg.addVertex('name', 'foo')
==>v[0]
gremlin> v.addEdge('self', v)
==>e[2][0-self->0]
gremlin> sg.traversal().V().inE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]

// NOTE: StarOutEdge in the inEdges datastructure
gremlin> v.inEdges.values().iterator().next().get(0).getClass()
==>class org.apache.tinkerpop.gremlin.structure.util.star.StarGraph$StarOutEdge

gremlin> filter = new org.apache.tinkerpop.gremlin.process.computer.GraphFilter()
==>graphfilter[none]
gremlin> filter.setEdgeFilter(__.bothE("self"))
==>null
gremlin> sg.applyGraphFilter(filter)
==>Optional[stargraph[starOf:v[0]]]

// NOTE: no inE results, double outE results after GraphFilter is applied
gremlin> sg.traversal().V().inE('self')
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]
gremlin>
```

There are a couple of things worth noting above:

* the `inEdges` datastructure contains a `StarOutEdge` when we construct the graph this way
* the final post-filtering `inE` and `outE` are incorrect (an indirect consequence of the first point)

These steps produce different behavior depending on whether I invoke `StarGraph.open()` or call `StarGraph.of()`. If I do the latter, I get a `StarInEdge` in the `inEdges` map and a `StarOutEdge` in the `outEdges` map, and in turn avoid this particular filtering bug.

**Good news**: your change makes sense and fixes the bug I initially reported (both-filtering).

**Bad news**: I think there are more graph filtering bugs on StarGraph that can create half-edges. When I filter a self-loop star graph for either in or out, I get stuff like this (symmetrical for `.inE`):

```
gremlin> filter = new org.apache.tinkerpop.gremlin.process.computer.GraphFilter()
==>graphfilter[none]
gremlin> filter.setEdgeFilter(__.outE("self"))
==>null
gremlin> sg.applyGraphFilter(filter)
==>Optional[stargraph[starOf:v[0]]]
gremlin> sg.traversal().V()
==>v[0]
gremlin> sg.traversal().V().inE('self')
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
gremlin>
```

@okram: This doesn't look correct, but then again, I don't really know what it should do. Should it completely destroy the self-loop, so that inE/bothE/outE all fail to emit anything associated with the self-loop? Should it transform the self-loop into a single-directional edge (which seems like one interpretation of what it does now, in which case it might be correct)?

I noticed that the name of my original branch is nonsense ("DSP-1397"). I don't think that can be changed after a PR is created. So, I'm opening a pair of new PRs: one for tp31 and one for master.
[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/372 Thanks for the replies. @spmallette I will investigate whether this behavior also exists on 3.1 and retarget if so. I think this will involve making a standalone test instead of just testing TP against my app and stimulating the failure that way. @okram Yes, I realize that bothE() should return two edges. The problem this PR tried to address was that, after applying an inE graph filter to a self-looped StarVertex, the StarVertex had two edges in its outEdges map and no edges in its inEdges map (should be one each). I think this is not just some obscure internals problem, because I discovered it through a traversal. Maybe I can articulate this better with a test.
[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges
Github user dalaro commented on the issue: https://github.com/apache/tinkerpop/pull/372 @okram also, thanks for the code; I'll have a closer look at it when I iterate on this PR.
[GitHub] tinkerpop pull request #372: DSP-1397 Fix StarGraph addEdge for self-edges
GitHub user dalaro opened a pull request: https://github.com/apache/tinkerpop/pull/372

DSP-1397 Fix StarGraph addEdge for self-edges

StarGraph has directionally-specific classes for its edges:

* StarInEdge
* StarOutEdge

StarIn(/Out)Edges belong only in a StarVertex's in(/out)Edges map. We see this assumption in StarVertex.applyGraphFilter, which invokes `graphFilter.legalEdges(this)` and then invokes `instanceof StarOutEdge` on each legal edge to determine its direction.

This assumption was violated by the StarVertex.addEdge method. This method special-cases self-edge addition. It first unconditionally adds a StarOutEdge. Then, it checks to see whether its parameters indicate that it is adding a self-edge. If so, the method proceeds to add that same StarOutEdge instance to the inEdges map. This is the problem. Adding a self-edge to a StarVertex results in one StarOutEdge instance that belongs to both the inEdges and outEdges maps, which later causes the `applyGraphFilter` method's `instanceof` checks to yield two out edges and no in edges on a self-loop.

This commit changes StarVertex so that self-edges are represented by a new class, StarSelfEdge.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalaro/incubator-tinkerpop DSP-1397

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/372.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #372

commit dddc44369e2c645a854b2a7bed0f4af51acd6fc5
Author: Dan LaRocque
Date: 2016-08-04T04:41:44Z

DSP-1397 Fix StarGraph addEdge for self-edges

StarGraph has directionally-specific classes for its edges:

* StarInEdge
* StarOutEdge

StarIn(/Out)Edges belong only in a StarVertex's in(/out)Edges map. We see this assumption in StarVertex.applyGraphFilter, which invokes `graphFilter.legalEdges(this)` and then invokes `instanceof StarOutEdge` on each legal edge to determine its direction.

This assumption was violated by the StarVertex.addEdge method. This method special-cases self-edge addition. It first unconditionally adds a StarOutEdge. Then, it checks to see whether its parameters indicate that it is adding a self-edge. If so, the method proceeds to add that same StarOutEdge instance to the inEdges map. This is the problem. Adding a self-edge to a StarVertex results in one StarOutEdge instance that belongs to both the inEdges and outEdges maps, which later causes the `applyGraphFilter` method's `instanceof` checks to yield two out edges and no in edges on a self-loop.
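The `instanceof`-based direction check that the shared-instance self-edge breaks can be sketched in plain Java. These are illustrative minimal stand-ins, not TinkerPop's actual classes, but the classification logic mirrors the pattern applyGraphFilter relies on:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why one shared edge instance in both maps miscounts directions.
public class SelfEdgeSketch {
    static class StarEdge {}
    static class StarOutEdge extends StarEdge {}
    static class StarInEdge extends StarEdge {}

    public static void main(String[] args) {
        List<StarEdge> inEdges = new ArrayList<>();
        List<StarEdge> outEdges = new ArrayList<>();

        // Buggy addEdge shape for a self-loop: one StarOutEdge instance
        // placed into BOTH maps.
        StarOutEdge loop = new StarOutEdge();
        outEdges.add(loop);
        inEdges.add(loop);

        // A filter that walks all legal edges and classifies by instanceof
        // now sees two out-edges and zero in-edges for the self-loop.
        int out = 0, in = 0;
        List<StarEdge> legal = new ArrayList<>(outEdges);
        legal.addAll(inEdges);
        for (StarEdge e : legal) {
            if (e instanceof StarOutEdge) out++; else in++;
        }
        System.out.println("out=" + out + " in=" + in);
    }
}
```

Putting a direction-appropriate instance in each map (the fix eventually merged) makes the `instanceof` classification report one edge per direction.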
[GitHub] tinkerpop pull request #:
Github user dalaro commented on the pull request: https://github.com/apache/tinkerpop/commit/f9706a38fbe546c0a7e5b038a6b99303d21aec01#commitcomment-18144496 @okram I don't understand the failure this caused, but the change seems innocuous to me... no objections
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Just a reminder: we are voting on branch https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321. That subsumes the PR and adds a combination of fixes from Marko and me.
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325

These questions involved a lot of back-and-forth that would have been slow to resolve on GitHub comments, so Marko and I discussed the failures and the code on a hangout earlier today. Summary:

* Marko created https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP repo. This branch is my PR plus additional tweaks Marko and I worked out on top. His branch supersedes this PR branch now. This PR branch **should not be merged because Marko's branch subsumes/extends it**.
* I have a rough idea of what changed as we discussed code on the hangout, but I will go through his branch's new commits and review.
* The new serialization code is lower-priority than the old code. It has to be turned on through explicit user action, like this in the test provider code:

```
System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY, UnshadedKryoShimService.class.getCanonicalName());
config.put("spark.serializer", KryoSerializer.class.getCanonicalName());
config.put("spark.kryo.registrator", GryoRegistrator.class.getCanonicalName());
```

* Marko set up SparkHadoopGraphProvider to randomly choose either the old serialization code or the new stuff in https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91. IMHO that should be overrideable to make it deterministic, but it's not a huge deal since it's limited to test code.
* We replaced `GryoClassResolver` with a series of class registrations in https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207. The idea here is to explicitly configure which types need a detaching serializer rather than relying on `GryoClassResolver` to automagically do it regardless of explicit class registrations. Three factors influence this choice:
  * Spark's KryoSerializer does not support a user-provided ClassResolver, and working around that limitation entails either getting a change through Spark upstream or copying and pasting a bunch of KryoSerializer's private/hardcoded class registrations, some of which could change in future Spark versions. The latter is ugly and would make the code brittle across Spark upgrades.
  * `GryoClassResolver` is convenient but somewhat mysterious to third parties, who are likelier to have seen a custom class registration mechanism (like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom resolver. It also removes vendor control in that its detachment behavior is not configurable, so modifying it requires subclassing or reimplementation. For instance, if a vendor had a lightweight Vertex implementation with a custom serializer and wanted to bypass TP's resolver-driven detachment, then I don't think it would be possible without modifying TP's resolver.
  * `GryoClassResolver` is written against a bunch of shaded types and relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one version too late for compatibility with Spark's Kryo 2.21. This is the least important concern of the three, more an implementation detail.
* Marko is running the full integration test suite now. He ran the Spark ITs just before pushing and reported success on those specifically.
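The registration-versus-resolver distinction discussed above can be sketched without Kryo: an explicit registry enumerates up front exactly which classes get a special (detaching) serializer, while a resolver intercepts every class lookup and decides dynamically. A minimal plain-Java stand-in (all names here are illustrative, not TinkerPop or Kryo API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java stand-in for "explicit class registration": each class that
// needs special handling is listed up front; anything unlisted falls
// through to a default serializer. A resolver-based design would instead
// decide per class at lookup time, which a vendor cannot override without
// replacing the resolver itself.
public class RegistrationSketch {
    static final Map<Class<?>, Function<Object, String>> REGISTRY = new HashMap<>();

    static {
        // Hypothetical "detaching" serializer, registered explicitly for
        // one concrete type.
        REGISTRY.put(StringBuilder.class, o -> "detached:" + o);
    }

    static String serialize(Object o) {
        return REGISTRY
                .getOrDefault(o.getClass(), x -> "default:" + x)
                .apply(o);
    }

    public static void main(String[] args) {
        System.out.println(serialize(new StringBuilder("v1")));  // registered type
        System.out.println(serialize("plain"));                  // unregistered type
    }
}
```

The explicit table is what makes the behavior inspectable and overridable by vendors, which is the first two of the three factors above in miniature.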
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I saw the `shouldHandleMonthDay` test failure on CI and pushed a fix. I carelessly broke the read leg of that serializer while doing some tedious, boilerplate-y method signature changes to port a batch of serializers to the shim to support IoRegistry unification. The one-liner commit that I just pushed fixed the problem locally.
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 I just pushed some changes that I hacked together this weekend. The key additions are:

* `TinkerPopKryoRegistrator`, which I extracted from my app, and which acts as a `spark.kryo.registrator` impl that knows about TinkerPop types
* `IoRegistryAwareKryoSerializer`, which is a Spark `Serializer` that looks for `GryoPool.CONFIG_IO_REGISTRY` and applies it if present
* `KryoShimLoaderService.applyConfiguration(cfg)`, which replaces direct calls to `HadoopPools.initialize(cfg)` and adds equivalent functionality for initializing the unshaded Kryo serializer pool

The user would theoretically just set

```
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.IoRegistryAwareKryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.TinkerPopKryoRegistrator
# Optional, only needed for custom types
gremlin.io.registry=whatever.user.IoRegistryImpl
```

In practice, when I have a custom gremlin.io.registry, I have always had to take the additional step (long before this PR) of forcibly initializing `HadoopPools` before touching SparkGraphComputer in my app, or else some part of Spark -- I think the closure serializer -- would attempt to use HadoopPools via ObjectWritable/VertexWritable before initialization and produce garbage on my custom classes. **This problem predates my PR**. I'm not trying to solve it here, in part because I still don't know whether it's a pathology specific to my app or whether TinkerPop is missing a crucial `HadoopPools.initialize` (now, equivalently, `KryoShimLoaderService.applyConfiguration`) call somewhere, and in part because HadoopPools is such a hideous architectural wart that the ultimate solution probably involves destroying it.
In the past, I've worked around this by defining a custom spark.serializer that delegates newKryo() to a GryoSerializer/IoRegistryAwareSerializer, but which has a constructor that invokes `HadoopPools.initialize`/`KryoShimLoaderService.applyConfiguration` (relying on that method's idempotence). Again, this initialization step may just be specific to my app and unnecessary for the average TinkerPop user. It's possible that the config I pasted above will work for others. FWIW, this passes, so the overrides bug should be fixed along with all this refactoring stuff:

```
mvn clean install -DskipTests=true && mvn verify -pl gremlin-server -DskipIntegrationTests=false -Dtest.single=GremlinResultSetIntegrateTest
```
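The workaround described above -- a serializer constructor that leans on an idempotent initialize call -- can be modeled with a stdlib-only sketch. All class and method names below are hypothetical stand-ins, not TinkerPop or Spark API; the point is only the first-caller-wins guard that makes repeated constructor-time calls safe:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for HadoopPools/KryoShimLoaderService: the first
// caller wins, later calls are cheap no-ops, so it is safe to invoke this
// "just in case" from every serializer constructor.
final class PoolInitializer {
    private static final AtomicBoolean INITIALIZED = new AtomicBoolean(false);
    private static volatile Properties activeConfig;

    static void applyConfiguration(final Properties conf) {
        // compareAndSet guarantees exactly one caller performs the real work
        if (INITIALIZED.compareAndSet(false, true)) {
            activeConfig = conf; // stand-in for actually building the pool
        }
    }

    static Properties activeConfig() {
        return activeConfig;
    }
}

public class IdempotentInitDemo {
    public static void main(String[] args) {
        final Properties first = new Properties();
        first.setProperty("gremlin.io.registry", "whatever.user.IoRegistryImpl");
        PoolInitializer.applyConfiguration(first);

        // A later call (e.g. from another serializer constructor) is ignored
        PoolInitializer.applyConfiguration(new Properties());

        System.out.println(PoolInitializer.activeConfig().getProperty("gremlin.io.registry"));
    }
}
```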
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Additional notes from Marko:

> okay, so as long as there is never a need for MyRegistrar, then can you please update your PR accordingly?
> Finally, call it GryoRegistrar (not TinkerPopRegistrar) as this has to do with Gryo IORegistry classes.
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 After discussion on Slack, I think @okram and I tentatively agreed to proceed with this PR after I do additional work to save users who have custom serializers the effort of maintaining separate `IoRegistry` and `spark.kryo.registrator` implementations with near-identical contents. I may discover other complications during the process, but I think this involves two changes:

1. I will attempt to subclass KryoSerializer so that I can access the SparkConf passed to its constructor and check for `GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then apply any registrations found therein so long as each registration either:
   * specifies no explicit serializer (using Kryo's internal default) or
   * specifies a shim serializer

   In particular, if the registry contains an old-style TP shaded Kryo serializer that hasn't been ported to the shim, then the KryoSerializer subclass will throw an exception. This change is necessary to support custom-serialized, `IoRegistry`-listed datatypes that appear in most Spark data outside of closures (say, in the RDD itself).

2. I will replace current callsites of `HadoopPools.initialize(conf)` with some other static method that calls `HadoopPools.initialize(conf)` and then calls some roughly equivalent `initialize(conf)` method for the static KryoPool backing `KryoPoolShimService`. This change is necessary to support custom-serialized, `IoRegistry`-listed datatypes that appear in closures.
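The acceptance rule in point 1 can be sketched with stdlib-only stand-ins. The string-based registration model and every class name here are invented for illustration (the real check would inspect serializer instances, not names), but the decision logic is the one described above: no explicit serializer is fine, a shim serializer is fine, a shaded-Kryo serializer fails fast:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of validating IoRegistry registrations before applying them to
// Spark's KryoSerializer: type name -> serializer class name (null means
// "use Kryo's internal default serializer").
public class RegistrationCheckDemo {
    static final String SHADED_PREFIX = "org.apache.tinkerpop.shaded.";

    static void validate(final Map<String, String> registrations) {
        for (final Map.Entry<String, String> e : registrations.entrySet()) {
            final String serializerClass = e.getValue();
            if (serializerClass == null) continue;            // Kryo default: acceptable
            if (serializerClass.startsWith(SHADED_PREFIX))    // shaded serializer: reject
                throw new IllegalStateException("Serializer for " + e.getKey()
                        + " targets shaded Kryo: " + serializerClass);
            // anything else is assumed here to be a shim serializer
        }
    }

    public static void main(String[] args) {
        final Map<String, String> regs = new LinkedHashMap<>();
        regs.put("com.example.Fine", null);                       // default serializer
        regs.put("com.example.AlsoFine", "com.example.ShimSer");  // shim serializer
        regs.put("com.example.Bad", SHADED_PREFIX + "kryo.Serializer");
        try {
            validate(regs);
        } catch (IllegalStateException ex) {
            System.out.println("rejected: " + ex.getMessage().contains("com.example.Bad"));
        }
    }
}
```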
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 The problem is that GryoSerializer doesn't actually work. This is not the first time it has shown behavior on Spark or scala runtime classes which diverges from the behavior of Spark's KryoSerializer/JavaSerializer. I realize that asking users to write custom serializers against the shim (to be compatible with everything) in the long term or to duplicate registrations in the short term (if they want to use Gryo in one place and Spark graph computation in another, using the same custom types) is not ideal, but I think the status quo is even worse.
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 BTW, in case you're wondering why I think it's hard to create proxies, recall that o.a.t.shaded.kryo.{Kryo,io.Input,io.Output} are all classes, not interfaces. There are a lot of approaches that would make this work, but the ones I can think of suck for various reasons.
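A quick demonstration of that point: `java.lang.reflect.Proxy` happily wraps interfaces but refuses concrete classes, which is exactly the situation with the shaded `Kryo`/`Input`/`Output` types (the `Greeter` names below are made up for the demo):

```java
import java.lang.reflect.Proxy;

// JDK dynamic proxies work only for interfaces; handing them a concrete
// class throws IllegalArgumentException, so cheap proxying of the shaded
// Kryo classes is off the table.
public class ProxyDemo {
    interface Greeter { String greet(); }   // interfaces proxy fine
    static class ConcreteGreeter { }        // stand-in for a concrete class like shaded Kryo

    public static void main(String[] args) {
        final Greeter g = (Greeter) Proxy.newProxyInstance(
                ProxyDemo.class.getClassLoader(),
                new Class<?>[]{Greeter.class},
                (proxy, method, a) -> "hi");
        System.out.println(g.greet());

        try {
            Proxy.newProxyInstance(
                    ProxyDemo.class.getClassLoader(),
                    new Class<?>[]{ConcreteGreeter.class},
                    (proxy, method, a) -> null);
        } catch (IllegalArgumentException e) {
            System.out.println("cannot proxy a class");
        }
    }
}
```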
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 To do that cleanly, users would have to write serializers against the shim interface. Of course, the shim is only relevant to users writing new serializers or willing to refactor their old ones. I see no clean way to make this work with existing serializers that users have hypothetically already compiled against TP's shaded Kryo. Making a user serializer written for TP's shaded Kryo link against Spark's unshaded Kryo is technically possible with some degree of skulduggery; it's just that all of the ways to do it that I'm aware of are unspeakably ugly.
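The shim approach mentioned above can be illustrated with a stdlib-only toy (these interfaces are invented stand-ins, not the PR's actual shim API): a serializer compiled only against narrow shim interfaces never names a concrete Kryo class, so thin adapters can drive it from either Kryo flavor.

```java
// Invented shim interfaces: the serializer depends on these, never on a
// concrete (shaded or unshaded) Kryo Input/Output class.
interface OutputShim { void writeString(String s); }
interface InputShim  { String readString(); }

// A serializer written once against the shim
final class PersonShimSerializer {
    void write(final OutputShim out, final String person) { out.writeString(person); }
    String read(final InputShim in) { return in.readString(); }
}

public class ShimDemo {
    public static void main(String[] args) {
        // A StringBuilder plays the role of the wire; method references act
        // as the adapter for one hypothetical Kryo flavor.
        final StringBuilder wire = new StringBuilder();
        final OutputShim out = wire::append;
        final InputShim in = wire::toString;

        final PersonShimSerializer ser = new PersonShimSerializer();
        ser.write(out, "marko");
        System.out.println(ser.read(in));
    }
}
```

A second adapter pair backed by the other Kryo flavor could drive the same `PersonShimSerializer` unchanged, which is the whole appeal.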
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 @spmallette You're right about overrides being the problem. The last thing I did before opening this PR was to rebase on latest master, when I encountered conflicts solely involving the GryoMapper registration override feature added in d7684757734732f39970265a425b921a48d8f0fb. I attempted to resolve those conflicts too hastily and got it wrong. I went back to Stephen's original commit and reworked my code around registration overrides, and I now think it conforms to the old behavior precedent. Specifically, when handling overrides of GryoMapper type registrations, I had departed from the precedent set by Stephen's commit in two ways, both of which I've fixed locally:

* Override registrations allocated a new registration ID. Your commit used the old ID from the registration being overridden.
* As a side effect of the preceding point, I incremented the shared `currentSerializationId` for all registration calls. Your commit incremented this shared counter only for non-override registration calls.

The test passes locally for me now. I'm still doing some registrator refactoring and haven't pushed yet though. @okram The problem with custom registrations is that GryoMapper allows the user to provide a serializer written against shaded Kryo. This is not compatible with Spark's KryoSerializer. I don't see a way to make it compatible without putting fake `org.apache.tinkerpop.shaded.kryo.*` classes on the classpath, which could get pretty ugly.
Instead, I'm adding a registrator that includes:

* all default mappings from GryoMapper (I am changing the shaded-Kryo serializers to be shim serializers so that they are reusable)
* the TinkerPop classes registered in GryoSerializer's constructors (but not the scala and Spark classes registered in there)

If the user wants to register custom types while using KryoSerializer, they can extend the registrator in a subclass, then set `spark.kryo.registrator` to the subclass. The interface in question is `KryoRegistrator` and it has only one method, `registerClasses(Kryo)`. It's about as simple as -- maybe even simpler than -- implementing TP's `IoRegistry` interface, which has two methods. If the user decides to continue using GryoSerializer, they can also continue using `GryoPool.CONFIG_IO_REGISTRY`, which should still affect GryoSerializer/HadoopPools as it always has. WDYT?
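The subclass-the-registrator pattern could look roughly like this, modeled with stdlib stand-ins (a `Map` plays the role of the `Kryo` instance, and every class name below is invented for illustration; only the pattern itself is from the discussion):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the registrator's single registerClasses(Kryo) method:
// the base class registers TinkerPop defaults, and a user subclass layers
// its own types on top, then names the subclass in spark.kryo.registrator.
class BaseRegistrator {
    public void registerClasses(final Map<String, String> kryo) {
        kryo.put("org.apache.tinkerpop.gremlin.structure.util.star.StarGraph",
                 "StarGraphSerializer");
    }
}

class MyAppRegistrator extends BaseRegistrator {
    @Override
    public void registerClasses(final Map<String, String> kryo) {
        super.registerClasses(kryo);              // keep all default registrations
        kryo.put("com.example.MyVertexId", "MyVertexIdSerializer");
    }
}

public class RegistratorDemo {
    public static void main(String[] args) {
        final Map<String, String> kryo = new LinkedHashMap<>();
        new MyAppRegistrator().registerClasses(kryo);
        // both the inherited default and the custom registration are present
        System.out.println(kryo.size());
    }
}
```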
[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...
Github user dalaro commented on the issue: https://github.com/apache/incubator-tinkerpop/pull/325 Thanks for the comments and for running the tests. @okram:

1. I also dislike HadoopPools and would prefer to remove it, but I don't know how to do it correctly, even after considering the problem for a while. Here are my open questions and observations.

HadoopPools apparently exists to help Vertex/ObjectWritable implement Serializable. They must implement Serializable because they keep showing up in `spark.closure.serializer`'s workload, and only JavaSerializer can be used as `spark.closure.serializer`. A natural question here: is it reasonable and normal for graph types to appear in these closures in the first place?

Spark has this nifty `spark.kryo.registrator` mechanism to support user-registered types with Kryo, and its docs talk at length about setting `spark.serializer`. The docs read as though that was all the user would need to customize to deal with custom data. `spark.closure.serializer` seems to barely be mentioned. I've only seen it in the manual's exhaustive table of config options (http://spark.apache.org/docs/latest/configuration.html). Even then, the docs merely comment that JavaSerializer is its only supported value. There's some old list traffic that says KryoSerializer is unsupported because, back when it was originally tested, it blew up on pretty much any Java closure, though that might have changed since.

I wonder if this indicates some kind of usage antipattern. Could there be an alternate approach to traversal/vertexprogram evaluation in SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark closures? That would make the use-case for HadoopPools disappear. Then again, it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`. I am full of uncertainty on this point. I don't have deep enough Spark knowledge to judge whether what we're doing now is reasonable.
As long as Object/VertexWritable keep popping up in closures created by SparkGraphComputer & friends, we need some way to serialize them and everything they refer to, which can apparently involve a lot of graph entities. We could theoretically implement Serializable "manually" (without Kryo/Gryo), but that seems like a massive amount of work, potentially inefficient, definitely redundant, and adds a huge new bug surface. I'm against that approach, but that's about the only thing I know for certain here.

2. In the long term, I think users should set

```
spark.serializer=KryoSerializer
spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
```

and TinkerPop should stop registering/caring about Spark and scala-runtime types in its serialization layer. This PR tries to keep GryoSerializer untouched though. That migration doesn't have to be forced.

I'm already using the `spark.kryo.registrator` setting in my app, where I have an impl that registers all TinkerPop types plus all of my app's types. I can refactor that by extracting a TinkerPopRegistrator that covers the TP types (but not my app's). I probably should have done that at the start. As it stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" kit, but the very last assembly steps involving a `spark.kryo.registrator` impl are left as an exercise to the user. I'll try to introduce a TinkerPopRegistrator to address that, though I may reconsider if supporting custom type registration turns out way more complicated than I'm imagining.

I think Gryo more generally is useful and can stay around indefinitely (say, for reading/writing data to the filesystem), but GryoSerializer should eventually stop being the recommended spark.serializer. If that is all GryoSerializer per se exists to do, then it could be removed and any TinkerPop-specific class registrations that it performs moved to TinkerPopRegistrator.

3. This PR should maintain full format compatibility.
I deliberately tried to keep all of GryoMapper and GryoSerializer's class registrations, class IDs, and associated serializers (including StarGraphSerializer) functionally unchanged. If you see somewhere that I unintentionally changed the format, that's a bug.

@spmallette: re. package organization: definitely makes sense, I'll move kryoshim under io.gryo. Please let me know if I bungled the package hierarchy anywhere else. re. path forward: I think 1 and 2 above touch on both HadoopPools and GryoSerializer.
[GitHub] incubator-tinkerpop pull request #325: TINKERPOP-1321 Introduce Kryo shim to...
GitHub user dalaro opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/325 TINKERPOP-1321 Introduce Kryo shim to support serializer reuse https://issues.apache.org/jira/browse/TINKERPOP-1321 You can merge this pull request into a Git repository by running: $ git pull https://github.com/dalaro/incubator-tinkerpop serialization-flexibility Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tinkerpop/pull/325.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #325 commit ef52869788ebf5b8b825f78ef21e1d38423d9aa0 Author: Dan LaRocque Date: 2016-06-02T07:09:29Z Introduce Kryo shim to support serializer reuse