[GitHub] tinkerpop issue #616: TINKERPOP-1680 Add string performance options to StarG...

2017-05-30 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/616
  
I force-pushed this PR's branch.  The new basis is 
4f7b8595bbcd691649e63e49deb9ed0adcaa81e7 (May 28).


[GitHub] tinkerpop issue #616: TINKERPOP-1680 Add string performance options to StarG...

2017-05-30 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/616
  
@spmallette maybe -- basis of the PR branch is 
b755788c6fe7b902a3e81f3c2f11eb5a8b02c4be (May 11).  I'll rebase on latest tp32 
and force-push the PR branch.


[GitHub] tinkerpop pull request #616: TINKERPOP-1680 Add string performance options t...

2017-05-26 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/616

TINKERPOP-1680 Add string performance options to StarGraph

By default, StarGraph invokes intern() on vertex label and property
key strings.  It also toString()s vertex and edge IDs for comparisons
and existence checks.

This commit introduces a new StarGraph builder with a pair of boolean
options controlling these behaviors.
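
A minimal sketch of what using the new builder might look like.  The option
method names (`internStrings`, `compareIdsUsingStrings`) and `create()` are
illustrative assumptions, not taken from the diff:

```
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.structure.util.star.StarGraph;

public class StarGraphBuilderExample {
    public static void main(final String[] args) {
        // Hypothetical usage sketch: disable both behaviors described above.
        final StarGraph graph = StarGraph.build()
                .internStrings(false)            // skip String.intern() on labels and property keys
                .compareIdsUsingStrings(false)   // compare raw ids instead of their toString() forms
                .create();
        final Vertex v = graph.addVertex("name", "foo");
        System.out.println(v);
    }
}
```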

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1680

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/616.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #616


commit 0155665162668f68713bc4fe5a95678a6905c72b
Author: Dan LaRocque 
Date:   2017-04-04T23:41:29Z

TINKERPOP-1680 Add string performance options to StarGraph

By default, StarGraph invokes intern() on vertex label and property
key strings.  It also toString()s vertex and edge IDs for comparisons
and existence checks.

This commit introduces a new StarGraph builder with a pair of boolean
options controlling these behaviors.




[GitHub] tinkerpop pull request #497: Limit JVM system props passed around in Spark j...

2016-11-17 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/497

Limit JVM system props passed around in Spark jobs

Prior to this commit, SGC indiscriminately set the entire job config
as JVM system properties on Spark executors.  It didn't account for
the fact that some config values (e.g. spark.job.description) could
have spaces.  A value with a space wouldn't get quoted.  This led to
Spark workers failing to start, because part of the unquoted value
would be erroneously interpreted as the JVM main class.

This commit makes SGC only pass two config settings as system props:

 * gremlin.io.registry
 * gremlin.io.kryoShimService
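
A hedged sketch of the whitelisting described above; only the two keys come
from the commit, while the helper name and surrounding wiring are illustrative
assumptions:

```
import java.util.Map;

public final class ExecutorOptsExample {
    // Hypothetical helper: forward only the whitelisted keys as -D system properties,
    // so values containing spaces (e.g. spark.job.description) never end up unquoted
    // on the executor's JVM command line.
    static String buildExecutorJavaOpts(final Map<String, String> jobConf) {
        final String[] forwardedKeys = {"gremlin.io.registry", "gremlin.io.kryoShimService"};
        final StringBuilder opts = new StringBuilder();
        for (final String key : forwardedKeys) {
            final String value = jobConf.get(key);
            if (value != null) {
                opts.append("-D").append(key).append('=').append(value).append(' ');
            }
        }
        return opts.toString().trim();
    }
}
```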

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop 
TINKERPOP-1389-jvm-arg-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #497


commit 69877a20ceb9156ecf8ed7062ba3b64f56ab9134
Author: Dan LaRocque 
Date:   2016-11-17T16:33:58Z

Limit JVM system props passed around in Spark jobs

Prior to this commit, SGC indiscriminately set the entire job config
as JVM system properties on Spark executors.  It didn't account for
the fact that some config values (e.g. spark.job.description) could
have spaces.  A value with a space wouldn't get quoted.  This led to
Spark workers failing to start, because part of the unquoted value
would be erroneously interpreted as the JVM main class.

This commit makes SGC only pass two config settings as system props:

 * gremlin.io.registry
 * gremlin.io.kryoShimService




[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...

2016-10-25 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/462
  
I made a version of this change for TINKERPOP-1389 as #466.  That one's 
optional, but I thought merging this into TINKERPOP-1389 might help ensure this 
fix does not get lost in the eventual possibly-conflict-y and confusing merge 
of TINKERPOP-1389 to master.


[GitHub] tinkerpop pull request #466: TINKERPOP-1525 Avoid starting VP worker iterati...

2016-10-25 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/466

TINKERPOP-1525 Avoid starting VP worker iterations that never end (Spark 
2.0 version)

This is exactly like #462, except that it tracks a switch between Spark 1.6 
and 2.0 away from functions that manipulate iterables to those that manipulate 
iterators.

Assuming #462 eventually gets into master, and assuming that TINKERPOP-1389 
eventually merges with master, the second merge will conflict.  It still seems 
marginally safer to make this change in parallel on TINKERPOP-1389 and 
master/tp32 than just the latter, since the conflict will look more like "oh i 
better keep one of these two almost-identical edge-case checks" than "oh the 
Spark 1.x branch had some silly edge case check that I can just delete for 2.0".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop 
TINKERPOP-1525-for-TINKERPOP-1389

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/466.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #466


commit 2a8f741190beebd7b7e6a9ff7922afb9b6807fa5
Author: Dan LaRocque 
Date:   2016-10-26T00:37:17Z

Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way
that an empty RDD partition would cause it to invoke
VertexProgram.workerIterationStart without ever invoking
VertexProgram.workerIterationEnd.  This seems like a contract
violation.  I have at least one VP that relies on
workerIterationStart|End to allocate and release resources.  Failing
to invoke End like this causes a leak in that VP, as it would for any
VP that uses that resource management pattern.

(cherry picked from commit 36e1159a80f539b8bd4a884e5c1cf304ec52c4f9;
 this is the same change, except it tracks a switch between Spark 1.6
 and 2.0 away from functions that manipulate iterables to those that
 manipulate iterators)




[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...

2016-10-25 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/462
  
I force-pushed this branch with a fix (`emptyIterator -> emptyList`).


[GitHub] tinkerpop issue #462: TINKERPOP-1525 Avoid starting VP worker iterations tha...

2016-10-25 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/462
  
Marko's right: I wrote and tested this against TINKERPOP-1389 (before the 
latest rebase), then cherry-picked against current master without retesting.  
Sorry about that.  He's also right about the Iterator/Iterable change.  I'll 
change it sometime today, force-push the branch, and comment.


[GitHub] tinkerpop pull request #462: TINKERPOP-1525 Avoid starting VP worker iterati...

2016-10-24 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/462

TINKERPOP-1525 Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way
that an empty RDD partition would cause it to invoke
VertexProgram.workerIterationStart without ever invoking
VertexProgram.workerIterationEnd.  This seems like a contract
violation.  I have at least one VP that relies on
workerIterationStart|End to allocate and release resources.  Failing
to invoke End like this causes a leak in that VP, as it would for any
VP that uses that resource management pattern.
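
A hedged sketch of the balanced start/end contract this change enforces; the
names and method shape are simplified assumptions, not the actual
SparkExecutor code:

```
import java.util.Iterator;

import org.apache.tinkerpop.gremlin.process.computer.Memory;
import org.apache.tinkerpop.gremlin.process.computer.Messenger;
import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public final class WorkerIterationSketch {
    // Illustrative shape only: never call workerIterationStart for an empty partition,
    // and always balance it with workerIterationEnd once it has been called.
    static <M> void runPartition(final VertexProgram<M> program,
                                 final Messenger<M> messenger,
                                 final Memory memory,
                                 final Iterator<Vertex> partition) {
        if (!partition.hasNext()) {
            return;                              // empty partition: Start is never invoked
        }
        program.workerIterationStart(memory);
        try {
            while (partition.hasNext()) {
                program.execute(partition.next(), messenger, memory);
            }
        } finally {
            program.workerIterationEnd(memory);  // always invoked once Start has run
        }
    }
}
```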

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1525-tp32

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/462.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #462


commit 1fc6ec5d7c20d6163c61f84198e3e68457e23eae
Author: Dan LaRocque 
Date:   2016-10-21T21:04:30Z

Avoid starting VP worker iterations that never end

SparkExecutor.executeVertexProgramIteration was written in such a way
that an empty RDD partition would cause it to invoke
VertexProgram.workerIterationStart without ever invoking
VertexProgram.workerIterationEnd.  This seems like a contract
violation.  I have at least one VP that relies on
workerIterationStart|End to allocate and release resources.  Failing
to invoke End like this causes a leak in that VP, as it would for any
VP that uses that resource management pattern.




[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-10 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/372
  
I'm closing this not because the conversation is settled and everything's 
done, but because this PR is superseded by #377.  We can keep discussing here 
if you like.  That's probably easier than juggling three PRs.


[GitHub] tinkerpop pull request #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-10 Thread dalaro
Github user dalaro closed the pull request at:

https://github.com/apache/tinkerpop/pull/372


[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-10 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/372
  
I opened two PRs (one for tp31 as Stephen requested, and a new one for 
master):

* #377 (master)
* #378 (tp31)

As far as I can tell, GraphFilter did not exist in 3.1, and it may not even 
be possible to induce the buggy traversal behavior described in #372 (which was 
all based on master / 3.2). The underlying datastructure corruption -- putting 
a StarOutEdge into the inEdges map -- does exist on 3.1, and @okram's addEdge 
fix applies cleanly on 3.1, but without applyGraphFilter, I'm not sure it even 
matters for correctness on 3.1.


[GitHub] tinkerpop pull request #378: TINKERPOP-1397 Fix StarGraph.addEdge

2016-08-10 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/378

TINKERPOP-1397 Fix StarGraph.addEdge

See:

* old PR #372
* JIRA issue https://issues.apache.org/jira/browse/TINKERPOP-1397

As far as I can tell, GraphFilter did not exist in 3.1, and it may not even 
be possible to induce the buggy traversal behavior described in #372 (which was 
all based on master / 3.2).  The underlying datastructure corruption -- putting 
a StarOutEdge into the `inEdges` map -- does exist on 3.1, and @okram's 
`addEdge` fix applies cleanly on 3.1, but without `applyGraphFilter`, I'm not 
sure it even matters for correctness on 3.1.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1397-tp31

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #378


commit 0022b7f6be25eb7d3c778b137beb6e8a7d2784ca
Author: Dan LaRocque 
Date:   2016-08-10T22:52:13Z

TINKERPOP-1397 Fix StarGraph.addEdge

For self-loops, StarGraph.addEdge used to put a single StarOutEdge
into both its inEdges and outEdges maps, potentially causing problems
in applyGraphFilter.

This change makes StarGraph.addEdge put the appropriate type of edge
(Star(Out/In)Edge) in the associated map.  The IDs for each edge
instance are kept in agreement.

This change is @okram's, who suggested it in PR #372.  I merely
reviewed it and added a couple of comments.
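
A hedged sketch of the fixed self-loop behavior described above; the factory
calls and map plumbing are illustrative stand-ins, not the real (private)
StarGraph internals:

```
// Illustrative only: a self-loop gets one shared edge id but two directional
// edge objects, one per map, so type-based direction checks stay correct.
final Object sharedEdgeId = nextEdgeId();                      // hypothetical id supplier
outEdges.computeIfAbsent(label, k -> new ArrayList<>())
        .add(newStarOutEdge(sharedEdgeId, label, vertexId));   // hypothetical factory
if (isSelfLoop) {
    inEdges.computeIfAbsent(label, k -> new ArrayList<>())
           .add(newStarInEdge(sharedEdgeId, label, vertexId)); // same id, in-edge type
}
```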




[GitHub] tinkerpop pull request #377: TINKERPOP-1397 StarGraph filtering with self ed...

2016-08-10 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/377

TINKERPOP-1397 StarGraph filtering with self edges

See:

* old PR #372
* JIRA issue https://issues.apache.org/jira/browse/TINKERPOP-1397

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop TINKERPOP-1397

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/377.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #377


commit 0022b7f6be25eb7d3c778b137beb6e8a7d2784ca
Author: Dan LaRocque 
Date:   2016-08-10T22:52:13Z

TINKERPOP-1397 Fix StarGraph.addEdge

For self-loops, StarGraph.addEdge used to put a single StarOutEdge
into both its inEdges and outEdges maps, potentially causing problems
in applyGraphFilter.

This change makes StarGraph.addEdge put the appropriate type of edge
(Star(Out/In)Edge) in the associated map.  The IDs for each edge
instance are kept in agreement.

This change is @okram's, who suggested it in PR #372.  I merely
reviewed it and added a couple of comments.

commit 7842c4e7e2e3c2b33fc1bca7a8a80e165080bc59
Author: Dan LaRocque 
Date:   2016-08-10T22:59:54Z

TINKERPOP-1397 Add StarGraph bothE filtering test

commit b4aaa69b9c10a61c7ab2f7476c90e4728ef032e8
Author: Dan LaRocque 
Date:   2016-08-10T23:24:10Z

Merge branch 'TINKERPOP-1397-tp31' into TINKERPOP-1397




[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-10 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/372
  
I pulled master (@ fc79de813c788ee141175bbf1ea6d0cdefa94a0f) and confirmed 
that the bug still existed.  I didn't apply your proposed change right away.  
Here is a `gremlin.sh` session that shows the buggy behavior.  I'm also pushing a 
unit test that I developed first, but it's easier to see what's going wrong in 
this `gremlin.sh` session than in the source code for a unit test.

```
$ `find gremlin-console/target -name gremlin.sh`

 \,,,/
 (o o)
-oOOo-(3)-oOOo-
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> sg = org.apache.tinkerpop.gremlin.structure.util.star.StarGraph.open()
==>stargraph[starOf:null]
gremlin> v = sg.addVertex('name', 'foo')
==>v[0]
gremlin> v.addEdge('self', v)
==>e[2][0-self->0]
gremlin> sg.traversal().V().inE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]
// NOTE: StarOutEdge in the inEdges datastructure
gremlin> v.inEdges.values().iterator().next().get(0).getClass()
==>class org.apache.tinkerpop.gremlin.structure.util.star.StarGraph$StarOutEdge
gremlin> filter = new org.apache.tinkerpop.gremlin.process.computer.GraphFilter()
==>graphfilter[none]
gremlin> filter.setEdgeFilter(__.bothE("self"))
==>null
gremlin> sg.applyGraphFilter(filter)
==>Optional[stargraph[starOf:v[0]]]
// NOTE: no inE results, double outE results after GraphFilter is applied
gremlin> sg.traversal().V().inE('self')
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
==>e[2][0-self->0]
gremlin>
```

There are a couple of things worth noting above:

* the `inEdges` datastructure contains a `StarOutEdge` when we construct 
the graph this way
* the final post-filtering `inE` and `outE` are incorrect (an indirect 
consequence of the first point)

These steps produce different behavior depending on whether I invoke 
`StarGraph.open()` or call `StarGraph.of()`.  If 
I do the latter, I get a `StarInEdge` in the `inEdges` map and a `StarOutEdge` 
in the `outEdges` map, and in turn avoid this particular filtering bug.

**Good news**: your change makes sense and fixes the bug I initially 
reported (both-filtering).

**Bad news**: I think there are more graph filtering bugs on StarGraph that 
can create half-edges.

When I filter a self-loop star graph for either in or out, I get stuff like 
this (symmetrical for `.inE`):

```
gremlin> filter = new org.apache.tinkerpop.gremlin.process.computer.GraphFilter()
==>graphfilter[none]
gremlin> filter.setEdgeFilter(__.outE("self"))
==>null
gremlin> sg.applyGraphFilter(filter)
==>Optional[stargraph[starOf:v[0]]]
gremlin> sg.traversal().V()
==>v[0]
gremlin> sg.traversal().V().inE('self')
gremlin> sg.traversal().V().outE('self')
==>e[2][0-self->0]
gremlin> sg.traversal().V().bothE('self')
==>e[2][0-self->0]
gremlin>
```

@okram: This doesn't look correct, but then again, I don't really know what 
it should do.  Should it completely destroy the self-loop, so that 
inE/bothE/outE all fail to emit anything associated with the self-loop?  Should 
it transform the self-loop into a single-directional edge (which seems like one 
interpretation for what it does now, in which case it might be correct)?

I noticed that the name of my original branch is nonsense ("DSP-1397").  I 
don't think that can be changed after a PR is created.  So, I'm opening a pair 
of new PRs: one for tp31 and one for master.


[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-05 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/372
  
Thanks for the replies.

@spmallette I will investigate whether this behavior also exists on 3.1 and 
retarget if so.  I think this will involve making a standalone test instead of 
just testing TP against my app and stimulating the failure that way.

@okram Yes, I realize that bothE() should return two edges.  The problem 
this PR tried to address was that, after applying an inE graph filter to a 
self-looped StarVertex, the StarVertex had two edges in its outEdges map and no 
edges in its inEdges map (should be one each).  I think this is not just some 
obscure internals problem, because I discovered it through a traversal.  Maybe 
I can articulate this better with a test.


[GitHub] tinkerpop issue #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-05 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/tinkerpop/pull/372
  
@okram also, thanks for the code; I'll have a closer look at it when I 
iterate on this PR.


[GitHub] tinkerpop pull request #372: DSP-1397 Fix StarGraph addEdge for self-edges

2016-08-03 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/tinkerpop/pull/372

DSP-1397 Fix StarGraph addEdge for self-edges

StarGraph has directionally-specific classes for its edges:

* StarInEdge
* StarOutEdge

StarIn(/Out)Edges belong only in a StarVertex's in(/out)Edges map.  We
see this assumption in StarVertex.applyGraphFilter, which invokes
`graphFilter.legalEdges(this)` and then invokes `instanceof
StarOutEdge` on each legal edge to determine its direction.

This assumption was violated by the StarVertex.addEdge method.  This
method special-cases self-edge addition.  This method first
unconditionally adds a StarOutEdge.  Then, it checks to see whether
its parameters indicate that it is adding a self-edge.  If this is so,
the method proceeds to add that same StarOutEdge instance to the
inEdges map.  This is the problem.  Adding a self-edge to a StarVertex
results in one StarOutEdge instance that belongs to both the inEdges
and outEdges map, which later causes the `applyGraphFilter` method's
`instanceof` checks to yield two out edges and no in edges on a
self-loop.
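
For reference, the direction recovery that makes this corruption visible looks
roughly like the following (a paraphrase, not the actual applyGraphFilter
source):

```
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Edge;
import org.apache.tinkerpop.gremlin.structure.util.star.StarGraph;

public final class DirectionBySubtypeSketch {
    // Hedged paraphrase: direction is inferred purely from the edge's concrete type,
    // so a StarOutEdge sitting in the inEdges map reads back as a second out-edge
    // and the corresponding in-edge is silently lost.
    static Direction directionOf(final Edge legalEdge) {
        return legalEdge instanceof StarGraph.StarOutEdge ? Direction.OUT : Direction.IN;
    }
}
```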

This commit changes StarVertex so that self-edges are represented
by a new class, StarSelfEdge.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop DSP-1397

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tinkerpop/pull/372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #372


commit dddc44369e2c645a854b2a7bed0f4af51acd6fc5
Author: Dan LaRocque 
Date:   2016-08-04T04:41:44Z

DSP-1397 Fix StarGraph addEdge for self-edges

StarGraph has directionally-specific classes for its edges:

* StarInEdge
* StarOutEdge

StarIn(/Out)Edges belong only in a StarVertex's in(/out)Edges map.  We
see this assumption in StarVertex.applyGraphFilter, which invokes
`graphFilter.legalEdges(this)` and then invokes `instanceof
StarOutEdge` on each legal edge to determine its direction.

This assumption was violated by the StarVertex.addEdge method.  This
method special-cases self-edge addition.  This method first
unconditionally adds a StarOutEdge.  Then, it checks to see whether
its parameters indicate that it is adding a self-edge.  If this is so,
the method proceeds to add that same StarOutEdge instance to the
inEdges map.  This is the problem.  Adding a self-edge to a StarVertex
results in one StarOutEdge instance that belongs to both the inEdges
and outEdges map, which later causes the `applyGraphFilter` method's
`instanceof` checks to yield two out edges and no in edges on a
self-loop.




[GitHub] tinkerpop pull request #:

2016-07-06 Thread dalaro
Github user dalaro commented on the pull request:


https://github.com/apache/tinkerpop/commit/f9706a38fbe546c0a7e5b038a6b99303d21aec01#commitcomment-18144496
  
@okram i don't understand the failure this caused, but the change seems 
innocuous to me... no objections


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Just a reminder: we are voting on branch 
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321.  That 
subsumes the PR and adds a combination of fixes from Marko and me.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
These questions involved a lot of back-and-forth that would have been slow 
to resolve on GitHub comments, so Marko and I discussed the failures and the code on 
a hangout earlier today.

Summary:

* Marko created 
https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP-1321 on the TP 
repo.  This branch is my PR plus additional tweaks Marko and I worked out on 
top.  His branch supersedes this PR branch now.  This PR branch **should not be 
merged because Marko's branch subsumes/extends it**.
* I have a rough idea of what changed as we discussed code on the hangout, 
but I will go through his branch's new commits and review.
* The new serialization code is lower-priority than the old code.  It has 
to be turned on through explicit user action, like this in the test provider 
code:
```
System.setProperty(SHIM_CLASS_SYSTEM_PROPERTY, UnshadedKryoShimService.class.getCanonicalName());
config.put("spark.serializer", KryoSerializer.class.getCanonicalName());
config.put("spark.kryo.registrator", GryoRegistrator.class.getCanonicalName());
```
* Marko set up SparkHadoopGraphProvider to randomly choose either the old 
serialization code or the new stuff in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-afffc6796745845d46e6f60ea3a992f9R91.
  IMHO that should be overrideable to make it deterministic, but it's not a 
huge deal since it's limited to test code.
* We replaced `GryoClassResolver` with a series of class registrations in 
https://github.com/apache/incubator-tinkerpop/commit/e7003635e27c625b3f30492111f20f4fe4e24eb5#diff-2e881d0a6ab7ed1ca7d1f593bceadfd2R207.
  The idea here is to explicitly configure which types need a detaching 
serializer rather than relying on `GryoClassResolver` to automagically do it 
regardless of explicit class registrations.  Three factors influence this 
choice:
   * Spark's KryoSerializer does not support a user-provided ClassResolver, 
and working around that limitation entails either getting a change through 
Spark upstream or copying and pasting a bunch of KryoSerializer's 
private/hardcoded class registrations, some of which could change in future 
Spark versions.  The latter is ugly and would make the code brittle across 
Spark upgrades.
   * `GryoClassResolver` is convenient but somewhat mysterious to third 
parties, who are likelier to have seen a custom class registration mechanism 
(like TP's `IoRegistry` or Spark's `spark.kryo.registrator`) than a custom 
resolver.  It also removes vendor control in that its detachment behavior is 
not configurable, so modifying it requires subclassing or reimplementation. For 
instance, if a vendor had a lightweight Vertex implementation with a custom 
serializer and wanted to bypass TP's resolver-driven detachment, then I don't 
think it would be possible without modifying TP's resolver.
   * `GryoClassResolver` is written against a bunch of shaded types and 
relies on `readVarInt`/`writeVarInt`, which were added in Kryo 2.22, one 
version too late for compatibility with Spark's Kryo 2.21.  This is the least 
important concern of the three, more an implementation detail.
* Marko is running the full integration test suite now.  He ran the spark 
ITs just before pushing and reported success on those specifically.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I saw the `shouldHandleMonthDay` test failure on CI and pushed a fix.  I 
carelessly broke the read leg of that serializer while doing some tedious, 
boilerplate-y method signature changes to port a batch of serializers to the 
shim to support IoRegistry unification.  The one-liner commit that I just 
pushed fixed the problem locally.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-06 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
I just pushed some changes that I hacked together this weekend.  The key 
additions are:

* `TinkerPopKryoRegistrator`, which I extracted from my app, and which acts 
as a `spark.kryo.registrator` impl that knows about TinkerPop types
* `IoRegistryAwareKryoSerializer`, which is a Spark `Serializer` that looks 
for `GryoPool.CONFIG_IO_REGISTRY` and applies it if present
* `KryoShimLoaderService.applyConfiguration(cfg)`, which replaces direct 
calls to `HadoopPools.initialize(cfg)` and adds equivalent functionality for 
initializing the unshaded Kryo serializer pool

The user would theoretically just set

```
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.IoRegistryAwareKryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.TinkerPopKryoRegistrator
# Optional, only needed for custom types
gremlin.io.registry=whatever.user.IoRegistryImpl
```

In practice, when I have a custom gremlin.io.registry, I have always had to 
take the additional step (long before this PR) of forcibly initializing 
`HadoopPools` before touching SparkGraphComputer in my app, or else some part 
of Spark -- I think the closure serializer -- would attempt to use HadoopPools 
via ObjectWritable/VertexWritable before initialization and produce garbage on 
my custom classes.  **This problem predates my PR**. I'm not trying to solve it 
here, in part because I still don't know if it's a pathology specific to my app 
or whether TinkerPop is missing a crucial `HadoopPools.initialize` (now, 
equivalently, `KryoShimLoaderService.applyConfiguration`) call somewhere, and 
in part because HadoopPools is such a hideous architectural wart that the 
ultimate solution probably involves destroying it.

In the past, I've worked around this by defining a custom spark.serializer 
that delegates newKryo() to a GryoSerializer/IoRegistryAwareSerializer, but 
which has a constructor that invokes 
`HadoopPools.initialize`/`KryoShimLoaderService.applyConfiguration` (relying on 
that method's idempotence).
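
A hedged sketch of that workaround's shape, extending Spark's stock
KryoSerializer so the snippet stays self-contained; the class name and the
initializePools stand-in are assumptions, not the actual delegate I use:

```
import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

// Illustrative only: a spark.serializer whose constructor eagerly runs pool
// initialization before any closure serialization can touch ObjectWritable/VertexWritable.
public class EagerlyInitializedSerializer extends KryoSerializer {
    public EagerlyInitializedSerializer(final SparkConf conf) {
        super(conf);
        initializePools(conf);   // relies on the initialization call's idempotence
    }

    private static void initializePools(final SparkConf conf) {
        // Hypothetical stand-in for the HadoopPools.initialize /
        // KryoShimLoaderService.applyConfiguration call described in the text.
    }

    @Override
    public Kryo newKryo() {
        return super.newKryo();  // delegate Kryo construction unchanged
    }
}
```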

Again, this initialization step may just be specific to my app and unnecessary 
for the average TinkerPop user.  It's possible that the config I pasted above 
will work for others.

FWIW, this passes, so the overrides bug should be fixed along with all this 
refactoring stuff:

```
mvn clean install -DskipTests=true && mvn verify -pl gremlin-server -DskipIntegrationTests=false -Dtest.single=GremlinResultSetIntegrateTest
```


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Additional notes from Marko:

> okay, so as long as there is never a need for MyRegistrar, then can you 
please update your PR accordingly?
> Finally, call it GryoRegistrar (not TinkerPopRegistrar) as this has to do 
with Gryo IORegistry classes.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
After discussion on Slack, I think @okram and I tentatively agreed to 
proceed with this PR after I do additional work to save users who have custom 
serializers the effort of maintaining separate `IoRegistry` and 
`spark.kryo.registrator` implementations with near-identical contents.

I may discover other complications during the process, but I think this 
involves two changes:

1. I will attempt to subclass KryoSerializer so that I can access the 
SparkConf passed to its constructor and check for 
`GryoPool.CONFIG_IO_REGISTRY` (similar to what GryoSerializer does now), then 
apply any registrations found therein so long as each registration either:

   * specifies no explicit serializer (using Kryo's internal default) or
   * specifies a shim serializer

   In particular, if the registry contains an old-style TP shaded Kryo 
serializer that hasn't been ported to the shim, then the KryoSerializer 
subclass will throw an exception (sketched after this list).

   This change is necessary to support custom-serialized, 
`IoRegistry`-listed datatypes that appear in most Spark data outside of 
closures (say, in the RDD itself).

2. I will replace current callsites of `HadoopPools.initialize(conf)` with 
some other static method that calls `HadoopPools.initialize(conf)` and then 
calls some roughly equivalent `initialize(conf)` method for the static KryoPool 
backing `KryoPoolShimService`.

   This change is necessary to support custom-serialized, 
`IoRegistry`-listed datatypes that appear in closures.
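
A hedged sketch of the acceptance rule in change 1; the shape of the loop and
the "usable with unshaded Kryo" check are stand-ins for whatever the final
shim types end up being:

```
import java.util.List;

import org.apache.tinkerpop.gremlin.structure.io.IoRegistry;
import org.apache.tinkerpop.gremlin.structure.io.gryo.GryoIo;
import org.javatuples.Pair;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;

public final class IoRegistryApplicationSketch {
    // Illustrative only: entries with no serializer, or a serializer usable with the
    // unshaded Kryo, are applied; anything written against the shaded Kryo is rejected loudly.
    static void applyRegistry(final IoRegistry registry, final Kryo kryo) {
        final List<Pair<Class, Object>> entries = registry.find(GryoIo.class);
        for (final Pair<Class, Object> entry : entries) {
            final Object serializer = entry.getValue1();
            if (serializer == null) {
                kryo.register(entry.getValue0());                       // Kryo's default serializer
            } else if (serializer instanceof Serializer) {
                kryo.register(entry.getValue0(), (Serializer) serializer);
            } else {
                throw new IllegalStateException("IoRegistry entry for " + entry.getValue0() +
                        " is not usable with the unshaded Kryo: " + serializer.getClass());
            }
        }
    }
}
```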


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
The problem is that GryoSerializer doesn't actually work.  This is not the 
first time it has shown behavior on Spark or scala runtime classes which 
diverges from the behavior of Spark's KryoSerializer/JavaSerializer.

I realize that asking users to write custom serializers against the shim 
(to be compatible with everything) in the long term or to duplicate 
registrations in the short term (if they want to use Gryo in one place and 
Spark graph computation in another, using the same custom types) is not ideal, 
but I think the status quo is even worse.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
BTW, in case you're wondering why I think it's hard to create proxies, 
recall that o.a.t.shaded.kryo.{Kryo,io.Input,io.Output} are all classes, not 
interfaces.  There are a lot of approaches that would make this work, but the 
ones I can think of suck for various reasons.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
To do that cleanly, users would have to write serializers against the shim 
interface.

Of course, the shim is only relevant to users writing new serializers or 
willing to refactor their old ones.

I see no clean way to make this work with existing serializers that users 
have hypothetically already compiled against TP's shaded Kryo.  To make a user 
serializer written for TP's shaded Kryo link against Spark's unshaded Kryo is 
technically possible with some degree of skulduggery; it's just that all of the 
ways to do it that I'm aware of are unspeakably ugly.


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-03 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
@spmallette You're right about overrides being the problem.

The last thing I did before opening this PR was to rebase on latest master, 
when I encountered conflicts solely involving the GryoMapper registration 
override feature added in d7684757734732f39970265a425b921a48d8f0fb.  I 
attempted to resolve those conflicts too hastily and got it wrong.

I went back to Stephen's original commit and reworked my code around 
registration overrides, and I now think it conforms to the old behavior 
precedent.  Specifically, when handling overrides of GryoMapper type 
registrations, I departed from the precedent set by Stephen's commit in two 
ways, both of which I've fixed locally:

* Override registrations allocated a new registration ID.  Your commit used 
the old ID from the registration being overridden.
* As a side effect of the preceding point, I incremented the shared 
`currentSerializationId` for all registration calls.  Your commit incremented 
this shared counter only for non-override registration calls.

The test passes locally for me now.  I'm still doing some registrator 
refactoring and haven't pushed yet though.

@okram The problem with custom registrations is that GryoMapper allows the 
user to provide a serializer written against shaded Kryo.  This is not 
compatible with Spark's KryoSerializer.  I don't see a way to make it 
compatible without putting fake `org.apache.tinkerpop.shaded.kryo.*` classes on 
the classpath, which could get pretty ugly.

Instead, I'm adding a registrator that includes:

* all default mappings from GryoMapper (I am changing the 
shaded-Kryo-serializers to be shim serializers so that they are reusable)
* the TinkerPop classes registered in GryoSerializer's constructors (but 
not the scala and Spark classes registered in there)

If the user wants to register custom types while using KryoSerializer, they 
can extend the registrator in a subclass, then set `spark.kryo.registrator` to 
the subclass.  The interface in question is `KryoRegistrator` and it has only 
one method, `registerClasses(Kryo)`.  It's about as simple as -- maybe even 
simpler than -- implementing TP's `IoRegistry` interface, which has two methods.
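
A hedged sketch of that extension point; `GryoRegistrator` here follows the
name used elsewhere in the thread, and the application type is hypothetical:

```
import com.esotericsoftware.kryo.Kryo;
import org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator;

// Illustrative only: keep TinkerPop's registrations and append the application's own types.
public class MyAppRegistrator extends GryoRegistrator {
    @Override
    public void registerClasses(final Kryo kryo) {
        super.registerClasses(kryo);     // all TinkerPop default registrations
        kryo.register(MyAppType.class);  // hypothetical application-specific type
    }
}
```

The user would then point `spark.kryo.registrator` at `MyAppRegistrator`.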

If the user decides to continue using GryoSerializer, they can also 
continue using `GryoPool.CONFIG_IO_REGISTRY`, which should still affect 
GryoSerializer/HadoopPools as it always has.

WDYT?


[GitHub] incubator-tinkerpop issue #325: TINKERPOP-1321 Introduce Kryo shim to suppor...

2016-06-02 Thread dalaro
Github user dalaro commented on the issue:

https://github.com/apache/incubator-tinkerpop/pull/325
  
Thanks for the comments and for running the tests.

@okram:

1. I also dislike HadoopPools and would prefer to remove it, but I don't 
know how to do it correctly, even after considering the problem for a while.  
Here are my open questions and observations.

   HadoopPools apparently exists to help Vertex/ObjectWritable implement 
Serializable.  They must implement Serializable because they keep showing up in 
`spark.closure.serializer`'s workload, and only JavaSerializer can be used as 
`spark.closure.serializer`.  A natural question here: is it reasonable and 
normal for graph types to appear in these closures in the first place?

   Spark has this nifty `spark.kryo.registrator` mechanism to support 
user-registered types with Kryo, and its docs talk at length about setting 
`spark.serializer`.   The docs read as though that was all the user would need 
to customize to deal with custom data.  `spark.closure.serializer` seems to 
barely be mentioned.  I've only seen it in the manual's exhaustive table of 
config options (http://spark.apache.org/docs/latest/configuration.html).  Even 
then, the docs merely comment that JavaSerializer is its only supported value.  
There's some old list traffic that says KryoSerializer is unsupported because, 
back when it was originally tested, it blew up on pretty much any Java closure, 
though that might have changed since.

   I wonder if this indicates some kind of usage antipattern.  Could there 
be an alternate approach to traversal/vertexprogram evaluation in 
SparkGraphComputer/SparkExecutor that avoids putting graph data into Spark 
closures?  That would make the use-case for HadoopPools disappear.  Then again, 
it seems so easy to get into this situation with `rdd.map(x -> whatever(x))`.  
I am full of uncertainty on this point.  I don't have deep enough Spark 
knowledge to judge whether what we're doing now is reasonable.

   As long as Object/VertexWritable keep popping up in closures created by 
SparkGraphComputer & friends, we need some way to serialize them and everything 
they refer to, which can apparently involve a lot of graph entities.  We could 
theoretically implement Serializable "manually" (without Kryo/Gryo), but that 
seems like a massive amount of work, potentially inefficient, definitely 
redundant, and adds a huge new bug surface.  I'm against that approach, but 
that's about the only thing I know for certain here.

2. In the long term, I think users should set

   ```
spark.serializer=KryoSerializer
spark.kryo.registrator=com.tinkerpop.something.TinkerPopRegistrator
```
   and TinkerPop should stop registering/caring about Spark and 
scala-runtime types in its serialization layer.  This PR tries to keep 
GryoSerializer untouched though.  That migration doesn't have to be forced.

  I'm already using the `spark.kryo.registrator` setting in my app, where I 
have an impl that registers all TinkerPop types plus all of my app's types.  I 
can refactor that by extracting a TinkerPopRegistrator that covers the TP types 
(but not my app's).  I probably should have done that at the start.  As it 
stands, this PR is sort of a "build-your-own-alternative-to-GryoSerializer" 
kit, but the very last assembly steps involving a `spark.kryo.registrator` impl 
are left as an exercise for the user.  I'll try to introduce a 
TinkerPopRegistrator to address that, though I may reconsider if supporting 
custom type registration turns out to be way more complicated than I'm imagining.

  I think Gryo more generally is useful and can stay around indefinitely 
(say, for reading/writing data to the filesystem), but GryoSerializer should 
eventually stop being the recommended spark.serializer.  If that is all 
GryoSerializer per se exists to do, then it could be removed and any 
TinkerPop-specific class registrations that it performs moved to 
TinkerPopRegistrator.

3.  This PR should maintain full format compatibility.  I deliberately 
tried to keep all of GryoMapper and GryoSerializer's class registrations, class 
IDs, and associated serializers (including StarGraphSerializer) functionally 
unchanged.  If you see somewhere that I unintentionally changed the format, 
that's a bug.

@spmallette:

re. package organization: definitely makes sense, I'll move kryoshim under 
io.gryo.  Please let me know if I bungled the package hierarchy anywhere else.

re. path forward: I think 1 and 2 above touch on both HadoopPools and 
GryoSerializer.


[GitHub] incubator-tinkerpop pull request #325: TINKERPOP-1321 Introduce Kryo shim to...

2016-06-02 Thread dalaro
GitHub user dalaro opened a pull request:

https://github.com/apache/incubator-tinkerpop/pull/325

TINKERPOP-1321 Introduce Kryo shim to support serializer reuse

https://issues.apache.org/jira/browse/TINKERPOP-1321

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dalaro/incubator-tinkerpop 
serialization-flexibility

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-tinkerpop/pull/325.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #325


commit ef52869788ebf5b8b825f78ef21e1d38423d9aa0
Author: Dan LaRocque 
Date:   2016-06-02T07:09:29Z

Introduce Kryo shim to support serializer reuse



