[jira] [Commented] (TINKERPOP-1335) OLAP queries potentially fail for certain match()/select() query patterns

2016-06-13 Thread Marko A. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/TINKERPOP-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327464#comment-15327464
 ] 

Marko A. Rodriguez commented on TINKERPOP-1335:
---

I can also confirm that this only happens with {{SparkGraphComputer}}. Both 
{{TinkerGraphComputer}} and {{GiraphGraphComputer}} work as expected.

> OLAP queries potentially fail for certain match()/select() query patterns
> -
>
> Key: TINKERPOP-1335
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1335
> Project: TinkerPop
> Issue Type: Bug
> Components: hadoop, process
> Affects Versions: 3.2.0-incubating
> Reporter: Daniel Kuppitz
> Assignee: Marko A. Rodriguez
>
> There are certain queries that return wrong results when executed via 
> {{SparkGraphComputer}}. After testing a few queries I would say that the 
> problematic query pattern is a {{match()}} / {{select()}} combo.
> For example (Grateful Dead graph):
> {code}
> gremlin> g.V().hasLabel("song").match(
>            __.as("a").values("name").as("name"),
>            __.as("a").values("performances").as("performances")
>          ).select("name","performances").count()
> ==>0
> {code}
> If {{count()}} is replaced by {{program()}}, the whole thing is going to 
> throw exceptions. However, if we select {{a}} instead of {{name}} and 
> {{performances}}, we get the correct result. Likewise, if we remove the 
> {{select()}} or just rewrite the {{match()}} part, everything works as 
> expected. The simplest query to reproduce the erroneous behavior is this one:
> {code}
> g.V().match(__.as("a").values("name").as("name")).select("name").count()
> {code}
> The tests were done using a real Spark Server. I didn't try to use Spark in 
> local mode or Giraph. I did try {{TinkerGraphComputer}}, which worked fine.
> Here's an actual stacktrace that shows where to find the root of all evil:
> {noformat}
> ERROR 2016-06-09 21:24:25,988 Logging.scala:95 - org.apache.spark.executor.Executor: Exception in task 0.2 in stage 119.0 (TID 307)
> java.lang.IllegalStateException: The host of the object is unknown: {a=v[{~label=Comment, member_id=2034, community_id=1676454656}], content=ok, length=2}:java.util.LinkedHashMap
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.getHostingVertex(WorkerExecutor.java:242) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$262(WorkerExecutor.java:220) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$113/1202183304.accept(Unknown Source) ~[na:na]
>     at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_40]
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:215) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:146) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:285) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$9(SparkExecutor.java:111) ~[spark-gremlin-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$92/910806192.apply(Unknown Source) ~[na:na]
>     at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247) ~[gremlin-core-3.2.1-20160601-aa673db1.jar:3.2.1-20160601-aa673db1]
>     at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:42) ~[scala-library-2.10.6.jar:na]
>     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) ~[scala-library-2.10.6.jar:na]
>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library-2.10.6.jar:na]
>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189) ~[spark-core_2.10-1.6.1.2.jar:1.6.1.2]
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64) ~[spark-core_2.10-1.6.1.2.jar:1.6.1.2]
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) ~[spark-core_2.10-1.6.1.2.jar:1.6.1.2]
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-core_2.10-1.6.1.2.jar:1.6.1.2]
> at 

[jira] [Commented] (TINKERPOP-1335) OLAP queries potentially fail for certain match()/select() query patterns

2016-06-13 Thread Marko A. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/TINKERPOP-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327452#comment-15327452
 ] 

Marko A. Rodriguez commented on TINKERPOP-1335:
---

Note that I just confirmed via a test case with {{GryoInputFormat}} that the 
wrong answer is produced by {{SparkGraphComputer}}. This is good, as we can now 
isolate the problem to TinkerPop alone and test it in our test suite without 
needing Spark Server infrastructure.


Move Hadoop-Gremlin docs to Provider Docs

2016-06-13 Thread Stephen Mallette
Is there any reason this section is in the reference documentation?

http://tinkerpop.apache.org/docs/current/reference/#_hadoop_gremlin_for_graph_system_providers

It seems like it belongs somewhere here:

http://tinkerpop.apache.org/docs/current/dev/provider/


[jira] [Closed] (TINKERPOP-1144) Improve ScriptElementFactory

2016-06-13 Thread Daniel Kuppitz (JIRA)

 [ 
https://issues.apache.org/jira/browse/TINKERPOP-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Kuppitz closed TINKERPOP-1144.
-
Resolution: Fixed

Not sure why this ticket was still open; it was done a long time ago, probably 
as part of another ticket.

> Improve ScriptElementFactory
> 
>
> Key: TINKERPOP-1144
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1144
> Project: TinkerPop
> Issue Type: Improvement
> Components: process
> Reporter: Daniel Kuppitz
> Assignee: Daniel Kuppitz
> Fix For: 3.2.1
>
>
> From https://github.com/apache/incubator-tinkerpop/pull/219:
> * Deprecate {{ScriptElementFactory}}.
> * Update to use {{StarGraph}} directly, but if the user's script has a method 
> for {{ScriptElementFactory}}, then use that method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TINKERPOP-1337) Provide an "add jar" endpoint

2016-06-13 Thread Daniel Kuppitz (JIRA)

 [ 
https://issues.apache.org/jira/browse/TINKERPOP-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Kuppitz updated TINKERPOP-1337:
--
Description: 
Gremlin Server should provide something (an endpoint?) that allows the user to 
add new jar files, without the need to restart the server.

We've talked about it before, but I thought it might be a good idea to have a 
ticket where we collect some thoughts.

One particular problem we've talked about is this: What if someone wants to 
update a previously loaded jar? The initial loading of a new jar file seems to 
be a smaller problem; unloading a jar file to update it with a newer version 
seems to be a real problem. I would say we simply shouldn't support that. I've 
looked into other projects (e.g. Hive) and there're ways to load new jars, but 
not to unload them later. If you really need to get rid of a previously loaded 
jar, then you'll have to restart the server / JVM.

Another problem I see is distributed environments, where you have multiple 
Gremlin Servers running (none knows about the existence of the others) that 
receive requests in a round-robin fashion. I don't have a good idea on how to handle 
this problem, but a first step in the right direction may be to allow uploads 
of jar files to distributed file systems. Perhaps Gremlin Server instances 
could then monitor the contents of a predefined directory within the DFS.

  was:
Gremlin Server should provide something (an endpoint?) that allows the user to 
add new jar files, without the need to restart the server.

We've talked about it before, but I thought it might be good idea to have a 
ticket where we collect some thoughts.

One particular problem we've talked about is this: What if someone wants to 
update a previously loaded jar? The initial loading of a new jar file seems to 
be a smaller problem; unloading a jar file to update it with a newer version 
seems to be a real problem. I would say we simply shouldn't support that. I've 
looked into other projects (e.g. Hive) and there're ways to load new jars, but 
not to unload them later. If you really need to get rid of a previously loaded 
jar, then you'll have to restart the server / JVM.


> Provide an "add jar" endpoint
> -
>
> Key: TINKERPOP-1337
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1337
> Project: TinkerPop
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.2.0-incubating
> Reporter: Daniel Kuppitz
>
> Gremlin Server should provide something (an endpoint?) that allows the user 
> to add new jar files, without the need to restart the server.
> We've talked about it before, but I thought it might be a good idea to have a 
> ticket where we collect some thoughts.
> One particular problem we've talked about is this: What if someone wants to 
> update a previously loaded jar? The initial loading of a new jar file seems 
> to be a smaller problem; unloading a jar file to update it with a newer 
> version seems to be a real problem. I would say we simply shouldn't support 
> that. I've looked into other projects (e.g. Hive) and there're ways to load 
> new jars, but not to unload them later. If you really need to get rid of a 
> previously loaded jar, then you'll have to restart the server / JVM.
> Another problem I see is distributed environments, where you have multiple 
> Gremlin Servers running (none knows about the existence of the others) that 
> receive requests in a round-robin fashion. I don't have a good idea on how to 
> handle this problem, but a first step in the right direction may be to allow 
> uploads of jar files to distributed file systems. Perhaps Gremlin Server 
> instances could then monitor the contents of a predefined directory within 
> the DFS.
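
A minimal sketch of the directory-monitoring idea from the last paragraph, under stated assumptions: `JarDirectoryPoller` and its `lister` argument are hypothetical names for illustration, not part of Gremlin Server, and the listing call is abstracted behind a `Supplier` so that any file system client (local or distributed) could provide it. Each server instance would call `poll()` periodically and load whatever jar names come back.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper: reports jar names that appeared in a shared
// directory since the previous poll. Names that vanish are ignored,
// matching the "load only, never unload" position in the ticket.
public class JarDirectoryPoller {
    private final Supplier<Collection<String>> lister; // assumed listing callback
    private final Set<String> seen = new HashSet<>();

    public JarDirectoryPoller(Supplier<Collection<String>> lister) {
        this.lister = lister;
    }

    // Returns the jar names not seen by any earlier poll, sorted by name.
    public synchronized List<String> poll() {
        List<String> fresh = new ArrayList<>();
        for (String name : lister.get()) {
            // Only consider jar files, and only report each one once.
            if (name.endsWith(".jar") && seen.add(name)) {
                fresh.add(name);
            }
        }
        Collections.sort(fresh);
        return fresh;
    }
}
```

Because every server polls the same directory independently, the servers would not need to know about each other; the trade-off is a delay of up to one polling interval before all instances have picked up a new jar.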





[jira] [Created] (TINKERPOP-1337) Provide an "add jar" endpoint

2016-06-13 Thread Daniel Kuppitz (JIRA)
Daniel Kuppitz created TINKERPOP-1337:
-

 Summary: Provide an "add jar" endpoint
 Key: TINKERPOP-1337
 URL: https://issues.apache.org/jira/browse/TINKERPOP-1337
 Project: TinkerPop
 Issue Type: Improvement
 Components: server
 Affects Versions: 3.2.0-incubating
 Reporter: Daniel Kuppitz


Gremlin Server should provide something (an endpoint?) that allows the user to 
add new jar files, without the need to restart the server.

We've talked about it before, but I thought it might be good idea to have a 
ticket where we collect some thoughts.

One particular problem we've talked about is this: What if someone wants to 
update a previously loaded jar? The initial loading of a new jar file seems to 
be a smaller problem; unloading a jar file to update it with a newer version 
seems to be a real problem. I would say we simply shouldn't support that. I've 
looked into other projects (e.g. Hive) and there're ways to load new jars, but 
not to unload them later. If you really need to get rid of a previously loaded 
jar, then you'll have to restart the server / JVM.
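
To make the "load but never unload" position concrete, here is a minimal sketch in plain Java. It is not Gremlin Server code: `AddOnlyJarLoader` and `addJar` are hypothetical names, and the only library behavior relied on is that `java.net.URLClassLoader` exposes a protected `addURL` method for appending to its search path and offers no counterpart for removing an entry.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class loader that can grow at runtime but never shrinks.
public class AddOnlyJarLoader extends URLClassLoader {
    private final Set<Path> loaded = new HashSet<>();

    public AddOnlyJarLoader(ClassLoader parent) {
        super(new URL[0], parent); // start with an empty search path
    }

    // Appends a jar to the search path. Returns false if the same path
    // was already added; there is deliberately no removeJar().
    public synchronized boolean addJar(Path jar) {
        Path normalized = jar.toAbsolutePath().normalize();
        if (!loaded.add(normalized)) {
            return false;
        }
        try {
            addURL(normalized.toUri().toURL());
        } catch (MalformedURLException e) {
            loaded.remove(normalized);
            throw new IllegalArgumentException("Not a usable jar path: " + jar, e);
        }
        return true;
    }
}
```

With this shape, replacing a jar by a newer version still means discarding the loader (in practice, restarting the server / JVM), which is the limitation described above.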





[jira] [Commented] (TINKERPOP-1331) HADOOP_GREMLIN_LIBS can only point to local file system

2016-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TINKERPOP-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327012#comment-15327012
 ] 

ASF GitHub Bot commented on TINKERPOP-1331:
---

Github user dkuppitz commented on the issue:

https://github.com/apache/tinkerpop/pull/334
  
The failing Travis test run contains a massive number of exceptions in 
`RemoteGraphProcessComputerTest`, which ultimately led to:

```
The log length has exceeded the limit of 4 MB (this usually means that the 
test suite is raising the same exception over and over).

The job has been terminated
```

I don't think that's in any way related to my PR.


> HADOOP_GREMLIN_LIBS can only point to local file system
> ---
>
> Key: TINKERPOP-1331
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1331
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop
> Affects Versions: 3.2.0-incubating, 3.1.2-incubating
> Reporter: Daniel Kuppitz
> Assignee: Daniel Kuppitz
>
> These two lines in {{SparkGraphComputer}} assume that {{HADOOP_GREMLIN_LIBS}} 
> will only contain local file system references (although it seems that the 
> rest of the code could handle DFS references):
> {code}
> final String[] paths = hadoopGremlinLocalLibs.split(":");
> final FileSystem fs = FileSystem.get(hadoopConfiguration);
> {code}
> If, for example, {{HADOOP_GREMLIN_LIBS}} were set to 
> {{hdfs:///spark-gremlin-libs:/foo/bar}}, the {{split(":")}} call would 
> separate the file system scheme ({{hdfs://}}) from the path 
> ({{/spark-gremlin-libs}}).
> Next, {{FileSystem.get(hadoopConfiguration)}} always returns a reference to 
> the {{FileSystem}} that is defined as the default file system.
> The same is probably true for {{GiraphGraphComputer}}.
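
The splitting problem described above can be reproduced with the JDK alone. The sketch below is an illustration and not the change made in the pull request: `LibPathSplitter`, `splitLibPaths` and `schemeOf` are hypothetical names, and the separator rule (a colon is a separator unless it is followed by `//` or by a port number and a slash) is an assumption about how entries are written.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class LibPathSplitter {
    // Hypothetical rule: ':' separates entries unless it is followed by
    // "//" (end of a scheme) or by "<digits>/" (a host:port pair).
    static List<String> splitLibPaths(String value) {
        return Arrays.asList(value.split(":(?!//|\\d+/)"));
    }

    // Returns the URI scheme of an entry, or "(default)" when the entry
    // is a plain path that would fall back to the default file system.
    static String schemeOf(String entry) {
        String scheme = URI.create(entry).getScheme();
        return scheme == null ? "(default)" : scheme;
    }

    public static void main(String[] args) {
        String libs = "hdfs:///spark-gremlin-libs:/foo/bar";

        // Naive split, as in the two quoted lines: the scheme is cut off.
        // prints [hdfs, ///spark-gremlin-libs, /foo/bar]
        System.out.println(Arrays.toString(libs.split(":")));

        // Scheme-aware split keeps each entry intact.
        for (String entry : splitLibPaths(libs)) {
            System.out.println(schemeOf(entry) + " -> " + entry);
        }
    }
}
```

Once each entry keeps its scheme, the file system can be looked up per entry instead of once for the whole variable; in Hadoop, `Path.getFileSystem(Configuration)` and `FileSystem.get(URI, Configuration)` resolve the file system from the URI, whereas `FileSystem.get(Configuration)` returns the default one. Whether the eventual fix takes this route is not stated in this thread.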





[GitHub] tinkerpop issue #334: TINKERPOP-1331 HADOOP_GREMLIN_LIBS can only point to l...

2016-06-13 Thread dkuppitz
Github user dkuppitz commented on the issue:

https://github.com/apache/tinkerpop/pull/334
  
The failing Travis test run contains a massive number of exceptions in 
`RemoteGraphProcessComputerTest`, which ultimately led to:

```
The log length has exceeded the limit of 4 MB (this usually means that the 
test suite is raising the same exception over and over).

The job has been terminated
```

I don't think that's in any way related to my PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---