[jira] [Commented] (ATLAS-1868) Highly inefficient DSL-queries

Graham Wallis (JIRA) Wed, 14 Jun 2017 06:26:29 -0700

    [ 
https://issues.apache.org/jira/browse/ATLAS-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049176#comment-16049176
 ]


Graham Wallis commented on ATLAS-1868:
--------------------------------------

Hi Christian,

This is a really interesting observation. It clearly makes sense to start the 
traversal at the indexed vertex type.

I decided to experiment with Atlas 0.9 and then did some experimentation with 
Titan 1.0 (directly, without Atlas). I then experimented with declarative-style 
gremlin queries and finally explored the Titan index APIs to see if I could 
think of a way to implement an automated optimization. The results were as 
follows:

h2. Experiment with Atlas 0.9
I tried the DSL query below with Atlas 0.9 and it produced the following 
gremlin - which I think is similar in structure to Christian's example (from 
Atlas 0.7). It uses a local variable to collate results but it still starts the 
traversal at the non-indexed end of the edge, instead of reversing it.

DSL query: 
Table where columns.name="customer_id"

Translated Gremlin Query: 
{
  def r=(([]) as Set);
  def f1={
    GremlinPipeline 
x->x.as('a0').out('__Table.columns').has('Column.name',T.'eq','customer_id').
      back('a0').as('__res') [0..<25].select(['a0', '__res']).fill(r)
    };
  f1(g.V().has('__typeName','Table'));
  f1(g.V().has('__superTypeNames','Table'));
  r._().as('__tmp').
    transform({((Row)it).getColumn('a0')}).as('a0').back('__tmp').
    transform({((Row)it).getColumn('__res')}).as('__res') [0..<25].toList()
}

This translation was on Atlas 0.9 using Titan 0.5.4. I did not try it with 
Titan 1.0 because there are parts of Atlas that either do not support Java 8 
yet or have hard dependencies on Tinkerpop 2.

h2. Experiments directly on Titan 1.0
I decided to do some exploration on Titan 1.0 - because I wanted to see if 
Titan's gremlin optimization would handle this automatically. The answer is it 
doesn't seem to. I populated a Titan 1.0 graph with 100K 'Person' vertices that 
each have one 'owns' edge to one of 100K 'Asset' vertices. i.e. there are 100K 
triples of Person->owns->Asset. I created a composite index on Asset.assetid, 
with indexOnly() of vertices with label 'Asset' on the basis that different 
vertex types may have properties with similar keys, so we would probably need a 
label-specific index).

conf=new BaseConfiguration()
graph = TitanFactory.open('../conf/titan-berkeleyje.properties')

mgmt=graph.openManagement()
assetid = mgmt.makePropertyKey('assetid').dataType(String.class).make()
asset=mgmt.makeVertexLabel('Asset').make()
mgmt.buildIndex('assetById',Vertex.class).addKey(assetid).indexOnly(asset).buildCompositeIndex()
mgmt.commit()


for (i in 1..100000) {
  p=graph.addVertex(label,'Person','name',i)
  a=graph.addVertex(label,'Asset','assetid','asset-num-'+i)
  p.addEdge('owns',a)
}
graph.tx().commit()

I could then issue queries like the following:

// 'Forward' query
t=System.currentTimeMillis();g.V().hasLabel('Person').out('owns').hasLabel('Asset').has('assetid','asset-num-7').iterate();System.currentTimeMillis()-t

// 'Reversed' query
t=System.currentTimeMillis();g.V().hasLabel('Asset').has('assetid','asset-num-7').in('owns').hasLabel('Person').iterate();System.currentTimeMillis()-t


Results for DSL-style gremlin versus reversed traversal: 2773ms versus 2ms, 
i.e. ~ x1300 faster when reversed. This result compares closely with 
Christian's results in which he saw ~ x1100 faster performance.

h2. Experiments with declarative queries
I experimented to see if I could get Titan's gremlin optimizer to translate the 
query into the obviously more efficient traversal, following Christian's 
example. Hoewever, even a declarative query, with the sub-traversals in any 
order, is not optimized:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('a').out('owns').as('b')).
......4>  select('a','b').by('name').by('assetid')
04:20:08,513  WARN StandardTitanTx:1262 - Query requires iterating over all 
vertices [(~label = Person)]. For better performance, use indexes
==>[a:38,b:asset-num-38]

You have to explicitly reverse the edge traversal to get good performance, as 
in the following:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('b').in('owns').as('a')).
......4>  select('a','b').by('name').by('assetid')
==>[a:38,b:asset-num-38]


h2. Exploration of Titan's index APIs
If Titan cannot perform this optimization, it looks as if Atlas would have to 
implement it, so I looked at Titan's TitanGraphIndex APIs to see how much Atlas 
would be able to derive generically, but I'm currently of the opinion that the 
Titan APIs do not provide enough to support a fully programmatic query 
optimization in Atlas. The alternative appears to be Atlas capitalizing on its 
own knowledge of the graph schema, indexes and stats. I have not explored that 
area yet.



> Highly inefficient DSL-queries
> ------------------------------
>
>                 Key: ATLAS-1868
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1868
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 0.7-incubating
>         Environment: linux, hbase + solr configuration.
>            Reporter: Christian R
>              Labels: dsl, gremlin
>
> The DSL query 'mytype where property.id = "id1"' appears to be rewritten as a 
> gremlin query that resembles:
> g.V.has(typename, 'mytype'¨).as(x).out('property').has('id', 'id1').back('x')
> On our system this query takes 6-7 minutes. The query
> g.V.has('id', 'id1').in('property').has('typename', 'mytype')
> takes 350 milliseconds.
> Our graph:
> g.V.count() = 1359151
> We have atlas 0.7 installed. I've compiled the latest 0.9 code and looked at 
> the generated gremlin query as reported in the logs for the same DSL-query, 
> and I think 0.9 has the same performance issues. Unfortunately I don't have a 
> big graph on a 0.9 installation to test performance. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (ATLAS-1868) Highly inefficient DSL-queries

Reply via email to