[arangodb-google] Re: Faceted Search Performance

2017-09-20 Thread Roman Kuzmik
done

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-20 Thread Jan
Thanks for the feedback!
Sounds good. I also hope the changes can be integrated into a stable 
version soon.

Best regards
Jan

Am Mittwoch, 20. September 2017 00:54:44 UTC+2 schrieb Roman Kuzmik:
>
> Thanks for a hint!
>
> We have wrote small service faced to calculate facets.
> It split my huge AQL provided above into 5 queries:
>
>- main - to filter, sort and retrieve matching entities:
>   - 
>   
>   LET docs = (FOR a IN Asset
>   
>FILTER a.name like 'test-asset-%'
>   
>SORT a.name
>   
>   RETURN a)
>   
>   RETURN {
>   
>counts: (RETURN {
>   
>  total: LENGTH(docs),
>   
>  offset: 0,
>   
>  to: 5
>   
>}),
>   
>items: (FOR a IN docs LIMIT 0, 5 RETURN a)
>   
>   }
>   
>   - 4 small ones to purely calculate facets:
>   - 
>   
>   LET docs = (FOR a IN Asset
>   
>FILTER a.name like 'test-asset-%'
>   
>   RETURN a)
>   
>   LET attributeX = (
>   
>   FOR a in docs
>   
>COLLECT attr = a.attributeX WITH COUNT INTO length
>   
>   RETURN { value: attr, count: length}
>   
>   )
>   
>   RETURN {
>   
>counts: (RETURN {
>   
>  total: LENGTH(docs),
>   
>  offset: 0,
>   
>  to: -1,
>   
>  facets: {
>   
>attributeX: {
>   
>  from: 0,
>   
>  to: 1000,
>   
>  total: LENGTH(attributeX)
>   
>}
>   
>  }
>   
>}),
>   
>facets: {
>   
>  attributeX: (FOR a in attributeX LIMIT 0, 1000 return a)
>   
>}
>   
>   }
>   
> We run these using Java's 8 Fork/Join and basically execute Map(split into 
> sub-queries)/Reduce(merge results) potentially against ArangoDB cluster.
> We run custom ArangoDB build from Jan's feature branch. And results are 
> pretty impressive.
> Remember, we have started with 140 secs for the full AQL and now we are 
> down to 11 seconds (with sort/filter + skiplist on name) or 4 seconds (w/o 
> soft/filter and 4 facets).
>
> I think, we are satisfied for now :-) and hope this PR will make it to 
> 'master'.
>
> Also, would be awesome to see if AQL-pipeline will be advanced in the 
> future to accomodate more analytical type of queries (facets with 
> sub-facets, ElasticSearch style of sub-grouping, etc)
>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-19 Thread Roman Kuzmik
Thanks for a hint!

We have wrote small service faced to calculate facets.
It split my huge AQL provided above into 5 queries:

   - main - to filter, sort and retrieve matching entities:
  - 
  
  LET docs = (FOR a IN Asset
  
   FILTER a.name like 'test-asset-%'
  
   SORT a.name
  
  RETURN a)
  
  RETURN {
  
   counts: (RETURN {
  
 total: LENGTH(docs),
  
 offset: 0,
  
 to: 5
  
   }),
  
   items: (FOR a IN docs LIMIT 0, 5 RETURN a)
  
  }
  
  - 4 small ones to purely calculate facets:
  - 
  
  LET docs = (FOR a IN Asset
  
   FILTER a.name like 'test-asset-%'
  
  RETURN a)
  
  LET attributeX = (
  
  FOR a in docs
  
   COLLECT attr = a.attributeX WITH COUNT INTO length
  
  RETURN { value: attr, count: length}
  
  )
  
  RETURN {
  
   counts: (RETURN {
  
 total: LENGTH(docs),
  
 offset: 0,
  
 to: -1,
  
 facets: {
  
   attributeX: {
  
 from: 0,
  
 to: 1000,
  
 total: LENGTH(attributeX)
  
   }
  
 }
  
   }),
  
   facets: {
  
 attributeX: (FOR a in attributeX LIMIT 0, 1000 return a)
  
   }
  
  }
  
We run these using Java's 8 Fork/Join and basically execute Map(split into 
sub-queries)/Reduce(merge results) potentially against ArangoDB cluster.
We run custom ArangoDB build from Jan's feature branch. And results are 
pretty impressive.
Remember, we have started with 140 secs for the full AQL and now we are 
down to 11 seconds (with sort/filter + skiplist on name) or 4 seconds (w/o 
soft/filter and 4 facets).

I think, we are satisfied for now :-) and hope this PR will make it to 
'master'.

Also, would be awesome to see if AQL-pipeline will be advanced in the 
future to accomodate more analytical type of queries (facets with 
sub-facets, ElasticSearch style of sub-grouping, etc)

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-18 Thread Jan
Hi,

yes, it will scan the collection 4 times with the query below, once for 
each subquery.
Short-term I do not see any way to speed that up with existing AQL.
Long-term a few ways to handle this would be to parallelize the scanning or 
to collect and count the same input by multiple criteria with a single 
COLLECT clause. 
Parallelizing it would still mean 4 scans, but in parallel. That could 
reduce latency quite a bit. Collecting and counting the same input by 
multiple criteria looks a bit more attractive, but does not really fit into 
the current way the AQL pipelines are set up.

Another alternative is to split the query on the client side and run 4 
individual AQL queries, collect their results and aggregate them on the 
client side. That will work, but requires having quite some more logic in 
the client/application code.

Best regards
Jan 


Am Montag, 18. September 2017 22:50:20 UTC+2 schrieb Roman Kuzmik:
>
> 4 seconds per facet, thus adding 3 more it takes us to 16 seconds.
> btw, why is that, arango is doing full scan anyways. is it doing it 4 
> times with the query bellow? Is there any way to make it smarter?
>
>  LET docs = (FOR a IN Asset 
>  RETURN a)
> LET attribute1 = (
>  FOR a in docs 
>   COLLECT attr = a.attribute1 WITH COUNT INTO length
>  RETURN { value: attr, count: length}
> )
> LET attribute2 = (
>  FOR a in docs 
>   COLLECT attr = a.attribute2 WITH COUNT INTO length
>  RETURN { value: attr, count: length}
> )
> LET attribute3 = (
>  FOR a in docs 
>   COLLECT attr = a.attribute3 WITH COUNT INTO length
>  RETURN { value: attr, count: length}
> )
> LET attribute4 = (
>  FOR a in docs 
>   COLLECT attr = a.attribute4 WITH COUNT INTO length
>  RETURN { value: attr, count: length}
> )
> RETURN {
>   counts: (RETURN {
> total: LENGTH(docs), 
> offset: 2, 
> to: 4, 
> facets: {
>   attribute1: {
> from: 0, 
> to: 5,
> total: LENGTH(attribute1)
>   },
>   attribute2: {
> from: 5, 
> to: 10,
> total: LENGTH(attribute2)
>   },
>   attribute3: {
> from: 0, 
> to: 1000,
> total: LENGTH(attribute3)
>   },
>   attribute4: {
> from: 0, 
> to: 1000,
> total: LENGTH(attribute4)
>   }
> }
>   }),
>   items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
>   facets: {
> attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
> attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
> attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
> attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 
> return a)
>}
> }
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-18 Thread Roman Kuzmik
4 seconds per facet, thus adding 3 more it takes us to 16 seconds.
btw, why is that, arango is doing full scan anyways. is it doing it 4 times 
with the query bellow? Is there any way to make it smarter?

 LET docs = (FOR a IN Asset 
 RETURN a)
LET attribute1 = (
 FOR a in docs 
  COLLECT attr = a.attribute1 WITH COUNT INTO length
 RETURN { value: attr, count: length}
)
LET attribute2 = (
 FOR a in docs 
  COLLECT attr = a.attribute2 WITH COUNT INTO length
 RETURN { value: attr, count: length}
)
LET attribute3 = (
 FOR a in docs 
  COLLECT attr = a.attribute3 WITH COUNT INTO length
 RETURN { value: attr, count: length}
)
LET attribute4 = (
 FOR a in docs 
  COLLECT attr = a.attribute4 WITH COUNT INTO length
 RETURN { value: attr, count: length}
)
RETURN {
  counts: (RETURN {
total: LENGTH(docs), 
offset: 2, 
to: 4, 
facets: {
  attribute1: {
from: 0, 
to: 5,
total: LENGTH(attribute1)
  },
  attribute2: {
from: 5, 
to: 10,
total: LENGTH(attribute2)
  },
  attribute3: {
from: 0, 
to: 1000,
total: LENGTH(attribute3)
  },
  attribute4: {
from: 0, 
to: 1000,
total: LENGTH(attribute4)
  }
}
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
  facets: {
attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 
return a)
   }
}


-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-18 Thread Roman Kuzmik
compiled your changes from feature/mmfiles-hash-lookup-performance
indeed, on single facet we are down to 4 seconds from 6 seconds (in the 
test case provided above). And no indexes needed.

hope it will make to master soon.

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-15 Thread Jan
Hi,

I have profiled the execution of a single-facet query with the MMFiles 
engine this morning and I think we will be able to reduce the execution 
time quite a bit.
In my local tests, I have seen improvements of around 40% in case there is 
a full collection scan (no indexes) and there is a collect/count on a 
single attribute. 
Here is the PR with the changes for reference: 
https://github.com/arangodb/arangodb/pull/3265

However, even with those changes applied (currently they are contained in 
the PR only) the combined execution time for multiple facet queries will 
not get below one second.
But at least it would be a step in the right direction, and other queries 
in MMFiles should also benefit from this.

Best regards
Jan


Am Freitag, 15. September 2017 05:51:32 UTC+2 schrieb Roman Kuzmik:
>
> Btw, Fyi, first query (AKA: LENGTH(g)) with an index on attribute1 runs 
> almost same as second query (AKA: WITH COUNT).
> Here, 2nd query with an index takes 4.4 seconds.
> But it is still, just one facet. Usually you need a bunch, like in my 
> "long" query in very first post.
> Let me re-write it using WITH COUNT and create an index on each facet 
> will see how long it will take. Currently, "as-is" it takes 117 seconds.
>
> Thanks for looking into this issue.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-14 Thread Roman Kuzmik
Btw, Fyi, first query (AKA: LENGTH(g)) with an index on attribute1 runs 
almost same as second query (AKA: WITH COUNT).
Here, 2nd query with an index takes 4.4 seconds.
But it is still, just one facet. Usually you need a bunch, like in my 
"long" query in very first post.
Let me re-write it using WITH COUNT and create an index on each facet 
will see how long it will take. Currently, "as-is" it takes 117 seconds.

Thanks for looking into this issue.

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-14 Thread Jan
Hi,

I tried it myself on my local laptop, and here are the results:

Original query:

   FOR a IN Asset COLLECT attr = a.attribute1 INTO g RETURN { value: attr, 
count: length(g) }

This executes in about 35 seconds with the 8M documents. The execution plan 
is not ideal, because it will sort the entire collection first:

Execution plan:
 Id   NodeType Est.   Comment
  1   SingletonNode   1   * ROOT
  2   EnumerateCollectionNode   800 - FOR a IN p   /* full 
collection scan */
  3   CalculationNode   800   - LET #3 = a.`attribute1`   
/* attribute expression */   /* collections used: a : p */
  7   SortNode  800   - SORT #3 ASC
  4   CollectNode   640   - COLLECT attr = #3 INTO g   
/* sorted */
  5   CalculationNode   640   - LET #5 = { "value" : attr, 
"count" : LENGTH(g) }   /* simple expression */
  6   ReturnNode640   - RETURN #5


Adjusted query:

FOR a IN Asset COLLECT value = a.attribute1 WITH COUNT INTO length 
RETURN { value, length }

Changing the query as I suggested makes it finish in 6.x seconds. The 
execution plan is better already:

Execution plan:
 Id   NodeType Est.   Comment
  1   SingletonNode   1   * ROOT
  2   EnumerateCollectionNode   800 - FOR a IN p   /* full 
collection scan */
  3   CalculationNode   800   - LET #3 = a.`attribute1`   
/* attribute expression */   /* collections used: a : p */
  4   CollectNode   640   - COLLECT value = #3 WITH 
COUNT INTO length   /* hash */
  7   SortNode  640   - SORT value ASC
  5   CalculationNode   640   - LET #5 = { "value" : value, 
"length" : length }   /* simple expression */
  6   ReturnNode640   - RETURN #5

With a sorted (skiplist) index on "attribute1", the execution time goes 
down to around 5.3 seconds, and the plan changes to:

Execution plan:
 Id   NodeType Est.   Comment
  1   SingletonNode   1   * ROOT
 10   IndexNode 800 - FOR a IN p   /* skiplist index scan */
  3   CalculationNode   800   - LET #3 = a.`attribute1`   /* 
attribute expression */   /* collections used: a : p */
  4   CollectNode   640   - COLLECT value = #3 WITH COUNT INTO 
length   /* sorted */
  7   CalculationNode   640   - LET #7 = { "value" : value, 
"length" : length }   /* simple expression */
  8   ReturnNode640   - RETURN #7

Still not ideal, but already better than the initial 30+ seconds.
That's all that right now comes to my mind that can easily be done for this 
particular query.
Maybe there is something more that can be optimized here, but this may 
require changes to the code.

Best regards
Jan

Am Donnerstag, 14. September 2017 20:38:31 UTC+2 schrieb Roman Kuzmik:
>
> Thanks Jan for your reply!
>
> But, yes, we have tried "2.x old school" approach* WITH COUNT*, as well 
> as brand new* DISTINCT*.
> Both yields similar sluggish results :-/
>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-14 Thread Roman Kuzmik
Thanks Jan for your reply!

But, yes, we have tried "2.x old school" approach* WITH COUNT*, as well as 
brand new* DISTINCT*.
Both yields similar sluggish results :-/

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[arangodb-google] Re: Faceted Search Performance

2017-09-14 Thread Jan
Hi,

one of the things to do for improving the query performance is to get rid 
of the "INTO" clause, as "INTO" will copy all documents found per group 
into a new variable "g":

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

The query without "INTO" would look like this:

 FOR a in Asset 
  COLLECT value = a.attribute1 WITH COUNT INTO length
 RETURN { value, length }

It should be faster than the one with "INTO", but I am not sure how much. 
This probably depends on the actual data.

Can you give it a try?
Thanks
Jan




Am Donnerstag, 14. September 2017 15:53:20 UTC+2 schrieb Roman Kuzmik:
>
> We are evaluating ArangoDB performance in space of facets calculations.
> There are number of other products capable of doing the same, either via 
> special API  or query language:
>
>- MarkLogic Facets 
>- ElasticSearch Aggregations 
>
> 
>- Solr Faceting 
>
>- etc
>
> We understand, there is no special API in Arango to calculate factes 
> explicitly.
> But in reality, it is not needed, thanks for a comprehensive AQL it can be 
> easily achieved via simple query, like:
>
>  FOR a in Asset 
>   COLLECT attr = a.attribute1 INTO g
>  RETURN { value: attr, count: length(g) }
>
> This query calculate a facet on *attribute1* and yields frequency in the 
> form of:
>
> [
>   {
> "value": "test-attr1-1",
> "count": 200
>   },
>   {
> "value": "test-attr1-2",
> "count": 200
>   },
>   {
> "value": "test-attr1-3",
> "count": 300
>   }
> ]
>
>
> It is saying, that across my entire collection *attribute1* took three 
> forms (test-attr1-1, test-attr1-2 and test-attr1-3) with related counts 
> provided.
> Pretty much we run a DISTINCT query and aggregated counts.
>
> Looks simple and clean. With only one, but really big issue - 
> *performance*.
>
> Provided query above runs for !*31 seconds*! on top of the test 
> collection with only *8M* documents.
> We have experimented with different index types, storage engines (with 
> rocksdb and without), investigating explanation plans at no avail.
> Test documents we use in this test are very concise with only three short 
> attributes.
>
> We would appreciate any input at this point.
> Either we doing something wrong. Or ArangoDB simply is not designed to 
> perform in this particular area.
>
> btw, ultimate goal would be to run something like the following in 
> under-second time:
>
> LET docs = (FOR a IN Asset 
>
>   FILTER a.name like 'test-asset-%'
>
>   SORT a.name
>
>  RETURN a)
>
> LET attribute1 = (
>
>  FOR a in docs 
>
>   COLLECT attr = a.attribute1 INTO g
>
>  RETURN { value: attr, count: length(g[*])}
>
> )
>
> LET attribute2 = (
>
>  FOR a in docs 
>
>   COLLECT attr = a.attribute2 INTO g
>
>  RETURN { value: attr, count: length(g[*])}
>
> )
>
> LET attribute3 = (
>
>  FOR a in docs 
>
>   COLLECT attr = a.attribute3 INTO g
>
>  RETURN { value: attr, count: length(g[*])}
>
> )
>
> LET attribute4 = (
>
>  FOR a in docs 
>
>   COLLECT attr = a.attribute4 INTO g
>
>  RETURN { value: attr, count: length(g[*])}
>
> )
>
> RETURN {
>
>   counts: (RETURN {
>
> total: LENGTH(docs), 
>
> offset: 2, 
>
> to: 4, 
>
> facets: {
>
>   attribute1: {
>
> from: 0, 
>
> to: 5,
>
> total: LENGTH(attribute1)
>
>   },
>
>   attribute2: {
>
> from: 5, 
>
> to: 10,
>
> total: LENGTH(attribute2)
>
>   },
>
>   attribute3: {
>
> from: 0, 
>
> to: 1000,
>
> total: LENGTH(attribute3)
>
>   },
>
>   attribute4: {
>
> from: 0, 
>
> to: 1000,
>
> total: LENGTH(attribute4)
>
>   }
>
> }
>
>   }),
>
>   items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
>
>   facets: {
>
> attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
>
> attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
>
> attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
>
> attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 
> return a)
>
>}
>
> }
>
> Thanks!
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.