Hi Thomas,

That query looks OK, but I think it does more work than necessary.
It uses the index to find all documents with a compoundField value 
>= "userId_". That creates an index scan with the correct lower bound, 
but an open-ended upper bound.
The scan will therefore read many documents that are not needed and 
post-filter them using the extra LIKE filter expression.

A better way is to also set the upper bound, e.g.:

for s in shares
filter s.compoundField >= 'userId_minValueForTime'
filter s.compoundField <= 'userId_maxValueForTime'
sort s.compoundField desc
return s

That will avoid post-filtering and only return the documents from the index 
that are actually needed. 
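If you want to verify that, you can print the execution plan from arangosh. 
Just a sketch with the same placeholder values as above; the "Indexes used" 
section of the output should then show the skiplist index with both bounds:

// print the execution plan for the query, without executing it
db._explain(
  "for s in shares " +
  "filter s.compoundField >= 'userId_minValueForTime' " +
  "filter s.compoundField <= 'userId_maxValueForTime' " +
  "sort s.compoundField desc " +
  "return s");
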
Of course "minValueForTime" and "maxValueForTime" need to be replaced with 
the min and max possible values for the time part. If ISO datetime strings, 
it could look like this:

for s in shares
filter s.compoundField >= 'userId_0000-00-00'
filter s.compoundField <= 'userId_9999-99-99'
sort s.compoundField desc
return s

These bounds sort below and above every valid ISO date string, so they are 
sufficient to cover all possible time values.
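
In a real application the user id will most likely come in as a bind 
parameter rather than being hard-coded. Assuming a bind parameter @userId 
(just a sketch, the name is a placeholder), the bounds can be built with 
CONCAT; as both bounds are still constant for the whole query, the optimizer 
should still be able to turn them into index range bounds:

for s in shares
// both bounds are constant expressions, so they can become index bounds
filter s.compoundField >= CONCAT(@userId, '_0000-00-00')
filter s.compoundField <= CONCAT(@userId, '_9999-99-99')
sort s.compoundField desc
return s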

Another possibility is to not use a compound field, but two separate fields 
"userId" and "time", indexed with a skiplist index on ["userId", "time"].
The query then becomes:

for s in shares
filter s.userId == 'userId'
sort s.time desc
return s

That query can also use the skiplist index, both for the filter on userId 
and for the sort on time.
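For completeness, creating that index from arangosh could look like this 
(a sketch, assuming the collection is indeed named "shares"):

// compound skiplist index: equality lookup on userId, sorted access on time
db.shares.ensureIndex({ type: "skiplist", fields: ["userId", "time"] });
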
Best regards
Jan
On Wednesday, April 19, 2017 at 06:38:34 UTC+2, Thomas Weiss wrote:
>
> Trying to answer my own question, I'm looking for a way to efficiently:
> 1. list all things shared with a user
> 2. sort those things by a date
> and I don't want the complexity of this query to increase with the amount 
> of data that's in the DB (it's a social network feed so it will always 
> grow). As explained in the previous message, a naive approach (iterating on 
> users, then iterating on their publications) would always consume more CPU 
> and memory.
>
> My idea now is to denormalize the shares in a normal (not edge) 
> collection, where for each share I would store a document with a field 
> composed of: {userId}_{time of the share} and create a skiplist index on 
> that field.
> The query would then be:
> for s in shares
> filter s.compoundField >= 'userId_'
> sort s.compoundField desc
> filter LIKE(s.compoundField, 'userId_%')
> return s
>
> The explanation of the query looks good, as it seems to directly use the 
> skiplist to target the userId AND sort. Your comments would be very 
> appreciated here!
>
> Thanks,
> Thomas
>
> On Sunday, April 9, 2017 at 8:19:21 PM UTC+8, Thomas Weiss wrote:
>>
>> Working on a social network project, I've recently realized that the way 
>> I was fetching users' feed was not scalable:
>> for f in 1 outbound 'users/<user-id>' follows
>> for c, p in 1 outbound f hasPublished
>> sort p.publishedAt
>> return c
>> As the number of users and published content grows, this query would 
>> always consume more CPU and memory! 
>>
>> So in order to optimize that, I've started to denormalize my data by 
>> using a 'isSharedWith' edge collection that links users to published 
>> content and has a skiplist index on a field named 'lastPublishedAt'.
>> So now my query looks like:
>> for c, s in 1 inbound 'users/<user-id>' isSharedWith
>> sort s.lastPublishedAt desc
>> return c
>>
>> The "explanation" of this query is:
>> Execution plan:
>>  Id   NodeType          Est.   Comment
>>   1   SingletonNode        1   * ROOT
>>   2   TraversalNode       84     - FOR c  /* vertex */, s  /* edge */ IN 1..1  /* min..maxPathDepth */ INBOUND 'users/283139442928' /* startnode */  isSharedWith
>>   3   CalculationNode     84     - LET #4 = s.`lastPublishedAt`   /* attribute expression */
>>   4   SortNode            84     - SORT #4 DESC
>>   5   ReturnNode          84     - RETURN c
>>
>>
>> Indexes used:
>>  By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
>>   2   edge   isSharedWith   false    false         1.18 %   [ `_from`, `_to` ]   base INBOUND
>>
>>
>> Traversals on graphs:
>>  Id   Depth   Vertex collections   Edge collections   Options                                   Filter conditions
>>   2   1..1                         isSharedWith       uniqueVertices: none, uniqueEdges: path
>>
>>
>> Optimization rules applied:
>>  none
>>
>> But this still doesn't look good to me; it seems that a full traversal is 
>> performed first in order to retrieve lastPublishedAt, and the results are 
>> then sorted on that field.
>>
>> So my question is, would there be a way to denormalize + query that kind 
>> of data in order to make sure that the complexity of the query (getting the 
>> x most recent elements) doesn't grow with the amount of data?
>>
>> Thanks in advance,
>> Thomas
>>
>
