Hi Jan,
Sorry to dig up an old post, but I have a follow-up question.
Your last recommendation worked well: after creating a skiplist on
['userId', 'time'], I could use that index efficiently with this query:
for s in shares
  filter s.userId == 'user-id'
  sort s.time desc
  return s
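For reference, I created that index from arangosh roughly like this:

// create the skiplist index on the document collection
db.shares.ensureIndex({ type: "skiplist", fields: ["userId", "time"] });
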
Now I tried to achieve the same with an edge collection by creating a
skiplist on ['_from', 'time'], but this query:
for s in edgeColl
  filter s._from == 'user-id'
  sort s.time desc
  return s
doesn't use that skiplist index; it only uses the default ['_from', '_to']
edge index.
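For completeness, the edge collection index was created the same way
(again from arangosh):

// skiplist on the edge collection, covering _from plus the sort field
db.edgeColl.ensureIndex({ type: "skiplist", fields: ["_from", "time"] });
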
Is there a way to force the usage of my skiplist?
Thanks,
Thomas
On Friday, April 21, 2017 at 8:51:01 PM UTC+8, Jan wrote:
> Hi Thomas,
>
> that query looks ok, but I think it does more work than necessary.
> It uses the index to find all documents that have a compoundField
> value >= "userId_". That creates an index scan with the correct lower
> bound, but an open-ended upper bound.
> So it will read a lot of documents that aren't needed and post-filter
> them using the extra LIKE filter expression.
>
> A better way would be to set the upper bound too, e.g.
>
> for s in shares
>   filter s.compoundField >= 'userId_minValueForTime'
>   filter s.compoundField <= 'userId_maxValueForTime'
>   sort s.compoundField desc
>   return s
>
> That will avoid post-filtering and only return the documents from the
> index that are actually needed.
> Of course "minValueForTime" and "maxValueForTime" need to be replaced
> with the min and max possible values for the time part. If the time part
> is an ISO datetime string, it could look like this:
>
> for s in shares
>   filter s.compoundField >= 'userId_0000-00-00'
>   filter s.compoundField <= 'userId_9999-99-99'
>   sort s.compoundField desc
>   return s
>
> These bounds should be sufficient to include all possible date ranges.
>
> Another possibility is to not use a compound field, but two separate
> fields "userId" and "time", and to index them with a skiplist index on
> ["userId", "time"].
> The query then becomes:
>
> for s in shares
>   filter s.userId == 'userId'
>   sort s.time desc
>   return s
>
> That should also allow usage of the skiplist index.
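>
> To double-check which index the optimizer picks, you can print the
> execution plan from arangosh, e.g.:
>
> // show the execution plan, including the indexes that were selected
> db._explain("for s in shares filter s.userId == 'userId' sort s.time desc return s");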
> Best regards
> Jan
>
>
>
>
> On Wednesday, April 19, 2017 at 06:38:34 UTC+2, Thomas Weiss wrote:
>>
>> Trying to answer my own question: I'm looking for a way to efficiently
>> 1. list all things shared with a user
>> 2. sort those things by date
>> without the complexity of the query increasing with the amount of data
>> in the DB (it's a social network feed, so it will always grow). As
>> explained in the previous message, a naive approach (iterating over
>> users, then over their publications) would always consume more CPU
>> and memory.
>>
>> My idea now is to denormalize the shares into a normal (not edge)
>> collection where, for each share, I would store a document with a field
>> composed as {userId}_{time of the share}, and create a skiplist index on
>> that field.
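>> For example, each share document could be written like this (a sketch
>> from arangosh; the exact time format is up to you):
>>
>> // store the compound "userId_time" value in a single indexed field
>> db.shares.insert({ compoundField: "user-id" + "_" + new Date().toISOString() });
>>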
>> The query would then be:
>> for s in shares
>>   filter s.compoundField >= 'userId_'
>>   sort s.compoundField desc
>>   filter LIKE(s.compoundField, 'userId_%')
>>   return s
>>
>> The explanation of the query looks good, as it seems to use the skiplist
>> directly both to target the userId AND to sort. Your comments would be
>> much appreciated here!
>>
>> Thanks,
>> Thomas
>>
>> On Sunday, April 9, 2017 at 8:19:21 PM UTC+8, Thomas Weiss wrote:
>>>
>>> Working on a social network project, I've recently realized that the
>>> way I was fetching users' feeds was not scalable:
>>> for f in 1 outbound 'users/<user-id>' follows
>>>   for c, p in 1 outbound f hasPublished
>>>     sort p.publishedAt
>>>     return c
>>> As the number of users and published content grows, this query consumes
>>> ever more CPU and memory!
>>>
>>> So in order to optimize that, I've started to denormalize my data by
>>> using an 'isSharedWith' edge collection that links users to published
>>> content and has a skiplist index on a field named 'lastPublishedAt'.
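>>> Each share is materialized as an edge carrying the publication date,
>>> for example (a sketch from arangosh; the 'publications' collection name
>>> is an assumption):
>>>
>>> // edge from the content to the user, carrying the field used to sort
>>> db.isSharedWith.insert({ _from: "publications/<content-id>",
>>>                          _to: "users/<user-id>",
>>>                          lastPublishedAt: "2017-04-09T12:00:00Z" });
>>> // skiplist index on the date field
>>> db.isSharedWith.ensureIndex({ type: "skiplist", fields: ["lastPublishedAt"] });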
>>> So now my query looks like:
>>> for c, s in 1 inbound 'users/<user-id>' isSharedWith
>>>   sort s.lastPublishedAt desc
>>>   return c
>>>
>>> The "explanation" of this query is:
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   2   TraversalNode       84     - FOR c /* vertex */, s /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'users/283139442928' /* startnode */ isSharedWith
>>>   3   CalculationNode     84     - LET #4 = s.`lastPublishedAt` /* attribute expression */
>>>   4   SortNode            84     - SORT #4 DESC
>>>   5   ReturnNode          84     - RETURN c
>>>
>>> Indexes used:
>>>  By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
>>>   2   edge   isSharedWith   false    false    1.18 %        [ `_from`, `_to` ]   base INBOUND
>>>
>>> Traversals on graphs:
>>>  Id   Depth   Vertex collections   Edge collections   Options                                   Filter conditions
>>>   2   1..1                         isSharedWith       uniqueVertices: none, uniqueEdges: path
>>>
>>> Optimization rules applied:
>>>  none
>>>
>>> But this still doesn't look good to me; it seems that a full traversal
>>> is performed first to retrieve lastPublishedAt, and the results are
>>> then sorted on that field.
>>>
>>> So my question is: is there a way to denormalize and query this kind of
>>> data so that the complexity of the query (getting the x most recent
>>> elements) doesn't grow with the amount of data?
>>>
>>> Thanks in advance,
>>> Thomas
>>>
>>