Hi Jan,
Sorry to dig up an old post, but I have a follow-up question.
Your last recommendation worked well: after creating a skiplist on
['userId', 'time'], I could use that index efficiently with this query:
for s in shares
  filter s.userId == 'user-id'
  sort s.time desc
  return s
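For reference, I created that index from arangosh roughly like this:

// create the skiplist index on the document collection
db.shares.ensureIndex({ type: "skiplist", fields: ["userId", "time"] });
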
Now I tried to achieve the same with an edge collection by creating a
skiplist on ['_from', 'time'], but this query:
for s in edgeColl
  filter s._from == 'user-id'
  sort s.time desc
  return s
doesn't use that skiplist index; it only uses the default ['_from', '_to']
edge index.
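For completeness, the edge collection index was created the same way
(again from arangosh):

// skiplist on the edge collection, covering _from plus the sort field
db.edgeColl.ensureIndex({ type: "skiplist", fields: ["_from", "time"] });
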
Is there a way to force the usage of my skiplist?
Thanks,
Thomas
On Friday, April 21, 2017 at 8:51:01 PM UTC+8, Jan wrote:
> Hi Thomas,
>
> that query looks ok, but I think it does more work than necessary.
> It uses the index to find all documents that have a compoundField
> value >= "userId_". That creates an index scan with the correct lower
> bound, but an open-ended upper bound.
> So it will read a lot of documents that aren't needed and post-filter
> them using the extra LIKE filter expression.
>
> A better way would be to set the upper bound too, e.g.
>
> for s in shares
>   filter s.compoundField >= 'userId_minValueForTime'
>   filter s.compoundField <= 'userId_maxValueForTime'
>   sort s.compoundField desc
>   return s
>
> That will avoid post-filtering and only return the documents from the
> index that are actually needed.
> Of course "minValueForTime" and "maxValueForTime" need to be replaced
> with the min and max possible values for the time part. If the time part
> is an ISO datetime string, it could look like this:
>
> for s in shares
>   filter s.compoundField >= 'userId_0000-00-00'
>   filter s.compoundField <= 'userId_9999-99-99'
>   sort s.compoundField desc
>   return s
>
> These bounds should be sufficient to include all possible date ranges.
>
> Another possibility is to not use a compound field, but two separate
> fields "userId" and "time", and to index them with a skiplist index on
> ["userId", "time"].
> The query then becomes:
>
> for s in shares
>   filter s.userId == 'userId'
>   sort s.time desc
>   return s
>
> That should also allow usage of the skiplist index.
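>
> To double-check which index the optimizer picks, you can print the
> execution plan from arangosh, e.g.:
>
> // show the execution plan, including the indexes that were selected
> db._explain("for s in shares filter s.userId == 'userId' sort s.time desc return s");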
> Best regards
> Jan
>
>
>
>
> On Wednesday, April 19, 2017 at 06:38:34 UTC+2, Thomas Weiss wrote:
>>
>> Trying to answer my own question: I'm looking for a way to efficiently
>> 1. list all things shared with a user
>> 2. sort those things by date
>> without the complexity of the query increasing with the amount of data
>> in the DB (it's a social network feed, so it will always grow). As
>> explained in the previous message, a naive approach (iterating over
>> users, then over their publications) would always consume more CPU
>> and memory.
>>
>> My idea now is to denormalize the shares into a normal (not edge)
>> collection where, for each share, I would store a document with a field
>> composed as {userId}_{time of the share}, and create a skiplist index on
>> that field.
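>> For example, each share document could be written like this (a sketch
>> from arangosh; the exact time format is up to you):
>>
>> // store the compound "userId_time" value in a single indexed field
>> db.shares.insert({ compoundField: "user-id" + "_" + new Date().toISOString() });
>>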
>> The query would then be:
>> for s in shares
>>   filter s.compoundField >= 'userId_'
>>   sort s.compoundField desc
>>   filter LIKE(s.compoundField, 'userId_%')
>>   return s
>>
>> The explanation of the query looks good, as it seems to use the skiplist
>> directly both to target the userId AND to sort. Your comments would be
>> much appreciated here!
>>
>> Thanks,
>> Thomas
>>
>> On Sunday, April 9, 2017 at 8:19:21 PM UTC+8, Thomas Weiss wrote:
>>>
>>> Working on a social network project, I've recently realized that the
>>> way I was fetching users' feeds was not scalable:
>>> for f in 1 outbound 'users/<user-id>' follows
>>>   for c, p in 1 outbound f hasPublished
>>>     sort p.publishedAt
>>>     return c
>>> As the number of users and published content grows, this query consumes
>>> ever more CPU and memory!
>>>
>>> So in order to optimize that, I've started to denormalize my data by
>>> using an 'isSharedWith' edge collection that links users to published
>>> content and has a skiplist index on a field named 'lastPublishedAt'.
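>>> Each share is materialized as an edge carrying the publication date,
>>> for example (a sketch from arangosh; the 'publications' collection name
>>> is an assumption):
>>>
>>> // edge from the content to the user, carrying the field used to sort
>>> db.isSharedWith.insert({ _from: "publications/<content-id>",
>>>                          _to: "users/<user-id>",
>>>                          lastPublishedAt: "2017-04-09T12:00:00Z" });
>>> // skiplist index on the date field
>>> db.isSharedWith.ensureIndex({ type: "skiplist", fields: ["lastPublishedAt"] });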
>>> So now my query looks like:
>>> for c, s in 1 inbound 'users/<user-id>' isSharedWith
>>>   sort s.lastPublishedAt desc
>>>   return c
>>>
>>> The "explanation" of this query is:
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   2   TraversalNode       84     - FOR c /* vertex */, s /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'users/283139442928' /* startnode */ isSharedWith
>>>   3   CalculationNode     84     - LET #4 = s.`lastPublishedAt` /* attribute expression */
>>>   4   SortNode            84     - SORT #4 DESC
>>>   5   ReturnNode          84     - RETURN c
>>>
>>> Indexes used:
>>>  By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
>>>   2   edge   isSharedWith   false    false    1.18 %        [ `_from`, `_to` ]   base INBOUND
>>>
>>> Traversals on graphs:
>>>  Id   Depth   Vertex collections   Edge collections   Options                                   Filter conditions
>>>   2   1..1                         isSharedWith       uniqueVertices: none, uniqueEdges: path
>>>
>>> Optimization rules applied:
>>>  none
>>>
>>> But this still doesn't look good to me; it seems that a full traversal
>>> is performed first to retrieve lastPublishedAt, and the results are
>>> then sorted on that field.
>>>
>>> So my question is: is there a way to denormalize and query this kind of
>>> data so that the complexity of the query (getting the x most recent
>>> elements) doesn't grow with the amount of data?
>>>
>>> Thanks in advance,
>>> Thomas
>>>
>>