Re: [orientdb] Slow performance when building a weighted table out of edge-vertex relationships

Jean-Sebastien Lemay Sat, 25 Apr 2015 07:42:05 -0700

Hi there,

OK -- I'm a bit surprised, in a way; this is a straightforward SELECT 
statement.


I decided to run another experiment. My scenario: determine which tags are 
a given user's favourite.

My query in OrientDB:
SELECT in.name AS name, COUNT(in.name) AS weight 
FROM UserUsedTag
WHERE "john.doe" IN out.username
AND timestamp < date()
GROUP BY in.name
ORDER BY weight DESC
Query executed in 0.487 sec. Returned 2 record(s)
soccer 5042
baseball 4958

I've built up an equivalent structure in PostgreSQL to compare performance: 
3 tables (user, tag, user_used_tag_link), with similar data.
My query in PostgreSQL:
SELECT public.tag.name, selection.weight FROM (
 SELECT tag_id, COUNT(tag_id) AS weight FROM public.user_used_tag_link 
 WHERE timestamp < CURRENT_TIMESTAMP
 AND user_id = 1
 GROUP BY tag_id
 ORDER BY weight DESC
) AS selection
JOIN public.tag ON selection.tag_id = public.tag.id
ORDER BY selection.weight DESC;
Query executes in 12ms.
"soccer";6035
"baseball";4811

Seems that I am getting significantly better performance with PostgreSQL 
(12ms vs 487ms).

Upon further digging, indeed it seems that it is simply because I am 
sifting through 11k records on OrientDB that I am getting slow performance 
(lag not caused by traversal, the SELECT on its own is slow). But surely I 
must be doing something wrong -- why is OrientDB lagging for my 
straightforward SELECT on an indexed field, although it is a database built 
for Big Data queries?

Best regards,
Jean





On Saturday, April 25, 2015 at 12:31:02 AM UTC+8, l.garulli wrote:
>
> The bottleneck there is the 11k records that match the condition: 
>
> timestamp < date('2015-04-24 09:40:00') 
>
> Best Regards,
>
> Luca Garulli
> CEO at Orient Technologies LTD
> the Company behind OrientDB
> http://about.me/luca.garulli
>
>
> On 24 April 2015 at 17:02, Jean-Sebastien Lemay <[email protected] 
> <javascript:>> wrote:
>
>> Understood about X2 - - > Y2. However I wanted to confirm if the 
>> bottleneck was due to the system navigating the edge 6000 times to retrieve 
>> the "name" field from the same vertex. 
>>
>> I'm afraid hooks are out of the question since my next query is to 
>> determine, for a specific user, which tags he used the most. I cannot start 
>> creating a table for each user for each day month year etc. Especially 
>> since I mean past [x] months from this instant, not a calendar month. 
>>
>> Can you confirm if my query is slow due to the repeated re-fetching of 
>> the "name" field of my Tag vertex? 
>> On Apr 24, 2015 10:39 PM, "Luca Garulli" <[email protected] 
>> <javascript:>> wrote:
>>
>>> Hi Jean-Sebastien,
>>> If you have X records in your resultset and execution time is Y, then 
>>> with X2 records it will be about Y2. There is not magic here.
>>>
>>> The most common solution, instead, is to pre-aggregate weight in a tree 
>>> (year->month->day->hour->minute->second),
>>> so every query takes a constant time, no matter by the number of events.
>>>
>>> You can do this by writing a Hook that automatically updates counters 
>>> upon create/update/delete.
>>>
>>> In this way the query above would be super fast.
>>>
>>> -- 
>>> Best Regards,
>>>
>>> Luca Garulli
>>> CEO at Orient Technologies LTD
>>> the Company behind OrientDB
>>> http://about.me/luca.garulli
>>>
>>>
>>> On 24 April 2015 at 13:13, Jean-Sebastien Lemay <[email protected] 
>>> <javascript:>> wrote:
>>>
>>>> To validate if OrientDB is the right fit for my project, I've created a 
>>>> simple scenario:
>>>>
>>>>    - Vertex classes (2):
>>>>       - *User*: represents a user
>>>>       - *Tag*: represents a tag that a user can interact with (e.g. 
>>>>       post a comment using that tag). Contains a *name *property
>>>>       
>>>>       - Edge classes (1):
>>>>       - *UserUsedTag*: connects a user to a tag. Contains a 
>>>>       *timestamp *property (which is indexed with NOTUNIQUE)
>>>>    
>>>> To sum it up:
>>>> *[V:User] ---[E:UserUsedTag]--> [V:Tag]*
>>>>
>>>> I'm trying to build a query that will let me know which tags have been 
>>>> the most popular in the past [x] minutes/hours/days/months...
>>>> As such, here is an example of the query I've got right now:
>>>> SELECT inV().name as name, COUNT(in) AS weight 
>>>> FROM UserUsedTag 
>>>> WHERE timestamp < date('2015-04-24 09:40:00') 
>>>> GROUP BY in 
>>>> ORDER BY weight DESC
>>>>
>>>> The query works, and I get a proper result set:
>>>> name
>>>> weight
>>>> baseball6117soccer5003My problem is the performance:
>>>> *Query executed in 0.311 sec. Returned 2 record(s)*
>>>> If it takes 1/3 of a second to sift through ~11,000 results, I can only 
>>>> imagine how crippled the performance will be if I am dealing with millions 
>>>> of edges, which I expect to end up with. In fact, as I add new edges, it 
>>>> seems like the query time increases linearly.
>>>>
>>>> Here is the EXPLAIN for the above query:
>>>> METADATAPROPERTIES
>>>> @version
>>>> resultSize
>>>> fullySortedByIndex
>>>> documentAnalyzedCompatibleClass
>>>> recordReads
>>>> fetchingFromTargetElapsed
>>>> indexIsUsedInOrderBy
>>>> compositeIndexUsed
>>>> current
>>>> documentReads
>>>> projectionElapsed
>>>> limit
>>>> orderByElapsed
>>>> evaluated
>>>> groupByElapsed
>>>> user
>>>> elapsed
>>>> resultType
>>>> involvedIndexes
>>>> 02false1112011120313false1#14:11120 
>>>> <http://172.16.11.2:2480/studio/index.html#/database/TestOrientDb/browse/edit/14:11120>
>>>> 111206-10111200#5:0 
>>>> <http://172.16.11.2:2480/studio/index.html#/database/TestOrientDb/browse/edit/5:0>
>>>> 353.77362collection["UserUsedTag.timestamp"]It seems that 
>>>> 'fetchingFromTargetElapsed' is the biggest bottleneck here. Is this due to 
>>>> the nature of my query? Perhaps my query is not optimal? I am new to graph 
>>>> DBs so I'd like to know if there is any way I can rephrase my query to end 
>>>> up with the same result.
>>>>
>>>> The problem for me is that if I was using a relational DB, I would 
>>>> probably get better performance, if I was to query my "link" table and, 
>>>> with my final two (2) rows, perform one (1) query for each to retrieve the 
>>>> name of the corresponding Tag via the foreign key. Is there any way I can 
>>>> split my query similarly with OrientDB, instead of fetching the same Tag 
>>>> name multiple times?
>>>>
>>>>  -- 
>>>>
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "OrientDB" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected] <javascript:>.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>>
>>>  -- 
>>>
>>> --- 
>>> You received this message because you are subscribed to a topic in the 
>>> Google Groups "OrientDB" group.
>>> To unsubscribe from this topic, visit 
>>> https://groups.google.com/d/topic/orient-database/rLBqzmENmcU/unsubscribe
>>> .
>>> To unsubscribe from this group and all its topics, send an email to 
>>> [email protected] <javascript:>.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>  -- 
>>
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "OrientDB" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [orientdb] Slow performance when building a weighted table out of edge-vertex relationships

Reply via email to