Update: Re-reading what I just wrote, I figured a quick way to do this was 
to export my relational term_doc into a text file, and for each line 
produce the corresponding CREATE EDGE statement. I can imagine CREATE EDGE 
is slower when using subqueries, but in became reasonable as soon as I 
created indices on both term and doc classes.
So I guess this is how I'll run my evaluation for now.

I would still like your help for the following though:

Suppose I create all edges, and suppose each edge gets assigned a weight 
property (I know, this will materialise the edges and make them slower, but 
I really need properties).

Suppose also that each term-doc Vertex pair can have more than one edge (of 
the same class - is this allowed?), each with a different weight:

CREATE EDGE occurs_in from (select from term where id=0) to (select from 
doc where id=10) set weigth = 3
CREATE EDGE occurs_in from (select from term where id=0) to (select from 
doc where id=10) set weight = 5
CREATE EDGE occurs_in from (select from term where id=0) to (select from 
doc where id=11) set weight = 2
etc

How can I get (efficiently), the SUM of all weights for the edges between 
each unique combination of term and doc vertices?

So in the example above I should get 8 for the pair (term=0, doc=10), and 2 
for the pair (term=0, doc=11)


Roberto

On Thursday, 2 October 2014 12:08:46 UTC+2, Roberto Cornacchia wrote:
>
> So, suppose I start from the relational tables in my original post.
>
> I guess, if I tried to translate that by hand, I'd get something like:
> -- all terms
> CREATE CLASS term extends V;
> CREATE property id integer;
> CREATE property term string;
> INSERT INTO term VALUES (0, 'OrientDB');
> INSERT INTO term VALUES (1, 'is');
> INSERT INTO term VALUES (2, 'cool');
>
> -- all docs
> CREATE CLASS doc extends V;
> CREATE property id integer;
> CREATE property title string;
> INSERT INTO doc VALUES (10, 'manual');
> INSERT INTO doc VALUES (11, 'license');
>
> -- the edges
> CREATE CLASS occurs_in extends E
> CREATE EDGE occurs_in from (select from term where id=0) to (select from 
> doc where id=10)
> CREATE EDGE occurs_in from (select from term where id=0) to (select from 
> doc where id=11)
> CREATE EDGE occurs_in from (select from term where id=1) to (select from 
> doc where id=10)
> CREATE EDGE occurs_in from (select from term where id=2) to (select from
>  doc where id=10)
>
> This is how I would do it for a small example like this. 
> But in a real scenario, when the relationships come from a stored (large) 
> table (term_doc above) - this is a standard many-to-many relationship 
> schema - I don't know how to construct the edges (either with SQL or with 
> the ETL processor): somehow I need 2 joins (term -> term_doc <- doc) in 
> order to connect the vertices of term and doc.
>
>
> Background: the reason why I'm trying to do this is to replicate real data 
> with real data distributions and sizes, from a relational DB into OrientDB, 
> in order to run some tests and see how it would fit my needs (where it 
> would be slower/faster, which features would be available/missing, etc). 
> Obviously, if I want to compare the two, I need to map the problem into 
> the graph space, and not use OrientDB in a relational way (as I mentioned 
> above, I tried this and as one could expected the performance loss is huge).
>
> Roberto
>
> On Thursday, 2 October 2014 10:35:44 UTC+2, Lvc@ wrote:
>>
>> Hi Roberto,
>> Could you provide a simple example of what you need in terms of source 
>> tables and expected result in OrientDB?
>>
>> Lvc@
>> ᐧ
>>
>> On 2 October 2014 09:56, Roberto Cornacchia <[email protected]> 
>> wrote:
>>
>>> Thanks Luca,
>>>
>>> I just had a look at the documentation for ETL.
>>> If I understand correctly, it's not possible to start from the 3 
>>> separate tables like above (term, doc, term_doc), as the joinFieldName 
>>> needs to refer to a field of one of the Vertex tables (either term or doc 
>>> in this case). 
>>> Is that correct? Or can this still be done?
>>> So I'd have to join the tables before using the ETL module?
>>>
>>>
>>> Best, Roberto
>>>
>>> On Wednesday, 1 October 2014 18:47:28 UTC+2, Lvc@ wrote:
>>>>
>>>> Hi Roberto,
>>>> I suggest you to use the ETL module: much more powerful and fast:
>>>>
>>>> http://www.orientechnologies.com/docs/last/orientdb-etl.
>>>> wiki/Import-from-DBMS.html
>>>>
>>>> Lvc@
>>>>
>>>> ᐧ
>>>>
>>>> On 1 October 2014 17:56, Roberto Cornacchia <[email protected]> 
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm trying to approach OrientDB, coming from relational background.
>>>>>
>>>>> I've already looked at this: http://www.orientechnologies.com/docs/
>>>>> last/orientdb.wiki/Import-RDBMS-to-Graph-Model.html
>>>>>
>>>>> However, I'm not sure how I would do this when I start from a 
>>>>> many-to-many relationship.
>>>>>
>>>>> *Example (term-doc matrix as used in information retrieval - which 
>>>>> term occurs in which document)*
>>>>>
>>>>> -- all terms
>>>>> CREATE TABLE term (id INTEGER, term STRING);
>>>>> INSERT INTO term VALUES (0, 'OrientDB');
>>>>> INSERT INTO term VALUES (1, 'is');
>>>>> INSERT INTO term VALUES (2, 'cool');
>>>>>
>>>>> -- all docs
>>>>> CREATE TABLE doc (id INTEGER, title STRING);
>>>>> INSERT INTO doc VALUES (10, 'manual');
>>>>> INSERT INTO doc VALUES (11, 'license');
>>>>>
>>>>> -- many-to-many relations
>>>>> CREATE TABLE term_doc (term_id INTEGER, doc_id INTEGER);
>>>>> INSERT INTO term_doc VALUES (0, 10);
>>>>> INSERT INTO term_doc VALUES (0, 11);
>>>>> INSERT INTO term_doc VALUES (1, 10);
>>>>> INSERT INTO term_doc VALUES (2, 10);
>>>>>
>>>>> *First question:*
>>>>> With OrientDB, turning the term and the doc records into Vertex 
>>>>> records is easy. No problem there.
>>>>> How could I take the content of term_doc and create the respective 
>>>>> edges? Can this be done in the OrientDB SQL console?
>>>>>
>>>>>
>>>>> *Second question:*Supposing I have solved the first issue, and I want 
>>>>> to count how many times each term appears in each document.
>>>>> In relational, this would be:
>>>>> SELECT term_id, doc_id, count(*)
>>>>> FROM   term_doc
>>>>> GROUP  BY term,doc;
>>>>>
>>>>> Now, I suppose this is not how you would do it once you have modeled 
>>>>> the term-doc problem into a graph database. Also because when I tried to 
>>>>> do 
>>>>> run this query against a simple import of term_doc, it was incredibly 
>>>>> slow.
>>>>> So, supposing I have real edges between term and doc, instead of the 
>>>>> explicit table term_doc, how could I obtain the count I mention above?
>>>>>
>>>>> Thanks!
>>>>> Roberto
>>>>>
>>>>> -- 
>>>>>
>>>>> --- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "OrientDB" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>  -- 
>>>
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "OrientDB" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to