Update: Re-reading what I just wrote, I figured a quick way to do this was to export my relational term_doc into a text file, and for each line produce the corresponding CREATE EDGE statement. I can imagine CREATE EDGE is slower when using subqueries, but in became reasonable as soon as I created indices on both term and doc classes. So I guess this is how I'll run my evaluation for now.
I would still like your help for the following though: Suppose I create all edges, and suppose each edge gets assigned a weight property (I know, this will materialise the edges and make them slower, but I really need properties). Suppose also that each term-doc Vertex pair can have more than one edge (of the same class - is this allowed?), each with a different weight: CREATE EDGE occurs_in from (select from term where id=0) to (select from doc where id=10) set weigth = 3 CREATE EDGE occurs_in from (select from term where id=0) to (select from doc where id=10) set weight = 5 CREATE EDGE occurs_in from (select from term where id=0) to (select from doc where id=11) set weight = 2 etc How can I get (efficiently), the SUM of all weights for the edges between each unique combination of term and doc vertices? So in the example above I should get 8 for the pair (term=0, doc=10), and 2 for the pair (term=0, doc=11) Roberto On Thursday, 2 October 2014 12:08:46 UTC+2, Roberto Cornacchia wrote: > > So, suppose I start from the relational tables in my original post. > > I guess, if I tried to translate that by hand, I'd get something like: > -- all terms > CREATE CLASS term extends V; > CREATE property id integer; > CREATE property term string; > INSERT INTO term VALUES (0, 'OrientDB'); > INSERT INTO term VALUES (1, 'is'); > INSERT INTO term VALUES (2, 'cool'); > > -- all docs > CREATE CLASS doc extends V; > CREATE property id integer; > CREATE property title string; > INSERT INTO doc VALUES (10, 'manual'); > INSERT INTO doc VALUES (11, 'license'); > > -- the edges > CREATE CLASS occurs_in extends E > CREATE EDGE occurs_in from (select from term where id=0) to (select from > doc where id=10) > CREATE EDGE occurs_in from (select from term where id=0) to (select from > doc where id=11) > CREATE EDGE occurs_in from (select from term where id=1) to (select from > doc where id=10) > CREATE EDGE occurs_in from (select from term where id=2) to (select from > doc where id=10) > > This is how I would do it for a small example like this. > But in a real scenario, when the relationships come from a stored (large) > table (term_doc above) - this is a standard many-to-many relationship > schema - I don't know how to construct the edges (either with SQL or with > the ETL processor): somehow I need 2 joins (term -> term_doc <- doc) in > order to connect the vertices of term and doc. > > > Background: the reason why I'm trying to do this is to replicate real data > with real data distributions and sizes, from a relational DB into OrientDB, > in order to run some tests and see how it would fit my needs (where it > would be slower/faster, which features would be available/missing, etc). > Obviously, if I want to compare the two, I need to map the problem into > the graph space, and not use OrientDB in a relational way (as I mentioned > above, I tried this and as one could expected the performance loss is huge). > > Roberto > > On Thursday, 2 October 2014 10:35:44 UTC+2, Lvc@ wrote: >> >> Hi Roberto, >> Could you provide a simple example of what you need in terms of source >> tables and expected result in OrientDB? >> >> Lvc@ >> ᐧ >> >> On 2 October 2014 09:56, Roberto Cornacchia <[email protected]> >> wrote: >> >>> Thanks Luca, >>> >>> I just had a look at the documentation for ETL. >>> If I understand correctly, it's not possible to start from the 3 >>> separate tables like above (term, doc, term_doc), as the joinFieldName >>> needs to refer to a field of one of the Vertex tables (either term or doc >>> in this case). >>> Is that correct? Or can this still be done? >>> So I'd have to join the tables before using the ETL module? >>> >>> >>> Best, Roberto >>> >>> On Wednesday, 1 October 2014 18:47:28 UTC+2, Lvc@ wrote: >>>> >>>> Hi Roberto, >>>> I suggest you to use the ETL module: much more powerful and fast: >>>> >>>> http://www.orientechnologies.com/docs/last/orientdb-etl. >>>> wiki/Import-from-DBMS.html >>>> >>>> Lvc@ >>>> >>>> ᐧ >>>> >>>> On 1 October 2014 17:56, Roberto Cornacchia <[email protected]> >>>> wrote: >>>> >>>>> Hi there, >>>>> >>>>> I'm trying to approach OrientDB, coming from relational background. >>>>> >>>>> I've already looked at this: http://www.orientechnologies.com/docs/ >>>>> last/orientdb.wiki/Import-RDBMS-to-Graph-Model.html >>>>> >>>>> However, I'm not sure how I would do this when I start from a >>>>> many-to-many relationship. >>>>> >>>>> *Example (term-doc matrix as used in information retrieval - which >>>>> term occurs in which document)* >>>>> >>>>> -- all terms >>>>> CREATE TABLE term (id INTEGER, term STRING); >>>>> INSERT INTO term VALUES (0, 'OrientDB'); >>>>> INSERT INTO term VALUES (1, 'is'); >>>>> INSERT INTO term VALUES (2, 'cool'); >>>>> >>>>> -- all docs >>>>> CREATE TABLE doc (id INTEGER, title STRING); >>>>> INSERT INTO doc VALUES (10, 'manual'); >>>>> INSERT INTO doc VALUES (11, 'license'); >>>>> >>>>> -- many-to-many relations >>>>> CREATE TABLE term_doc (term_id INTEGER, doc_id INTEGER); >>>>> INSERT INTO term_doc VALUES (0, 10); >>>>> INSERT INTO term_doc VALUES (0, 11); >>>>> INSERT INTO term_doc VALUES (1, 10); >>>>> INSERT INTO term_doc VALUES (2, 10); >>>>> >>>>> *First question:* >>>>> With OrientDB, turning the term and the doc records into Vertex >>>>> records is easy. No problem there. >>>>> How could I take the content of term_doc and create the respective >>>>> edges? Can this be done in the OrientDB SQL console? >>>>> >>>>> >>>>> *Second question:*Supposing I have solved the first issue, and I want >>>>> to count how many times each term appears in each document. >>>>> In relational, this would be: >>>>> SELECT term_id, doc_id, count(*) >>>>> FROM term_doc >>>>> GROUP BY term,doc; >>>>> >>>>> Now, I suppose this is not how you would do it once you have modeled >>>>> the term-doc problem into a graph database. Also because when I tried to >>>>> do >>>>> run this query against a simple import of term_doc, it was incredibly >>>>> slow. >>>>> So, supposing I have real edges between term and doc, instead of the >>>>> explicit table term_doc, how could I obtain the count I mention above? >>>>> >>>>> Thanks! >>>>> Roberto >>>>> >>>>> -- >>>>> >>>>> --- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "OrientDB" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to [email protected]. >>>>> For more options, visit https://groups.google.com/d/optout. >>>>> >>>> >>>> -- >>> >>> --- >>> You received this message because you are subscribed to the Google >>> Groups "OrientDB" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- --- You received this message because you are subscribed to the Google Groups "OrientDB" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
