[orientdb] Efficiently load data into OrientDB from Spark

alex Mon, 25 Sep 2017 23:41:33 -0700

Hello friends, how is your day?

I wanted to ask for suggestions on how to load data into OrientDB 
efficiently.
I tried various approaches and none of them were fast enough.


My setup at the moment is my laptop, which has 8 GB of RAM and an i7 
processor.

The data I'm trying to ingest is quite modest: 40k nodes with some 
properties. I need to be able to run the same loading job idempotently, 
that is, running it twice must not produce twice the data.

The approaches I tried are:


   1. Calling foreachPartition on the dataframe and creating a connection 
   to OrientDB inside it (because the classes are not thread safe). From 
   there, I determine an identifying value for the record, e.g. the 
   'person_id' column, and query OrientDB (against an index, of course) to 
   see whether such a node already exists. If it does, I skip it; 
   otherwise I create it. This approach is terribly slow: it ran for over 
   15 hours before I gave up.
   2. I tried modifying this approach to not query OrientDB before 
   inserting. It does work, but then all of my data is duplicated, since I 
   can't assign my own ID.
   3. Using the Spark connector for OrientDB, but it messed up my data 
   model, complaining either that parts of it don't exist or that they 
   already exist.
   4. Looking at the connector's source code, it seems that it handles 
   idempotency by first deleting the graph, which is not what I want.
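
To make approach 1 concrete, here is a minimal sketch of the per-partition 
check-then-create loop, written in plain Python so it runs without Spark or 
OrientDB. InMemoryStore, load_partition and the sample rows are hypothetical 
names made up for illustration; the store is only a stand-in for a real 
OrientDB connection and is not the OrientDB client API.

```python
from typing import Dict, Iterable


class InMemoryStore:
    """Hypothetical stand-in for one OrientDB connection (not the real API)."""

    def __init__(self, index: Dict[str, dict]) -> None:
        # `index` plays the role of the unique index on person_id.
        self.index = index
        self.lookups = 0

    def exists(self, person_id: str) -> bool:
        # In the real job, each call would be one round trip to the server.
        self.lookups += 1
        return person_id in self.index

    def create(self, row: dict) -> None:
        self.index[row["person_id"]] = dict(row)


def load_partition(rows: Iterable[dict], index: Dict[str, dict]) -> int:
    """Create only the rows that are not stored yet; return how many were created."""
    store = InMemoryStore(index)  # one connection per partition
    created = 0
    for row in rows:
        if store.exists(row["person_id"]):
            continue  # node already loaded by an earlier run, skip it
        store.create(row)
        created += 1
    return created


# Shared dict standing in for the remote database.
database: Dict[str, dict] = {}
partition = [
    {"person_id": "p1", "name": "Ann"},
    {"person_id": "p2", "name": "Bob"},
]

first_run = load_partition(partition, database)   # creates both rows
second_run = load_partition(partition, database)  # finds both, creates none
```

Running the same partition twice leaves the stand-in database unchanged, 
which is the idempotent behaviour I am after. The sketch also shows the cost 
of this design: one lookup per row, so 40k nodes mean 40k separate queries 
on top of the inserts.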

Using Batch won't work for a remote database, which is a problem since 
the job runs on a different server.
 
I think these designs aren't all that great, but I would expect them to run 
for a couple of hours at most, not 15.

Any advice?

