Hello friends, how is your day?
I wanted to ask for suggestions on how to load data into OrientDB
efficiently.
I tried various approaches and none of them were sufficient.
My setup at the moment is my laptop, which has 8 GB of RAM and an i7
processor.
The data I'm trying to ingest is quite modest: 40k nodes with some
properties. I need to be able to run the same loading job idempotently,
that is, running it twice must not produce twice the data.
The approaches I tried are:
1. Calling foreachPartition on the DataFrame and creating a connection
to OrientDB inside each partition (because the classes are not thread
safe). From there, I determine an identifying value for each record,
e.g. the 'person_id' column, and query OrientDB (against an index, of
course) to see whether such a node already exists. If it does, I skip
it; otherwise I create it.
This approach is terribly slow: it ran for over 15 hours before I gave
up.
2. I tried modifying this approach to not query OrientDB before
inserting. It does work, but then all of my data is duplicated, since I
can't assign my own ID.
3. Using the Spark connector for OrientDB, but it messed up my data
model, complaining that parts of it either don't exist or already exist.
Looking at its source code, it seems to handle idempotency by first
deleting the graph, which is not what I want.
4. Using Batch. It won't work for a remote database, which is a problem
since the job runs on a different server.
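To make approaches 1 and 2 concrete, here is a simplified sketch of the per-partition logic. It is an illustration only, not my actual job: `FakeGraph` is a made-up in-memory stand-in for an OrientDB connection (none of its methods are real OrientDB API), and `load_partition` is a hypothetical name for the kind of function that would be handed to foreachPartition.

```python
from typing import Dict, Iterable, List, Optional


class FakeGraph:
    """Hypothetical in-memory stand-in for one OrientDB connection."""

    def __init__(self) -> None:
        self.vertices: List[Dict] = []            # every stored node
        self.by_person_id: Dict[int, Dict] = {}   # plays the role of the index
        self.lookups = 0                          # existence checks issued

    def find(self, person_id: int) -> Optional[Dict]:
        # In the real job this is one remote query per record.
        self.lookups += 1
        return self.by_person_id.get(person_id)

    def create(self, props: Dict) -> None:
        self.vertices.append(props)
        self.by_person_id[props["person_id"]] = props


def load_partition(rows: Iterable[Dict], graph: FakeGraph,
                   check_first: bool = True) -> int:
    """Insert each row as a node and return how many nodes were created.

    check_first=True  is approach 1: look the id up, skip rows that exist.
    check_first=False is approach 2: insert blindly.
    """
    created = 0
    for row in rows:
        if check_first and graph.find(row["person_id"]) is not None:
            continue  # node already present, nothing to do
        graph.create(dict(row))
        created += 1
    return created


rows = [{"person_id": i, "name": f"person-{i}"} for i in range(5)]

# Approach 1: two runs leave 5 nodes, at the cost of one lookup per row per run.
checked = FakeGraph()
load_partition(rows, checked)
load_partition(rows, checked)
print(len(checked.vertices), checked.lookups)   # 5 10

# Approach 2: no lookups, but the second run doubles the data.
blind = FakeGraph()
load_partition(rows, blind, check_first=False)
load_partition(rows, blind, check_first=False)
print(len(blind.vertices), blind.lookups)       # 10 0
```

The sketch shows the trade-off I'm stuck with: approach 1 is idempotent but pays one lookup per record on every run, while approach 2 avoids the lookups and duplicates everything.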
I don't think these designs are all that great, but I would expect them
to take a couple of hours at most, not 15.
Any advice?