Hi Miel,

We run our Spark jobs as an embarrassingly parallel workflow. Each worker processes one partition of a dataframe and uses the RDFLib API to create the statements in memory. When a worker finishes its partition, it sends the payload over HTTP to a remote triplestore as a file. These files are then bulk-loaded into the triplestore.
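
A minimal sketch of the per-partition step described above, written in plain Python so it runs without a Spark cluster. It is an illustration under stated assumptions, not the pipeline from this thread: the N-Triples lines are built by hand instead of through an RDFLib graph, the HTTP upload is left out, and the names `BASE`, `escape_literal`, `partition_to_ntriples`, and the sample rows are all hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical base IRI; substitute your own namespace.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslash, double quote, LF and CR for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def partition_to_ntriples(rows: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Turn one partition of rows into a single N-Triples payload.

    Takes an iterator over a partition's rows and yields results, which is
    the shape Spark's RDD.mapPartitions expects. Assumes every row has an
    "id" column whose value is safe to place in an IRI.
    """
    lines = []
    for row in rows:
        subject = f"<{BASE}record/{row['id']}>"
        for column, value in row.items():
            # Skip the key column and SQL NULLs; emit everything else
            # as a plain string literal.
            if column == "id" or value is None:
                continue
            literal = escape_literal(str(value))
            lines.append(f'{subject} <{BASE}{column}> "{literal}" .')
    # One payload per partition; an empty partition yields nothing.
    if lines:
        yield "\n".join(lines) + "\n"


# Two lists stand in for two Spark partitions.
partitions = [
    [{"id": 1, "name": "Alice"}, {"id": 2, "name": None}],
    [{"id": 3, "name": 'Bob "B"'}],
]
payloads = [p for part in partitions for p in partition_to_ntriples(part)]
```

In a real job the same function could be handed to `df.rdd.mapPartitions(...)` after converting each `Row` with `asDict()`, and an RDFLib `Graph` with `serialize(format="nt")` could replace the hand-built lines. Each yielded payload is what a worker would then upload for bulk loading.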

It will be interesting to see whether using SPARQL Update through the RDFLib SPARQLUpdateStore brings any performance improvement or cost.

Our RDBMS contains just plain tables, no JSON objects, so it's probably a bit easier for us to work within the Spark dataframes compared to your case with the JSON files.

From memory, I think we let Spark infer the schema. If there are any problems with the inferred types, we state them explicitly with a PySpark schema.

Hope this helps.

Cheers,

Edmond

On Thu, Apr 22, 2021 at 6:33 PM Miel Vander Sande <miel.vandersa...@meemoo.be> wrote:

> Hi Edmond,
>
> Great to hear! Maybe some follow-up questions:
> - What do you use to perform the mapping? The rdflib api or SPARQL
>   construct somehow?
> - What's in your RDBs? Does it contain embedded json (that's what we have)
>   or do you have plain tables?
> - Do you use a PySpark Schema?
>
> Best,
>
> Miel
>
> On Thu, 22 Apr 2021 at 02:23, Edmond Chuc <edmond.c...@gmail.com> wrote:
>
>> Hi Miel,
>>
>> You can definitely create ETL pipelines with Apache Spark (using PySpark)
>> and RDFLib. Read the JSON records into a Spark dataframe
>> <https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/>
>> and create triples in a graph with RDFLib. This is how we are producing
>> triples from relational databases with Spark at my workplace. Don't forget
>> to repartition the dataframe into multiple chunks so that chunks of the
>> same dataframe are processed in parallel.
>>
>> Cheers,
>>
>> Edmond
>>
>> On Wed, Apr 21, 2021 at 10:16 PM Miel Vander Sande <
>> miel.vandersa...@meemoo.be> wrote:
>>
>>> Hi all,
>>>
>>> I'm not sure whether this is the right place for this question, but
>>> AFAIK the RDF python community does not have a general community mailing
>>> list like RDF.js?
>>>
>>> I was wondering whether there were any libraries / efforts using RDFLib
>>> to create ETL pipelines for constructing RDF from various sources?
>>> I could definitely use something like that, but couldn't really find
>>> anything yet. The RML-based tools don't really work that well for my use
>>> cases (json records) and they lack some transparency for debugging /
>>> interop with other libraries when producing triples.
>>>
>>> I was already starting to think about a possible API and how it could
>>> leverage Dask or Spark to really scale up. I'm not a Python/data
>>> engineering expert, so this might come across as naive.
>>>
>>> ```
>>> Mapping()             # lazy execution pipeline object
>>>   .load(file1.json)   # Creates graph from direct json mapping
>>>   .construct(query1)  # Creates new graph containing mapping from file1.json graph
>>>   .construct(query2)  # Creates new graph containing mapping from file1.json graph
>>>   .load(file2.json)   # Creates graph from direct json mapping
>>>   .construct(query3)  # Creates new graph containing mapping from file2.json graph
>>>   .collect()          # aggregates all constructed graphs into one
>>>   .check(shacl)       # validate the constructed graph against mapping
>>>   .run()              # actually runs the pipeline
>>> ```
>>>
>>> Best,
>>>
>>> Miel
>>>
>>> --
>>> http://github.com/RDFLib
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "rdflib-dev" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to rdflib-dev+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .

--
http://github.com/RDFLib
---
You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/CAOuzkyS%3DiV7PhNSZUesnrrRv0HjdT_04BgNuyGx0kG-qZeW%2Bvw%40mail.gmail.com.