Hi Miel,

You can definitely create ETL pipelines with Apache Spark (using PySpark) and RDFLib. Read the JSON records into a Spark DataFrame <https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/>, then create triples in a graph with RDFLib. This is how we produce triples from relational databases with Spark at my workplace. Don't forget to repartition the DataFrame so that its chunks can be processed in parallel.
Cheers,
Edmond

On Wed, Apr 21, 2021 at 10:16 PM Miel Vander Sande <miel.vandersa...@meemoo.be> wrote:

> Hi all,
>
> I'm not sure whether this is the right place for this question, but AFAIK
> the RDF Python community does not have a general community mailing list
> like RDF.js?
>
> I was wondering whether there were any libraries / efforts using RDFLib to
> create ETL pipelines for constructing RDF from various sources? I could
> definitely use something like that, but couldn't really find anything yet.
> The RML-based tools don't really work that well for my use cases (JSON
> records) and they miss some transparency for debugging / interop with other
> libraries when producing triples.
>
> I was already starting to think about a possible API and how it could
> leverage Dask or Spark to really scale up. I'm not a Python/data
> engineering expert, so this might come across as naive.
>
> ```
> Mapping() # lazy execution pipeline object
> .load(file1.json) # Creates graph from direct json mapping
> .construct(query1) # Creates new graph containing mapping from file1.json graph
> .construct(query2) # Creates new graph containing mapping from file1.json graph
> .load(file2.json) # Creates graph from direct json mapping
> .construct(query3) # Creates new map graph containing mapping from file2.json graph
> .collect() # aggregates all constructed graphs into one
> .check(shacl) # validate the constructed graph against mapping
> .run() # actually runs the pipeline
> ```
>
> Best,
>
> Miel
>
> --
> http://github.com/RDFLib
> ---
> You received this message because you are subscribed to the Google Groups
> "rdflib-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to rdflib-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/CAOuzkyRRrocq88MFnE4j9gWd2Yvp7Pdpg8Ct9mHv40%3DFHTDg2g%40mail.gmail.com.