Hi,

From our experience with 6 billion triples, the MapReduce job for computing prospects is quite heavy; we were not able to complete it with the resources we had. So we decided to give Apache Beam (on Google Dataflow) a chance. We've built a prototype pipeline, which you can find at http://github.com/DataFabricRus/rya-beam-pipelines. It's still an early version, but it works.
We had to slightly change the design of the prospect querying algorithm so that we could employ the summing combiner on the Accumulo side. Therefore, to fully use the pipeline you need the patched version of Rya, which is available at https://github.com/DataFabricRus/incubator-rya/tree/datafabric-patches.

The pipeline steps:
1) load the triples from the SPO table,
2) convert the triples to intermediate prospects,
3) split the prospects into batches of the given size,
4) aggregate each batch separately and write the results as mutations to the Prospect table.

After the steps above we have partially aggregated prospects in the table.

5) Later, on major and minor compaction, or on scan, the summing combiner finishes the aggregation of the prospects.

Using the summing combiner will later allow us to compute the prospects on INSERT/DELETE queries: we'll only need to write a mutation with a value of 1 or -1 for an INSERT or a DELETE, respectively.

Maxim Kolchin
E-mail: [email protected]
Tel.: +7 (911) 199-55-73
Homepage: http://kolchinmax.ru
