Hi,

From our experience with 6 billion triples, the MapReduce job for computing prospects is quite heavy; we were not able to complete it with the resources we had. So we decided to give Apache Beam (on Google Dataflow) a chance. We've built a prototype pipeline, which you can find at http://github.com/DataFabricRus/rya-beam-pipelines. It's still an early version, but it works.
We had to slightly change the design of the prospect querying algorithm so that we could employ the summing combiner on the Accumulo side. Therefore, to fully use the pipeline you need the patched version of Rya, which is available at https://github.com/DataFabricRus/incubator-rya/tree/datafabric-patches.

The pipeline steps:
1) load the triples from the SPO table,
2) convert the triples to intermediate prospects,
3) split the prospects into batches of the given size,
4) aggregate each batch separately and write the results as mutations to the Prospect table.

After the steps above we have partially aggregated prospects in the table.

5) Later, on major and minor compaction, or on scan, the summing combiner finishes the aggregation of the prospects.

Using the summing combiner will later allow us to compute the prospects on INSERT/DELETE queries: we'll only need to write a mutation with a value of 1 or -1 for an INSERT or a DELETE, respectively.

Maxim Kolchin
E-mail: [email protected]
Tel.: +7 (911) 199-55-73
Homepage: http://kolchinmax.ru
