Re: One question about distributed ARQ

Andy Seaborne Thu, 20 Mar 2014 03:25:25 -0700

Hi there,

On 20/03/14 03:01, ???? wrote:

Dear developers:
           I am a student. My name is Li Zhiguo. Recently, I have been researching 
how to develop a SPARQL query engine that runs over Hadoop clusters. I 
plan to use the Jena API and write some extensions to ARQ, but I don't know how to 
begin. Has anyone here done any work in this direction?
          What should I do first?


(I have not done a Hadoop implementation)

A few thoughts:

** A plan

* Time-Resources-Functionality

These 3 dimensions bound what you can do. How much time do you have? What resources do you have (i.e. people - I guess just you)? What functionality do you want?


Choose 2 of 3 - the third aspect is then fixed.

* Define the problem you are going to solve. Is it to show possibilities of different implementations or is it to build a system to solve a particular use case? Do you have (a lot of) data?

* A quick look around at other work (I see you've found some papers already) to see what's been tried.

There have been several experimental systems using Hadoop, Cassandra, Accumulo and other NoSQL/BigData stores. Survey those to see what they did (and why). At least know what they've done in general principle; not the deep detail.


** A note of caution:

The Hadoop world is changing. MapReduce is not the only way to use a cluster.

+ Look at Apache Spark - mapping SPARQL to RDD operations looks like an interesting route to consider.

+ At least know about YARN in Hadoop2 - Hadoop is being split into YARN (a distributed operating system scheduler) with MapReduce being just one application framework. It does not solve the problem, but it's the direction Hadoop is going in.

+ Have at least some familiarity with what the SQL-on-Hadoop world is doing - SPARQL is sufficiently similar to SQL that approaches for SQL execution are very likely to apply to SPARQL.
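
To make the Spark point above a little more concrete, here is a rough sketch of how a basic graph pattern breaks down into filter, map and join steps, which is the shape of work RDD-style operations are good at. It is plain Python over in-memory lists, not Spark code; the names `scan` and `join` and the toy data are made up for illustration.

```python
# Toy data: (subject, predicate, object) triples as plain tuples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", 30),
    ("bob", "age", 17),
    ("carol", "age", 42),
]

def scan(triples, predicate, s_var, o_var):
    """Triple pattern { ?s <predicate> ?o } as a filter step plus a map step."""
    return [{s_var: s, o_var: o} for (s, p, o) in triples if p == predicate]

def join(left, right, var):
    """Hash join on one shared variable (key both sides by it, then combine)."""
    index = {}
    for row in right:
        index.setdefault(row[var], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[var], [])]

# BGP: { ?x knows ?y . ?y age ?a }
knows = scan(TRIPLES, "knows", "x", "y")
ages = scan(TRIPLES, "age", "y", "a")
result = join(knows, ages, "y")
```

On a cluster, each `scan` would be a partition-local filter and the `join` would be the step that moves data between nodes, so the join is where most of the design effort is likely to go.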



** Once you have a design, then look at how to use the Jena API. The design
should not be distorted just to fit the API.

I would expect you will want to extend OpExecutor which is the general SPARQL execution class.

If you can implement OpFilter and OpBGP execution you get a certain degree of scale (and particularly a filter over a basic graph pattern - it's the main building block).
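
As a rough illustration of that building block, the sketch below evaluates a filter over a basic graph pattern by dispatching on the operator type. It is a toy in plain Python, not the Jena API: the classes borrow the names OpBGP and OpFilter only to show the shape, and `ToyExecutor`, `match` and the data are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class OpBGP:
    patterns: list      # (s, p, o) tuples; strings starting with "?" are variables

@dataclass
class OpFilter:
    condition: object   # callable: binding dict -> bool
    sub_op: object      # the operator whose results are filtered

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, or None."""
    out = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if out.setdefault(term, value) != value:
                return None     # variable already bound to something else
        elif term != value:
            return None         # constant term does not match
    return out

class ToyExecutor:
    def __init__(self, triples):
        self.triples = triples

    def execute(self, op, input_bindings):
        # One execute method per operator type; anything else is unsupported.
        if isinstance(op, OpBGP):
            return self.execute_bgp(op, input_bindings)
        if isinstance(op, OpFilter):
            return self.execute_filter(op, input_bindings)
        raise NotImplementedError(type(op).__name__)

    def execute_bgp(self, op, input_bindings):
        bindings = input_bindings
        for pattern in op.patterns:     # chain one matching step per pattern
            bindings = self._step(pattern, bindings)
        return bindings

    def _step(self, pattern, bindings):
        for b in bindings:
            for t in self.triples:
                ext = match(pattern, t, b)
                if ext is not None:
                    yield ext

    def execute_filter(self, op, input_bindings):
        sub = self.execute(op.sub_op, input_bindings)
        return (b for b in sub if op.condition(b))

triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
           ("bob", "age", 17), ("carol", "age", 42)]
# { ?x knows ?y . ?y age ?a  FILTER(?a >= 18) }
op = OpFilter(lambda b: b["?a"] >= 18,
              OpBGP([("?x", "knows", "?y"), ("?y", "age", "?a")]))
rows = list(ToyExecutor(triples).execute(op, [{}]))
```

In a distributed design, the scan over `self.triples` inside `_step` is the part to replace with a lookup against the cluster store; the dispatch structure can stay the same.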

If you want to go further (e.g. efficient group operations), then it can be done incrementally on top of that.

There is some experimental code elsewhere [1] with slightly better abstractions for extension.


But get the design in place first.


          Best wishes to you all !

Let us know how you get on. I'm sure people on this list will be interested.


        Andy

[1] My GitHub account.
