Hi there,

On 20/03/14 03:01, Li Zhiguo wrote:
Dear developers:
           I am a student. My name is Li Zhiguo. Recently, I have been researching
how to develop a SPARQL query engine that will run over Hadoop clusters. I
plan to use the Jena API and make some extensions to ARQ, but I don't know how to
begin. Has any of you done work in this direction?
          What should I do first?

(I have not done a Hadoop implementation)

A few thoughts:

** A plan

* Time-Resources-Functionality

These 3 dimensions bound what you can do. How much time do you have? What resources do you have (i.e. people - I guess just you)? What functionality do you want?

Choose 2 of 3 - the third aspect is then fixed.

* Define the problem you are going to solve. Is it to show the possibilities of different implementations, or is it to build a system for a particular use case? Do you have (a lot of) data?

* A quick look around at other work (I see you've found some papers already) to see what's been tried.

There have been several experimental systems using Hadoop, Cassandra, Accumulo and other NoSQL/BigData stores. A survey of those, to see what they did (and why), is worthwhile. At least know what they have done in general principle; the deep detail matters less.

** A note of caution:

The Hadoop world is changing. MapReduce is not the only way to use a cluster.

+ Look at Apache Spark - mapping SPARQL to RDD operations looks like an interesting route to consider.

+ At least know about YARN in Hadoop2 - Hadoop is being split into YARN (a distributed operating system scheduler) with MapReduce being just one application framework. YARN does not solve your problem by itself, but it is the direction Hadoop is going in.

+ Have at least some familiarity with what the SQL-on-Hadoop world is doing - SPARQL is sufficiently similar to SQL that approaches for SQL execution are very likely to apply to SPARQL.
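
To make the Spark point above concrete: a SPARQL join of two triple patterns maps naturally onto key-by-shared-variable followed by a join. The sketch below uses plain Java streams as a stand-in for Spark RDDs (so it runs without a Spark dependency); the `Triple` record and the sample data are invented for illustration, and in real Spark the `groupingBy`/join steps would be `keyBy` and `join` on RDDs.

```java
import java.util.*;
import java.util.stream.*;

public class SparqlJoinSketch {
    // A triple as a plain (subject, predicate, object) record; invented for illustration.
    record Triple(String s, String p, String o) {}

    // BGP { ?x knows ?y . ?y age ?a } executed as key-by-?y plus a join.
    static List<String> run() {
        List<Triple> data = List.of(
            new Triple("alice", "knows", "bob"),
            new Triple("bob",   "knows", "carol"),
            new Triple("alice", "age",   "30"),
            new Triple("bob",   "age",   "25"));

        // Solutions of "?x knows ?y", keyed by the shared variable ?y.
        // In Spark this would be roughly rdd.filter(...).keyBy(...).
        Map<String, List<Triple>> knowsByY = data.stream()
            .filter(t -> t.p().equals("knows"))
            .collect(Collectors.groupingBy(Triple::o));

        // Solutions of "?y age ?a", keyed by ?y.
        Map<String, List<Triple>> ageByY = data.stream()
            .filter(t -> t.p().equals("age"))
            .collect(Collectors.groupingBy(Triple::s));

        // The join on ?y: emit one "?x ?y ?a" row per matching pair.
        return knowsByY.entrySet().stream()
            .filter(e -> ageByY.containsKey(e.getKey()))
            .flatMap(e -> e.getValue().stream()
                .flatMap(k -> ageByY.get(e.getKey()).stream()
                    .map(a -> k.s() + " " + e.getKey() + " " + a.o())))
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);   // prints: alice bob 25
    }
}
```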


** Once you have a design, then look at how to use the Jena API. The design
should not be distorted just to fit the API.

I would expect you will want to extend OpExecutor, which is the general SPARQL execution class in ARQ.

If you can implement OpFilter and OpBGP execution, you get a certain degree of scale (in particular, a filter over a basic graph pattern - that's the main building block).
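
As a Jena-free illustration of that building block, here is a minimal sketch: match one triple pattern against in-memory data to produce variable bindings, then apply a filter predicate to each binding. All names (`Triple`, `match`, `bind`, the sample data) are invented for the sketch; a real implementation would execute ARQ's OpBGP and OpFilter against a distributed store instead.

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

public class BgpFilterSketch {
    record Triple(String s, String p, String o) {}

    // Terms starting with '?' are variables; anything else must match exactly.
    static boolean isVar(String term) { return term.startsWith("?"); }

    // Match one triple pattern, returning one binding map per matching triple.
    static List<Map<String, String>> match(List<Triple> data,
                                           String s, String p, String o) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Triple t : data) {
            Map<String, String> b = new LinkedHashMap<>();
            if (bind(b, s, t.s()) && bind(b, p, t.p()) && bind(b, o, t.o()))
                out.add(b);
        }
        return out;
    }

    // Bind a variable, or check consistency if it is already bound.
    static boolean bind(Map<String, String> b, String term, String value) {
        if (!isVar(term)) return term.equals(value);
        String bound = b.get(term);
        if (bound == null) { b.put(term, value); return true; }
        return bound.equals(value);
    }

    // BGP { ?x age ?a } FILTER (?a > 26)
    static List<Map<String, String>> run() {
        List<Triple> data = List.of(
            new Triple("alice", "age", "30"),
            new Triple("bob",   "age", "25"));

        Predicate<Map<String, String>> filter =
            b -> Integer.parseInt(b.get("?a")) > 26;

        return match(data, "?x", "age", "?a").stream()
            .filter(filter)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run());   // prints: [{?x=alice, ?a=30}]
    }
}
```

The point of the shape: pattern matching produces a stream of bindings, and the filter is just a predicate over bindings, which is exactly what makes the combination easy to push down to wherever the data lives.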

If you want to go further (e.g. efficient group operations), then it can be done incrementally on top of that.

There is some experimental code elsewhere [1] with slightly better abstractions for extension.

But get the design in place first.




Let us know how you get on. I'm sure people on this list will be interested.

        Andy

[1] My GitHub account.


