Hi there,
On 20/03/14 03:01, ???? wrote:
> Dear developers:
> I am a student. My name is Li Zhiguo. Recently, I have been researching
> how to develop a SPARQL query engine that will run over Hadoop clusters. I
> plan to use the Jena API and write some extensions to ARQ, but I don't know
> how to begin. Has anyone here done some work in this direction?
> What should I do first?
(I have not done a Hadoop implementation)
A few thoughts:
** A plan
* Time-Resources-Functionality
These 3 dimensions bound what you can do. How much time do you have?
What resources do you have (i.e. people - I guess just you)? What
functionality do you want?
Choose 2 of 3 - the third aspect is then fixed.
* Define the problem you are going to solve. Is it to show
possibilities of different implementations or is it to build a system to
solve a particular use case? Do you have (a lot of) data?
* A quick look around at other work (I see you've found some papers
already) to see what's been tried.
There have been several experimental systems using Hadoop, Cassandra,
Accumulo and other NoSQL/BigData stores. Survey those to see what
they did (and why). At least know what they've done in general
principle, not in deep detail.
** A note of caution:
The Hadoop world is changing. MapReduce is not the only way to use a
cluster.
+ Look at Apache Spark - mapping SPARQL to RDD operations looks like an
interesting route to consider.
+ At least know about YARN in Hadoop 2 - Hadoop is being split into YARN
(a distributed operating system scheduler) with MapReduce being just one
application framework. YARN does not solve your problem by itself, but
it's the direction Hadoop is going in.
+ Have at least some familiarity with what the SQL-on-Hadoop world is
doing - SPARQL is sufficiently similar to SQL that approaches for SQL
execution are very likely to apply to SPARQL.
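To make the dataflow idea a little more concrete, here is a minimal sketch
in plain Python (not Spark, and not Jena) of a basic graph pattern phrased
as SQL-like steps: each triple pattern becomes a scan that produces
bindings, patterns are combined with a hash join on their shared variables,
and a filter runs over the result. On RDDs the same shape would presumably
map onto filter/map/join operations. The data, the "?var" string convention
and every function name below are assumptions made up for illustration.

```python
# Toy data: (subject, predicate, object) triples, invented for this sketch.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", 30),
    ("bob", "age", 25),
    ("carol", "age", 41),
]


def is_var(term):
    """Variables are strings starting with '?' (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triples):
    """Scan step: turn one triple pattern into a list of bindings (dicts)."""
    results = []
    for triple in triples:
        binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice in a pattern must match the same value.
                if binding.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            results.append(binding)
    return results


def join(left, right):
    """Join step: hash join of two binding lists on their shared variables."""
    if not left or not right:
        return []
    # Every binding in one list has the same variables, so inspect the first.
    shared = sorted(set(left[0]) & set(right[0]))
    index = {}
    for b in right:
        index.setdefault(tuple(b[v] for v in shared), []).append(b)
    return [
        {**l, **r}
        for l in left
        for r in index.get(tuple(l[v] for v in shared), [])
    ]


def eval_bgp(patterns, triples):
    """Evaluate a basic graph pattern as a chain of joins."""
    bindings = [{}]  # a single empty binding acts as the join identity
    for pattern in patterns:
        bindings = join(bindings, match_pattern(pattern, triples))
    return bindings


# Roughly: SELECT * { ?x knows ?y . ?y age ?a . FILTER(?a > 30) }
bgp = [("?x", "knows", "?y"), ("?y", "age", "?a")]
rows = [b for b in eval_bgp(bgp, TRIPLES) if b["?a"] > 30]
```

In a cluster setting the interesting design questions are the ones this
sketch hides: how the triples are partitioned, and which join order and
join strategy keep data movement down.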
** Once you have a design, then look at how to use the Jena API. The design
should not be distorted just to fit the API.
I would expect you will want to extend OpExecutor, which is the general
SPARQL execution class.
If you can implement OpFilter and OpBGP execution, you get a certain
degree of scale. A filter over a basic graph pattern in particular is
the main building block.
If you want to go further (e.g. efficient group operations), then it can
be done incrementally on top of that.
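The incremental approach can be sketched roughly as follows. This is plain
Python, not the Jena API: the classes BGP, Filter and Executor are
hypothetical stand-ins in the spirit of OpBGP, OpFilter and OpExecutor, and
the in-memory triple list stands in for whatever the cluster storage turns
out to be. The point is only the shape: one execute method per operator,
each taking and returning a stream of bindings, so further operators can be
added later without touching the existing ones.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Binding = dict  # variable name -> value


@dataclass
class BGP:
    """Hypothetical algebra node: a list of triple patterns."""
    patterns: list


@dataclass
class Filter:
    """Hypothetical algebra node: keep bindings of `sub` that pass `test`."""
    test: Callable[[Binding], bool]
    sub: object


class Executor:
    """Dispatches on the operator type; one execute_* method per operator."""

    def __init__(self, triples):
        self.triples = triples

    def execute(self, op, inputs: Iterable[Binding]) -> Iterator[Binding]:
        method = getattr(self, "execute_" + type(op).__name__, None)
        if method is None:
            raise NotImplementedError(f"no execution for {type(op).__name__}")
        return method(op, inputs)

    def execute_BGP(self, op, inputs):
        # Feed the bindings through each triple pattern in turn.
        for pattern in op.patterns:
            inputs = self._extend(pattern, inputs)
        return iter(inputs)

    def _extend(self, pattern, inputs):
        """Extend each incoming binding with every triple matching the pattern."""
        for binding in inputs:
            for triple in self.triples:
                new = dict(binding)
                for p, t in zip(pattern, triple):
                    if isinstance(p, str) and p.startswith("?"):
                        # Bind the variable, or check it against its existing value.
                        if new.setdefault(p, t) != t:
                            break
                    elif p != t:
                        break
                else:
                    yield new

    def execute_Filter(self, op, inputs):
        return (b for b in self.execute(op.sub, inputs) if op.test(b))


# Invented toy data and query: a filter over a basic graph pattern.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "age", 25),
    ("carol", "age", 41),
]
query = Filter(lambda b: b["?a"] > 30,
               BGP([("?x", "knows", "?y"), ("?y", "age", "?a")]))
results = list(Executor(triples).execute(query, [{}]))
```

Adding, say, a grouping operator later would mean one new node class and
one new execute_* method; an operator without a method fails loudly with
NotImplementedError instead of returning wrong answers.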
There is some experimental code elsewhere [1] with slightly better
abstractions for extension.
But get the design in place first.
> Best wishes to you all!
Let us know how you get on. I'm sure people on this list will be
interested.
Andy
[1] My GitHub account.