I've been thinking about RDF processing in Hadoop (https://github.com/paulhoule/infovore); other projects to be aware of are http://www.sindicetech.com/ and http://vital.ai/

Infovore development so far has been fairly ad hoc: the goals have been to solve some specific problems in front of me and to get a better understanding of the Hadoop+RDF situation. Lately I've been thinking about what a next-generation system (Infovore 3 or 4) would look like.

Older versions of Infovore gathered all triples with a given ?s into an in-memory Jena model in the reducer, after which you can do whatever you want with the model. I was happy with the performance I got when using models without inference, but for my workload (700 million triples on a single machine) I couldn't afford to use inference with Jena. There is a wide range of similar scenarios where you create partial graphs of various sorts and run SPARQL queries over them; the trouble is that, because you're running on a partial graph, you can't do arbitrary queries.

If you want to write something that does general SPARQL queries, I'd picture a system that works a lot like Pig or Hive. The big difference from those is that you implement the RDF data model instead of the data models implemented by Pig and Hive. The basic data structure, I think, would be a tuple of RDF nodes (i.e., a row of a SPARQL result set); a triple is then just a tuple of three nodes, and a quad a tuple of four.

For efficiency's sake, such a system should have some flexibility of typing. It ought to be possible to have a column that can hold any kind of node (which means type information has to be embedded in the row), but also a column we know is a URI, or one we know is an integer, so that we don't need to encode type information in the row or go through the process of converting an ASCII string to a number.
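To make the typed-tuple idea concrete, here is a minimal sketch in plain Java. All the names (NodeKind, RdfNode, RdfTuple) are invented for illustration, not from Jena or any existing library; the point is just that a column's kind can be fixed in the schema (no per-row tag needed) or left open (the node carries its own tag).

```java
// Hypothetical sketch of a tuple-of-RDF-nodes row type. A null schema entry
// means "any kind of node", so the type tag lives in the node itself; a
// non-null entry fixes the column's kind, so no per-row tag is required.
public class TupleDemo {
    enum NodeKind { URI, LITERAL, BLANK, INTEGER }

    static final class RdfNode {
        final NodeKind kind;
        final Object value;   // String for URI/LITERAL/BLANK, Long for INTEGER
        RdfNode(NodeKind kind, Object value) { this.kind = kind; this.value = value; }
    }

    static final class RdfTuple {
        final NodeKind[] schema;   // null entry = column accepts any node kind
        final RdfNode[] nodes;
        RdfTuple(NodeKind[] schema, RdfNode[] nodes) {
            for (int i = 0; i < schema.length; i++)
                if (schema[i] != null && nodes[i].kind != schema[i])
                    throw new IllegalArgumentException("column " + i + " violates schema");
            this.schema = schema;
            this.nodes = nodes;
        }
    }

    public static void main(String[] args) {
        // A triple is a three-column tuple: here ?s and ?p are known URIs,
        // while ?o may be any kind of node and so carries its own type tag.
        RdfTuple t = new RdfTuple(
            new NodeKind[] { NodeKind.URI, NodeKind.URI, null },
            new RdfNode[] {
                new RdfNode(NodeKind.URI, "http://example.org/s"),
                new RdfNode(NodeKind.URI, "http://example.org/p"),
                new RdfNode(NodeKind.INTEGER, 42L) });
        System.out.println("object kind: " + t.nodes[2].kind);
    }
}
```

A real system would back the fixed-kind columns with primitive arrays or raw bytes rather than boxed objects; this sketch only shows where the type information lives.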
The current Hadoop API has a data type called a Writable that, in theory, would let you write an RDF tuple parser that uses pre-allocated memory. That ought to give a big performance improvement, because you're not converting UTF-8 to UTF-16 and back, and you're not allocating large numbers of small Strings that cost time to allocate and later burden the garbage collector. The benefits are particularly strong for operations that don't change all the tuples: if you're testing on the predicate, for instance, and passing the ?s and ?o through unchanged, a system like this doesn't have to waste time copying them.

The trick is that Writables are mutable, and it is tempting for the framework to do tricks such as splitting a Text without copying it, by pointing different indexes into the same byte array; very bad things will happen if people using the framework don't follow the rules. For a system that generates the code for the steps, however, you can work around this, although people who write that kind of system often choose copy-happy strategies, since those are easier to reason about.

Andy is right about Hadoop 2, but I'd note that YARN itself is a low-level API. The classic Map/Reduce API has been reimplemented on top of YARN, and the evolution path for application writers is that we'll migrate to other APIs implemented on top of YARN. The one that looks best to me is Apache Tez (http://tez.incubator.apache.org/), which deals with the fact that the M/R model isn't the right model for every job. For instance, sometimes you want to Map, then Reduce, then Reduce again; or you want a Map that creates three different data streams and sends them to three different reducers; and so on.
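The zero-copy splitting trick described above can be sketched in a few lines of plain Java (this is not Hadoop's actual Text/Writable code, just an illustration of the idea): a reusable "slice" points into a shared byte array, so splitting a line into fields allocates nothing and decodes nothing until a field is actually needed.

```java
import java.nio.charset.StandardCharsets;

// Illustration of the mutable, zero-copy field-splitting trick: slices alias
// the backing byte array, so callers must copy before the buffer is reused.
public class SliceDemo {
    static final class ByteSlice {
        byte[] buf; int off; int len;   // mutable on purpose, Writable-style
        void set(byte[] buf, int off, int len) { this.buf = buf; this.off = off; this.len = len; }
        String decode() { return new String(buf, off, len, StandardCharsets.UTF_8); }
    }

    // Split one tab-separated s/p/o line into three slices without copying.
    static void split3(byte[] line, ByteSlice s, ByteSlice p, ByteSlice o) {
        int t1 = indexOf(line, (byte) '\t', 0);
        int t2 = indexOf(line, (byte) '\t', t1 + 1);
        s.set(line, 0, t1);
        p.set(line, t1 + 1, t2 - t1 - 1);
        o.set(line, t2 + 1, line.length - t2 - 1);
    }

    static int indexOf(byte[] b, byte c, int from) {
        for (int i = from; i < b.length; i++) if (b[i] == c) return i;
        return -1;
    }

    public static void main(String[] args) {
        byte[] line = "<s>\t<p>\t\"42\"".getBytes(StandardCharsets.UTF_8);
        ByteSlice s = new ByteSlice(), p = new ByteSlice(), o = new ByteSlice();
        split3(line, s, p, o);
        // A predicate test touches only ?p; ?s and ?o are never copied or decoded.
        System.out.println(p.decode().equals("<p>") ? "predicate matched" : "no match");
    }
}
```

The danger is exactly the one described in the text: if the framework reuses `line` for the next record while a consumer still holds a slice, the slice silently changes underneath it.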
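The "one map, three streams, three reducers" data flow can be shown with a toy in-memory simulation (this is deliberately not Tez API code, just the shape of the DAG): one map stage routes each record to a named output stream, and a different reduce function consumes each stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy illustration of a DAG that classic M/R expresses awkwardly: a single
// map stage feeding three different reducers over three output streams.
public class DagDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> streams = new HashMap<>();
        streams.put("small", new ArrayList<>());
        streams.put("medium", new ArrayList<>());
        streams.put("large", new ArrayList<>());

        // "Map" stage: route each record to one of three streams by size.
        for (int x : new int[] { 1, 5, 12, 3, 40, 7 })
            streams.get(x < 4 ? "small" : x < 10 ? "medium" : "large").add(x);

        // Three "reduce" stages, one per stream, each with different logic.
        Map<String, Function<List<Integer>, Integer>> reducers = new HashMap<>();
        reducers.put("small",  l -> l.size());                                    // count
        reducers.put("medium", l -> l.stream().mapToInt(i -> i).sum());           // sum
        reducers.put("large",  l -> l.stream().mapToInt(i -> i).max().orElse(0)); // max

        for (String name : new String[] { "small", "medium", "large" })
            System.out.println(name + " -> " + reducers.get(name).apply(streams.get(name)));
    }
}
```

In classic M/R you would express this as one job per reducer (reading the map output three times) or by tagging records and demultiplexing inside a single reducer; a DAG engine like Tez lets you declare the fan-out directly.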
You can definitely break any data flow into M/R steps and ship intermediate results between them, but Tez generalizes the M/R model so you get the data flow you want directly, with high efficiency. Thus, developing something new in 2014Q2, it makes sense to start with Tez.

Anyhow, I am very interested in collaborating on this, because I understand some parts of it very well and other parts very little.

On Thu, Mar 20, 2014 at 6:24 AM, Andy Seaborne <[email protected]> wrote:
> Hi there,
>
> On 20/03/14 03:01, 彼岸 wrote:
>>
>> Dear developers:
>> I am a student. My name is Li Zhiguo. Recently, I am researching how to
>> develop a SPARQL query engine which will run over Hadoop clusters. I plan
>> to use the Jena API and do some extensions on ARQ, but I don't know how
>> to begin my plan. Has anyone here done some work in this direction?
>> What should I do first?
>
> (I have not done a Hadoop implementation)
>
> A few thoughts:
>
> ** A plan
>
> * Time-Resources-Functionality
>
> These 3 dimensions bound what you can do. How much time do you have? What
> resources do you have (i.e. people - I guess just you)? What functionality
> do you want?
>
> Choose 2 of 3 - the third aspect is then fixed.
>
> * Define the problem you are going to solve. Is it to show possibilities
> of different implementations, or is it to build a system to solve a
> particular use case? Do you have (a lot of) data?
>
> * Take a quick look around at other work (I see you've found some papers
> already) to see what's been tried.
>
> There have been several experimental systems using Hadoop, Cassandra,
> Accumulo and other NoSQL/BigData stores. A survey of those to see what
> they did (and why). At least know what they've done in general principle,
> not the deep detail.
>
> ** A note of caution:
>
> The Hadoop world is changing. MapReduce is not the only way to use a
> cluster.
>
> + Look at Apache Spark - mapping SPARQL to RDD operations looks like an
> interesting route to consider.
>
> + At least know about YARN in Hadoop 2 - Hadoop is being split into YARN
> (a distributed operating system scheduler) with MapReduce being just one
> application framework. It does not solve the problem - it's the direction
> Hadoop is going in.
>
> + Have at least some familiarity with what the SQL-on-Hadoop world is
> doing - SPARQL is sufficiently similar to SQL that approaches for SQL
> execution are very likely to apply to SPARQL.
>
> ** Once you have a design, then look at how to use the Jena API. The
> design should not be distorted just to fit the API.
>
> I would expect you will want to extend OpExecutor, which is the general
> SPARQL execution class.
>
> If you can implement OpFilter and OpBGP execution you get a certain
> degree of scale (and particularly a filter over a basic graph pattern -
> it's the main building block).
>
> If you want to go further (e.g. efficient group operations), then it can
> be done incrementally on top of that.
>
> There is some experimental code elsewhere [1] with slightly better
> abstractions for extension.
>
> But get the design in place first.
>
>> Best wishes to you all!
>
> Let us know how you get on. I'm sure people on this list will be
> interested.
>
> Andy
>
> [1] My GitHub account.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254
paul.houle on Skype
[email protected]
