I've been thinking about RDF processing in Hadoop (https://github.com/paulhoule/infovore); other projects to be aware of are http://www.sindicetech.com/ and http://vital.ai/

Infovore development so far has been fairly ad hoc: the goals have been to solve some specific problems in front of me and to get a better understanding of the Hadoop+RDF situation. Lately I've been thinking about what a next-generation system (Infovore 3 or 4) would look like.

Older versions of Infovore gathered all triples with a given ?s into an in-memory Jena model in the reducer, after which you can do whatever you want with the model. I was happy with the performance I got when using models without inference, but for my workload (700 million triples on a single machine) I couldn't afford to use inference with Jena. There is a wide range of similar scenarios where you create partial graphs of various sorts and run SPARQL queries over them; the trouble is that, because you're running on a partial graph, you can't do arbitrary queries.

If you want to write something that does general SPARQL queries, I'd picture a system that works a lot like Pig or Hive. The big difference from those is that you implement the RDF data model instead of the data models implemented by Pig and Hive. The basic data structure, I think, would be a tuple of RDF nodes (i.e., a row of a SPARQL result set); a triple is then just a tuple of three nodes, and a quad a tuple of four.

For efficiency's sake, such a system should have some flexibility of typing. It ought to be possible to have a column that can hold any kind of node (which means type information has to be embedded in the row), but also a column we know is a URI, or one we know is an integer, so that we don't need to encode type information in the row or go through the process of converting an ASCII string to a number.
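To make the typed-tuple idea concrete, here is a minimal sketch in plain Java. All the names (NodeKind, RdfNode, RdfTuple) are invented for illustration, not from Jena or any existing library; the point is just that a column's kind can be fixed in the schema (no per-row tag needed) or left open (the node carries its own tag).

```java
// Hypothetical sketch of a tuple-of-RDF-nodes row type. A null schema entry
// means "any kind of node", so the type tag lives in the node itself; a
// non-null entry fixes the column's kind, so no per-row tag is required.
public class TupleDemo {
    enum NodeKind { URI, LITERAL, BLANK, INTEGER }

    static final class RdfNode {
        final NodeKind kind;
        final Object value;   // String for URI/LITERAL/BLANK, Long for INTEGER
        RdfNode(NodeKind kind, Object value) { this.kind = kind; this.value = value; }
    }

    static final class RdfTuple {
        final NodeKind[] schema;   // null entry = column accepts any node kind
        final RdfNode[] nodes;
        RdfTuple(NodeKind[] schema, RdfNode[] nodes) {
            for (int i = 0; i < schema.length; i++)
                if (schema[i] != null && nodes[i].kind != schema[i])
                    throw new IllegalArgumentException("column " + i + " violates schema");
            this.schema = schema;
            this.nodes = nodes;
        }
    }

    public static void main(String[] args) {
        // A triple is a three-column tuple: here ?s and ?p are known URIs,
        // while ?o may be any kind of node and so carries its own type tag.
        RdfTuple t = new RdfTuple(
            new NodeKind[] { NodeKind.URI, NodeKind.URI, null },
            new RdfNode[] {
                new RdfNode(NodeKind.URI, "http://example.org/s"),
                new RdfNode(NodeKind.URI, "http://example.org/p"),
                new RdfNode(NodeKind.INTEGER, 42L) });
        System.out.println("object kind: " + t.nodes[2].kind);
    }
}
```

A real system would back the fixed-kind columns with primitive arrays or raw bytes rather than boxed objects; this sketch only shows where the type information lives.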
The current Hadoop API has a data type called a Writable that, in theory, would let you write an RDF tuple parser that uses pre-allocated memory. That ought to give a big performance improvement, because you're not converting UTF-8 to UTF-16 and back, and you're not allocating large numbers of small Strings that cost time to allocate and later burden the garbage collector. The benefits are particularly strong for operations that don't change all the tuples: if you're testing on the predicate, for instance, and passing the ?s and ?o through unchanged, a system like this doesn't have to waste time copying them.

The trick is that Writables are mutable, and it is tempting for the framework to do tricks such as splitting a Text without copying it, by pointing different indexes into the same byte array; very bad things will happen if people using the framework don't follow the rules. For a system that generates the code for the steps, however, you can work around this, although people who write that kind of system often choose copy-happy strategies, since those are easier to reason about.

Andy is right about Hadoop 2, but I'd note that YARN itself is a low-level API. The classic Map/Reduce API has been reimplemented on top of YARN, and the evolution path for application writers is that we'll migrate to other APIs implemented on top of YARN. The one that looks best to me is Apache Tez (http://tez.incubator.apache.org/), which deals with the fact that the M/R model isn't the right model for every job. For instance, sometimes you want to Map, then Reduce, then Reduce again; or you want a Map that creates three different data streams and sends them to three different reducers; and so on.
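The zero-copy splitting trick described above can be sketched in a few lines of plain Java (this is not Hadoop's actual Text/Writable code, just an illustration of the idea): a reusable "slice" points into a shared byte array, so splitting a line into fields allocates nothing and decodes nothing until a field is actually needed.

```java
import java.nio.charset.StandardCharsets;

// Illustration of the mutable, zero-copy field-splitting trick: slices alias
// the backing byte array, so callers must copy before the buffer is reused.
public class SliceDemo {
    static final class ByteSlice {
        byte[] buf; int off; int len;   // mutable on purpose, Writable-style
        void set(byte[] buf, int off, int len) { this.buf = buf; this.off = off; this.len = len; }
        String decode() { return new String(buf, off, len, StandardCharsets.UTF_8); }
    }

    // Split one tab-separated s/p/o line into three slices without copying.
    static void split3(byte[] line, ByteSlice s, ByteSlice p, ByteSlice o) {
        int t1 = indexOf(line, (byte) '\t', 0);
        int t2 = indexOf(line, (byte) '\t', t1 + 1);
        s.set(line, 0, t1);
        p.set(line, t1 + 1, t2 - t1 - 1);
        o.set(line, t2 + 1, line.length - t2 - 1);
    }

    static int indexOf(byte[] b, byte c, int from) {
        for (int i = from; i < b.length; i++) if (b[i] == c) return i;
        return -1;
    }

    public static void main(String[] args) {
        byte[] line = "<s>\t<p>\t\"42\"".getBytes(StandardCharsets.UTF_8);
        ByteSlice s = new ByteSlice(), p = new ByteSlice(), o = new ByteSlice();
        split3(line, s, p, o);
        // A predicate test touches only ?p; ?s and ?o are never copied or decoded.
        System.out.println(p.decode().equals("<p>") ? "predicate matched" : "no match");
    }
}
```

The danger is exactly the one described in the text: if the framework reuses `line` for the next record while a consumer still holds a slice, the slice silently changes underneath it.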
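The "one map, three streams, three reducers" data flow can be shown with a toy in-memory simulation (this is deliberately not Tez API code, just the shape of the DAG): one map stage routes each record to a named output stream, and a different reduce function consumes each stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy illustration of a DAG that classic M/R expresses awkwardly: a single
// map stage feeding three different reducers over three output streams.
public class DagDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> streams = new HashMap<>();
        streams.put("small", new ArrayList<>());
        streams.put("medium", new ArrayList<>());
        streams.put("large", new ArrayList<>());

        // "Map" stage: route each record to one of three streams by size.
        for (int x : new int[] { 1, 5, 12, 3, 40, 7 })
            streams.get(x < 4 ? "small" : x < 10 ? "medium" : "large").add(x);

        // Three "reduce" stages, one per stream, each with different logic.
        Map<String, Function<List<Integer>, Integer>> reducers = new HashMap<>();
        reducers.put("small",  l -> l.size());                                    // count
        reducers.put("medium", l -> l.stream().mapToInt(i -> i).sum());           // sum
        reducers.put("large",  l -> l.stream().mapToInt(i -> i).max().orElse(0)); // max

        for (String name : new String[] { "small", "medium", "large" })
            System.out.println(name + " -> " + reducers.get(name).apply(streams.get(name)));
    }
}
```

In classic M/R you would express this as one job per reducer (reading the map output three times) or by tagging records and demultiplexing inside a single reducer; a DAG engine like Tez lets you declare the fan-out directly.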
You can definitely break any data flow into M/R steps and ship intermediate results between them, but Tez generalizes the M/R model so you get the data flow you want directly, with high efficiency. Thus, developing something new in 2014Q2, it makes sense to start with Tez.

Anyhow, I am very interested in collaborating on this, because I understand some parts of it very well and other parts very little.

On Thu, Mar 20, 2014 at 6:24 AM, Andy Seaborne <[email protected]> wrote:
> Hi there,
>
> On 20/03/14 03:01, 彼岸 wrote:
>>
>> Dear developers:
>> I am a student. My name is Li Zhiguo. Recently, I am researching how to
>> develop a SPARQL query engine which will run over Hadoop clusters. I plan
>> to use the Jena API and do some extensions on ARQ, but I don't know how
>> to begin my plan. Has anyone here done some work in this direction?
>> What should I do first?
>
> (I have not done a Hadoop implementation)
>
> A few thoughts:
>
> ** A plan
>
> * Time-Resources-Functionality
>
> These 3 dimensions bound what you can do. How much time do you have? What
> resources do you have (i.e. people - I guess just you)? What functionality
> do you want?
>
> Choose 2 of 3 - the third aspect is then fixed.
>
> * Define the problem you are going to solve. Is it to show possibilities
> of different implementations, or is it to build a system to solve a
> particular use case? Do you have (a lot of) data?
>
> * Take a quick look around at other work (I see you've found some papers
> already) to see what's been tried.
>
> There have been several experimental systems using Hadoop, Cassandra,
> Accumulo and other NoSQL/BigData stores. A survey of those to see what
> they did (and why). At least know what they've done in general principle,
> not the deep detail.
>
> ** A note of caution:
>
> The Hadoop world is changing. MapReduce is not the only way to use a
> cluster.
>
> + Look at Apache Spark - mapping SPARQL to RDD operations looks like an
> interesting route to consider.
>
> + At least know about YARN in Hadoop 2 - Hadoop is being split into YARN
> (a distributed operating system scheduler) with MapReduce being just one
> application framework. It does not solve the problem - it's the direction
> Hadoop is going in.
>
> + Have at least some familiarity with what the SQL-on-Hadoop world is
> doing - SPARQL is sufficiently similar to SQL that approaches for SQL
> execution are very likely to apply to SPARQL.
>
> ** Once you have a design, then look at how to use the Jena API. The
> design should not be distorted just to fit the API.
>
> I would expect you will want to extend OpExecutor, which is the general
> SPARQL execution class.
>
> If you can implement OpFilter and OpBGP execution you get a certain
> degree of scale (and particularly a filter over a basic graph pattern -
> it's the main building block).
>
> If you want to go further (e.g. efficient group operations), then it can
> be done incrementally on top of that.
>
> There is some experimental code elsewhere [1] with slightly better
> abstractions for extension.
>
> But get the design in place first.
>
>> Best wishes to you all!
>
> Let us know how you get on. I'm sure people on this list will be
> interested.
>
> Andy
>
> [1] My GitHub account.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254
paul.houle on Skype
[email protected]
