Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by udanax: http://wiki.apache.org/lucene-hadoop/Hbase/RDF

New page:

[[TableOfContents(4)]]
----
= Hbase RDF Storage Subsystems =
We have started to think about storing and querying RDF and RDF Schema in Hbase.[[BR]]However, we will take this up last, after a careful investigation and a review of the Hbase Shell's HQL and the Altools POC.

We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on it. We can store very sparse RDF data in a single Hbase table, with as many columns as the data needs. For example, we might make a row for each RDF subject and store all of its properties and their values as columns of that row. This reduces costly self-joins and therefore makes query processing more efficient, although we still need self-joins for RDF path queries. We can further accelerate query performance by using MapReduce for parallel, distributed query processing.

== Initial Contributors ==
 * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development Center, NHN Corp.)
 * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab, Division of Computer Science, KAIST)

== Background ==
~-''To do: explain why we think about storing and retrieving RDF in Hbase. -- udanax''-~

[http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]

== Rationale ==
 * What is RDF?
 * Previous methods for storing RDF data and processing queries
 * Their weak points
 * The method in Hbase
 * Its strong points

== Considerations ==
The Sawzall paper says that the record-at-a-time model is not good for table joins. This problem shows up in typical join operations, so we should think about which kinds of joins Sawzall or MapReduce can perform efficiently in parallel, or can process at all.

When we perform a nested loop join, for each tuple in the outer table R we have to go through the inner table S to find all the joining tuples. To perform nested loop joins with MapReduce, we divide table R into M partitions, and each map worker joins one partition at a time with S, producing the part of the join result that corresponds to its partition. In a merge join, however, since table R is already sorted by subject, each map worker only needs to read about log N tuples of table S to find its joining tuples, where N is the number of tuples in S, which results in fast join performance. C-Store likewise reports that join operations can be performed efficiently in DSM because tables in DSM are sorted by subject, so merge joins on the sorted attribute can be used.

The key questions are how fast Sawzall or MapReduce can perform merge joins and how cleverly we can materialize join results for efficient processing of RDF path queries. The former is the key to beating C-Store's query performance: defeat C-Store with massively parallelized query processing. The problem is the initial delay in executing MapReduce jobs, caused by the time spent assigning the computation to multiple machines. This can take far more time than necessary and thus hurt query response time. So the parallelism obtained by using MapReduce pays off mainly for queries over huge amounts of RDF data, which take a long time to process anyway. We might consider a selective parallelism, where users decide whether or not to use MapReduce for their queries, as in "select ... '''in parallel'''". A minimal sketch of the partitioned merge join idea is given below.
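The following is a minimal, self-contained Python sketch of that partitioned merge join, intended only to make the idea above concrete. It is not HbaseRDF code: the toy tables, the helper names (`merge_join_partition`, `partition`), and the sequential loop standing in for map workers are invented for the example; in a real system each partition would be handled by one map worker and the tables would be read from Hbase.

{{{#!python numbering=off
# Sketch only: one "map worker" merge-joins a sorted partition of table R
# with the sorted table S on the subject attribute.
def merge_join_partition(r_part, s):
    """r_part and s are lists of (subject, value) pairs, both sorted by subject."""
    results, i, j = [], 0, 0
    while i < len(r_part) and j < len(s):
        (rs, rv), (ss, sv) = r_part[i], s[j]
        if rs < ss:
            i += 1
        elif rs > ss:
            j += 1
        else:
            # Pair this R tuple with every S tuple that shares the subject.
            k = j
            while k < len(s) and s[k][0] == rs:
                results.append((rs, rv, s[k][1]))
                k += 1
            i += 1
    return results

def partition(r, m):
    """Split table R into m contiguous partitions, preserving sort order."""
    size = (len(r) + m - 1) // m
    return [r[i:i + size] for i in range(0, len(r), size)]

if __name__ == '__main__':
    # Toy property tables, already sorted by subject (as DSM columns or Hbase rows are).
    author = [('ID1', 'Fox, Joe'), ('ID2', 'Orr, Tim')]
    copyright_ = [('ID1', '2001'), ('ID2', '1985'), ('ID5', '1995'), ('ID6', '2004')]
    joined = []
    for part in partition(author, 2):  # each partition = one map worker's input
        joined.extend(merge_join_partition(part, copyright_))
    print(joined)  # [('ID1', 'Fox, Joe', '2001'), ('ID2', 'Orr, Tim', '1985')]
}}}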
== HbaseRDF Data Loader ==
HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes it into an Hbase table in such a way that efficient query processing is possible. It reads one triple at a time and inserts it into the table, roughly as in the following pseudocode, where `execute()` stands in for whatever client submits HQL statements to Hbase (a client-side sketch of the same idea is given further below, after the architecture links):

{{{#!python numbering=off
value_count = 0
for s, p, o in triples:
    # Row key = subject, column = property plus a counter qualifier (so that
    # repeated properties do not overwrite each other), cell value = object.
    hql = "insert into rdf_table ('%s:%d') values ('%s') where row='%s'" % (p, value_count, o, s)
    execute(hql)  # execute() is a placeholder for whatever client submits HQL to Hbase
    value_count = value_count + 1
}}}

An example, using the sample data from the C-Store paper:

{{{#!CSV ;
Subj.; Prop.; Obj.
ID1; type; BookType
ID1; title; "XYZ"
ID1; author; "Fox, Joe"
ID1; copyright; "2001"
ID2; type; CDType
ID2; title; "ABC"
ID2; artist; "Orr, Tim"
ID2; copyright; "1985"
ID2; language; "French"
ID3; type; BookType
ID3; title; "MNO"
ID3; language; "English"
ID4; type; DVDType
ID4; title; "DEF"
ID5; type; CDType
ID5; title; "GHI"
ID5; copyright; "1995"
ID6; type; BookType
ID6; copyright; "2004"
}}}

== HbaseRDF Query Processor ==
HbaseRDF Query Processor (HQP) executes RDF queries on the data stored in an Hbase table. It translates RDF queries into Hbase API calls or MapReduce jobs, gathers the results, and returns them to the user (a hand-translated example of this step is also sketched after the architecture links). Query processing proceeds as follows:
 * Parsing, in which a parse tree representing the SPARQL query is constructed.
 * Query rewriting, in which the parse tree is converted into an initial query plan, which is in turn transformed into an equivalent plan that is expected to take less time to execute. We also have to choose which algorithm to use for each operation in the selected plan; among the candidates are MapReduce jobs for parallel algorithms.
 * Execution of the selected plan.

== HbaseRDF Data Materializer ==
HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results in an Hbase table. Later, HQP uses the materialized data for efficient processing of RDF path queries.

== Hbase Shell Extension ==
=== Hbase Shell - RDF Shell ===
{{{
Hbase > rdf;
Hbase RDF version 0.1
Type 'help;' for help.

Hbase.RDF > SELECT ?title
          > FROM rdf_table
          > WHERE { ?book author "Fox, Joe" .
          >         ?book copyright "2001" .
          >         ?book title ?title }

results here.

Hbase.RDF > exit;
Hbase >
}}}

=== Hbase SPARQL ===
 * Support for the full SPARQL syntax
 * Support for a syntax to load RDF data into an Hbase table

== Alternatives ==
 * A triples table stores RDF triples in a single table with three attributes: subject, property, and object.
 * A property table puts properties that are frequently queried together into a single table, to reduce costly self-joins. Used in Jena and Oracle.
 * A decomposed storage model (DSM) uses one table per property, sorted by subject. Used in C-Store.
 * ''Actually, the decomposed storage model is almost the same as the storage model in Hbase.''

== Food for thought ==
 * What are the differences between Hbase and C-Store?
 * Is DSM suitable for Hbase?
 * How do we translate SPARQL queries into MapReduce functions or Hbase API calls?

== Hbase RDF Storage Subsystems Architecture ==
 * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
 * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
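To make the Data Loader's per-triple insert concrete, here is a small client-side sketch. It is only an illustration under stated assumptions, not part of HbaseRDF: it assumes a Thrift-based Python client (happybase), an already-created `rdf_table` whose column families cover the property names used in the data, and a simplistic whitespace-separated triple file; a real loader would use a proper RDF parser.

{{{#!python numbering=off
import happybase  # assumed Thrift-based Hbase client, used here only for illustration

def load_triples(path, host='localhost'):
    """Store each (subject, property, object) triple as one cell:
    row key = subject, column = property:counter, cell value = object."""
    connection = happybase.Connection(host)
    table = connection.table('rdf_table')  # assumed to exist, with one column
                                           # family per property used in the data
    value_count = 0
    with open(path) as f:
        for line in f:
            # Simplistic parsing: one "subject property object" triple per line.
            s, p, o = line.strip().split(None, 2)
            column = '%s:%d' % (p, value_count)
            table.put(s.encode(), {column.encode(): o.encode()})
            value_count += 1
    connection.close()

# Example: the triple (ID1, title, "XYZ") ends up in row 'ID1',
# column 'title:0', with cell value "XYZ".
}}}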
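And here is a hand-written sketch of what the HQP translation step could produce for the example query in the RDF shell session above, when executed as plain Hbase API calls (a full scan with client-side filtering rather than a MapReduce job). The same assumed happybase client and single-table layout are used; the function name and the scan-and-filter strategy are illustrative only and not the actual HQP algorithm.

{{{#!python numbering=off
import happybase  # same assumed client as in the loader sketch

def titles_by_author_and_copyright(author, year, host='localhost'):
    """Rough translation of:
         SELECT ?title FROM rdf_table
         WHERE { ?book author "..." . ?book copyright "..." . ?book title ?title }
    Scan every row (subject), keep rows whose author and copyright cells match,
    and project out the title cells."""
    connection = happybase.Connection(host)
    table = connection.table('rdf_table')
    titles = []
    for row_key, columns in table.scan():
        values = {}
        for col, val in columns.items():
            family = col.split(b':', 1)[0]  # drop the value counter qualifier
            values.setdefault(family, set()).add(val)
        if (author.encode() in values.get(b'author', set())
                and year.encode() in values.get(b'copyright', set())):
            titles.extend(sorted(values.get(b'title', set())))
    connection.close()
    return titles

# Example: titles_by_author_and_copyright('Fox, Joe', '2001') -> [b'XYZ']
}}}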
----
= Papers =
 * ~-OSDI 2004, ''"MapReduce: Simplified Data Processing on Large Clusters"'' - proposes a very simple but powerful and highly parallelized data processing technique.-~
 * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
 * ~-VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning]'' - proposes an efficient method to store RDF data in table projections (i.e., columns) and execute queries on them.-~