Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
- 
  [[TableOfContents(4)]]
  ----
- = HbaseRDF, an Hbase Subsystem for RDF =
+ == HbaseRDF, an Hbase Subsystem for RDF ==
  
-  -- ''Any comments on HbaseRDF are welcomed.''
+  -- ''Volunteers and comments on HbaseRDF are welcome.''
  
 We have started to think about storing and querying RDF data in Hbase, but 
we will jump into its implementation only after careful investigation. 
  
- We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + 
MapReduce to store RDF data and execute queries (e.g., SPARQL) on them.
+ We are calling for the development of an Hbase subsystem for RDF, called 
HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries 
(e.g., SPARQL) over it.
 We can store very sparse RDF data in a single table in Hbase, with as many 
columns as the data needs. For example, we might make a row for each RDF 
subject and store all of its properties and their values as columns of that 
row. 
 This reduces the costly self-joins otherwise needed to answer queries about 
the same subject, and so makes query processing more efficient, although we 
still need self-joins to answer RDF path queries.
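 
 For illustration (a sketch only, using two triples from the C-Store sample 
data and the column naming of the loader pseudocode later on this page), two 
triples about the same subject become two columns of one row:
 
 {{{#!python numbering=off
 # (ID1, title, "XYZ") and (ID1, author, "Fox, Joe") land in a single row:
 rdf_table = {
     'ID1': {'title:2': 'XYZ', 'author:3': 'Fox, Joe'},
 }
 # A query asking for both the title and the author of ID1 reads one row
 # instead of joining two copies of a triples table.
 }}}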
@@ -18, +17 @@

  parallel, distributed query processing. 
  
  === Related projects ===
-   * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 Relational 
Algrebra Operators] is designing and implementing relational algebra operators. 
See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebric Tools] for 
various algebric operators we are designing and planing to implement, including 
relational algebra operators.
+  * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 HADOOP-1608 
Relational Algebra Operators] covers the design and implementation of 
relational algebra operators. See 
[http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebraic Tools] for the 
various algebraic operators we are designing and planning to implement, 
including relational algebra operators.
-   * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] 
provides a command line tool in which we can manipulate tables in Hbase. We are 
also planning to use HbaseShell to manipulate and query RDF data to be stored 
in Hbase.
+  * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] 
provides a command-line tool with which we can manipulate tables in Hbase. We 
are also planning to use HbaseShell to manipulate and query the RDF data to be 
stored in Hbase.
+  * [https://issues.apache.org/jira/browse/HADOOP-1120 contrib/data_join] 
provides helper classes for implementing data join operations as MapReduce 
jobs. Thanks to Runping.
   
- == Initial Contributors ==
+ === Initial Contributors ===
  
   * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] 
(Research and Development center, NHN corp.)
-  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] 
(Database Lab. , KAIST) 
+  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] 
(Database Lab, KAIST) 
  
- == Considerations ==
+ == Some Ideas ==
 When we store RDF data in a single Hbase table and process queries over it, 
an important issue to consider is how to efficiently perform the costly 
self-joins needed to answer RDF path queries. 
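 
 As a concrete (and hypothetical) example, consider the SPARQL path pattern 
''?x advisor ?y . ?y member ?z'': the object found in the first step has to be 
looked up again as a row key in the second step, so the table is effectively 
joined with itself. A toy sketch over the single-table layout:
 
 {{{#!python numbering=off
 # Hypothetical data; the properties and values are made up for illustration,
 # and the column numbering follows the loader pseudocode further below.
 rdf_table = {
     'StudentA': {'advisor:1': 'ProfB'},
     'ProfB': {'member:2': 'DatabaseLab'},
 }
 
 # ?x advisor ?y . ?y member ?z
 for x, row in rdf_table.items():
     for col, y in row.items():
         if col.split(':')[0] == 'advisor':
             # y ('ProfB') becomes a row key again: a join of the table
             # with itself
             for col2, z in rdf_table.get(y, {}).items():
                 if col2.split(':')[0] == 'member':
                     print(x, y, z)
 }}}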
  
  To speed up these costly self-joins, it is natural to think about using 
@@ -55, +55 @@

  Currently, C-Store shows the best query performance on RDF data.
 However, armed with Hbase and MapReduceMerge, we believe we can do even 
better.
  
+ == Resources ==
+  * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL Query Language for 
RDF, a W3C Candidate Recommendation as of 14 June 2007.
+  * A test suite for SPARQL can be found at 
http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF 
data, SPARQL queries, and expected results.
+  * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj 
SPARQL Grammar in JavaCC] - from Jena ARQ
+  * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
+ 
+ == Architecture Sketch ==
+ 
- == HbaseRDF Data Loader ==
+ === HbaseRDF Data Loader ===
 HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data 
into an Hbase table in such a way that efficient query processing is possible. 
In Hbase, we can store everything in a single table.
 The sparsity of RDF data is not a problem, because Hbase, which is 
@@ -65, +73 @@

 HDL reads one triple at a time and inserts it into an Hbase table as 
follows:
  
  {{{#!python numbering=off
- value_count = 0
+ value_count = 1
  for s, p, o in triples:
    insert into rdf_table ('p:value_count') values ('o')
      where row='s'
    value_count = value_count + 1
  }}}
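 
 Below is a minimal, runnable Python sketch of the same loading loop. The 
put_cell helper and the in-memory rdf_table dict are only stand-ins for a real 
Hbase client call and table (they are not existing Hbase or HbaseRDF APIs); 
the triples are the first few rows of the C-Store sample data.
 
 {{{#!python numbering=off
 # Stand-in for an Hbase put: store one (row, column) -> value cell.
 def put_cell(table, row, column, value):
     table.setdefault(row, {})[column] = value
 
 rdf_table = {}
 triples = [
     ('ID1', 'type', 'BookType'),
     ('ID1', 'title', 'XYZ'),
     ('ID1', 'author', 'Fox, Joe'),
     ('ID1', 'copyright', '2001'),
 ]
 
 value_count = 1
 for s, p, o in triples:
     # row key = subject, column = property family plus a running counter,
     # cell value = object -- exactly the scheme of the pseudocode above
     put_cell(rdf_table, s, '%s:%d' % (p, value_count), o)
     value_count = value_count + 1
 
 print(rdf_table['ID1'])
 # -> columns type:1, title:2, author:3, copyright:4 on row ID1
 }}}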
  
- Examples with the data from C-Store.
- 
- {{{#!CSV ;  
- Subj.; Prop.; Obj.
- ID1; type; BookType
- ID1; title; “XYZ”
- ID1; author; “Fox, Joe”
- ID1; copyright; “2001”
- ID2; type; CDType
- ID2; title; “ABC”
- ID2; artist; “Orr, Tim”
- ID2; copyright; “1985”
- ID2; language; “French”
- ID3; type; BookType
- ID3; title; “MNO”
- ID3; language; “English”
- ID4; type; DVDType
- ID4; title; “DEF”
- ID5; type; CDType
- ID5; title; “GHI”
- ID5; copyright; “1995”
- ID6; type; BookType
- ID6; copyright; “2004”
- }}}
- 
- == HbaseRDF Query Processor ==
+ === HbaseRDF Query Processor ===
 HbaseRDF Query Processor (HQP) executes RDF queries over the RDF data stored 
in an Hbase table. 
 It translates RDF queries into Hbase API calls or MapReduce jobs, gathers the 
results, and returns them to the user (a toy example follows the processing 
steps below). 
@@ -108, +91 @@

  * Query rewrite, in which the parse tree is converted to an initial query 
plan, which is, in turn, transformed into an equivalent plan that is expected 
to take less time to execute. We then have to choose which algorithm to use 
for each operation in the selected plan; among the candidates are parallel 
algorithms, such as parallel joins with MapReduceMerge.
   * Execute the plan
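 
 The sketch below is a toy, in-memory stand-in for the Hbase scans or 
MapReduce jobs HQP would actually issue. It evaluates one SPARQL basic graph 
pattern over the single-table layout; because both triple patterns share the 
same subject, they are answered from a single row, with no self-join.
 
 {{{#!python numbering=off
 # SELECT ?title WHERE { ?s type "BookType" . ?s title ?title }
 # evaluated against the layout produced by the loader sketch above.
 rdf_table = {
     'ID1': {'type:1': 'BookType', 'title:2': 'XYZ', 'author:3': 'Fox, Joe'},
     'ID2': {'type:5': 'CDType', 'title:6': 'ABC', 'artist:7': 'Orr, Tim'},
 }
 
 def values_of(row, prop):
     # all values stored under the column family 'prop' in one row
     return [v for col, v in row.items() if col.split(':')[0] == prop]
 
 for subject, row in rdf_table.items():
     if 'BookType' in values_of(row, 'type'):     # ?s type "BookType"
         for title in values_of(row, 'title'):    # ?s title ?title
             print(subject, title)                # -> ID1 XYZ
 }}}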
   
- == HbaseRDF Data Materializer ==
+ === HbaseRDF Data Materializer ===
 HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the 
results in an Hbase table. Later, HQP uses the materialized results for 
efficient processing of RDF path queries. 
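 
 Continuing the hypothetical advisor/member path example from earlier on this 
page, HDM could write the composed value back as a derived column (the 
advisor.member column name is made up for illustration), so that HQP can later 
answer the path query with a single row read:
 
 {{{#!python numbering=off
 rdf_table = {
     'StudentA': {'advisor:1': 'ProfB'},
     'ProfB': {'member:2': 'DatabaseLab'},
 }
 
 # Materialize  ?x advisor ?y . ?y member ?z  as a column on x's row.
 for x, row in rdf_table.items():
     for col, y in list(row.items()):
         if col.split(':')[0] == 'advisor':
             for col2, z in rdf_table.get(y, {}).items():
                 if col2.split(':')[0] == 'member':
                     row['advisor.member:1'] = z
 
 print(rdf_table['StudentA'])
 # now also contains the materialized cell 'advisor.member:1': 'DatabaseLab'
 }}}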
  
- == Hbase Shell Extention ==
+ === Hbase Shell Extension ===
- === Hbase Shell - RDF Shell ===
+ 
  {{{
  Hbase > rdf;
  
@@ -134, +117 @@

  Hbase > 
  }}}
  
- === Hbase SPARQL ===
-  * Support for the full SPARQL syntax
-  * Support for a syntax to load RDF data into an Hbase table
- 
  == Alternatives ==
  * A triples table stores RDF triples in a single table with three 
attributes: subject, property, and object (see the sketch below for a contrast 
with the Hbase layout described above).
  
@@ -146, +125 @@

  * A decomposed storage model (DSM), one table for each property, sorted by 
the subject. Used in C-Store.
   * ''Actually, the decomposed storage model is almost the same as the 
storage model in Hbase.''
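 
 For contrast, here is a small sketch (plain Python over the C-Store sample 
triples, not an HbaseRDF API) of answering "titles of all Books" on a triples 
table: the two triple patterns live in different rows, so the table has to be 
joined with itself on the subject, which the Hbase layout avoids for such 
subject-local queries.
 
 {{{#!python numbering=off
 # A triples table: one (subject, property, object) row per triple.
 triples = [
     ('ID1', 'type', 'BookType'),
     ('ID1', 'title', 'XYZ'),
     ('ID1', 'author', 'Fox, Joe'),
     ('ID2', 'type', 'CDType'),
     ('ID2', 'title', 'ABC'),
 ]
 
 # Self-join on the subject: match 'type' rows against 'title' rows.
 books = set(s for s, p, o in triples if p == 'type' and o == 'BookType')
 for s, p, o in triples:
     if p == 'title' and s in books:
         print(s, o)        # -> ID1 XYZ
 }}}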
  
- == Hbase Storage for RDF ==
- 
- ~-''Do explain : Why do we think about storing and retrieval RDF in Hbase? -- 
udanax''-~
- 
- 
[http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
- 
- == Hbase RDF Storage Subsystems Architecture ==
-  * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
-  * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
- 
- ----
- = Papers =
+ == Papers ==
  
  * ~-OSDI 2004, ''MapReduce: Simplified Data Processing on Large Clusters'' - 
proposes a simple but powerful, highly parallelized data processing 
technique.-~
   * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf 
Column-Stores For Wide and Sparse Data]'' - discusses the benefits of using 
C-Store to store RDF and XML data.-~
