Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

The comment on the change is:
related projects added

------------------------------------------------------------------------------
  [[TableOfContents(4)]]
  ----
- = HbaseRDF, an Hbase subsystem for RDF =
+ = HbaseRDF, an Hbase Subsystem for RDF =
  
  -- ''Any comments on HbaseRDF are welcome.''
  
@@ -16, +16 @@

  We can further accelerate query performance by using MapReduce for 
  parallel, distributed query processing. 
  
+ === Related projects ===
+   * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 Relational 
Algebra Operators] is designing and implementing relational algebra operators. 
See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebraic Tools] for 
the various algebraic operators we are designing and planning to implement, 
including relational algebra operators.
+   * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] 
provides a command-line tool in which we can manipulate tables in Hbase. We are 
also planning to use HbaseShell to manipulate and query RDF data stored 
in Hbase.
+  
  == Initial Contributors ==
  
-  * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] 
(Research and Development center, NHN corp.)
+  * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] 
(Research and Development center, NHN corp.)
   * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] 
(Database Lab. , KAIST) 
  
  == Considerations ==
- The Sawzall paper from Google says that MapReduce framework 
- is not good for table joins. It is possible, but  while we are reading one 
table 
+ When we store RDF data in a single Hbase table and process queries on it, 
an important issue we have to consider is how to reduce the costly self-joins 
needed to process RDF path queries. 
+ 
+ To speed up these costly self-joins, it is natural to think of using 
+ the MapReduce framework we already have. However, in the Sawzall paper from 
Google, the authors say that the MapReduce framework is 
+ ill-suited to performing table joins. 
+ Joins are possible, but while we are reading one table in a map 
- in map or reduce functions, we have to read other tables on the fly.
+ or reduce function, we have to read the other tables on the fly, which 
+ results in less parallelized join processing.
  
 There is a paper on this subject by Yang et al., from Yahoo (SIGMOD 07). 
 The paper presents Map-Reduce-Merge, an extension of the MapReduce 
framework that implements several relational operators, including joins. 
 They extend MapReduce with an additional Merge phase to implement 
efficient data relationship processing. 
 See the Paper section below for more information. -- Thanks stack.
+ (Somebody help us here!)
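To make the difficulty concrete, here is a rough sketch (plain Python simulating the map, shuffle, and reduce phases on a single machine; all names are ours, not Hadoop APIs) of a reduce-side self-join for the two-step path query ''?x knows ?y . ?y knows ?z'':

```python
from collections import defaultdict

# Toy triple table; in Hbase this would be streamed through map tasks.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
]

def map_phase(triples):
    # Emit each triple twice, keyed on the join variable ?y:
    # once by its object (left side of the join), once by its subject.
    for s, p, o in triples:
        if p == "knows":
            yield o, ("left", s)   # ?y appears as the object
            yield s, ("right", o)  # ?y appears as the subject

def reduce_phase(key, values):
    # Pair every left-side binding with every right-side binding.
    lefts = [v for tag, v in values if tag == "left"]
    rights = [v for tag, v in values if tag == "right"]
    for x in lefts:
        for z in rights:
            yield (x, key, z)

# Group by join key -- this stands in for MapReduce's shuffle step.
groups = defaultdict(list)
for k, v in map_phase(triples):
    groups[k].append(v)

paths = [t for k, vs in groups.items() for t in reduce_phase(k, vs)]
# Two-step paths such as ("alice", "bob", "carol")
```

Note that both "sides" of the join come from the same table, which is exactly the self-join the text describes; with more than one table, a map or reduce function would have to fetch the other tables on the fly.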
  
 But the problem is that there is an initial delay in executing MapReduce 
jobs, due to the time spent assigning the computations to multiple machines. This 
@@ -46, +56 @@

  
  == HbaseRDF Data Loader ==
 HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data 
- into a Hbase table in such a way that efficient query processing is possible.
+ into a Hbase table in such a way that efficient query processing is possible. 
In Hbase, we can store everything in a single table.
+ The sparsity of RDF data is not a problem, because Hbase, which is 
+ a column-oriented store that adopts various compression techniques, 
+ is very good at dealing with nulls in the table.
+ 
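As an illustration of this single-table layout (our own toy example, not a fixed HbaseRDF schema), triples can be stored with the subject as the row key, the predicate as the column, and the object as the cell value, so each subject's statements share one sparse row:

```python
# Illustrative layout only: subject -> row key, predicate -> column,
# object -> cell value. Rows are sparse; predicates a subject lacks
# simply have no cell, which a column-oriented store handles cheaply.

triples = [
    ("http://example.org/alice", "foaf:name", "Alice"),
    ("http://example.org/alice", "foaf:knows", "http://example.org/bob"),
    ("http://example.org/bob", "foaf:name", "Bob"),
]

table = {}  # row key -> {column: value}
for subject, predicate, obj in triples:
    table.setdefault(subject, {})[predicate] = obj

# alice's row holds two cells; bob's row holds one -- no nulls stored.
```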
- It reads a triple at a time and inserts the triple into a Hbase table as 
follows:
+ HDL reads a triple at a time and inserts the triple into a Hbase table as 
follows:
  
  {{{#!python numbering=off
  value_count = 0
@@ -90, +104 @@

  Query processing steps are as follows:
  
  * Parsing, in which a parse tree representing the SPARQL query is 
constructed.
-  * Query rewrite, in which the parse tree is converted to an initial query 
plan, which is, in turn, transformed into an equivalent plan that is expected 
to require less time to execute. We have to choose which algorithm to use for 
each operation in the selected plan. Among them are MapReduce jobs for parallel 
algorithms.
+  * Query rewrite, in which the parse tree is converted to an initial query 
plan, which is, in turn, transformed into an equivalent plan that is expected 
to require less time to execute. We then choose which algorithm to use for 
each operation in the selected plan, among them parallel versions of 
algorithms, such as parallel joins with Map-Reduce-Merge.
  * Plan execution, in which the selected plan is run against the data.
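The steps above can be sketched as follows (plain Python with invented names, not the actual HbaseRDF interfaces; the rewrite step here is deliberately trivial):

```python
# Sketch of the three query-processing steps. All function names and
# plan shapes are invented for illustration only.

def parse(sparql):
    # Step 1: build a parse tree. Here we handle a single triple
    # pattern of the form "?x predicate object".
    subject, predicate, obj = sparql.split()
    return ("triple_pattern", subject, predicate, obj)

def rewrite(parse_tree):
    # Step 2: convert the tree into a plan and pick an algorithm for
    # each operation (a real system might pick a parallel join here).
    _, s, p, o = parse_tree
    return [("filter", p, o)]

def execute(plan, table):
    # Step 3: run the selected plan against the single Hbase-style
    # table (row key -> {predicate: object}).
    rows = list(table.items())
    for op in plan:
        if op[0] == "filter":
            _, p, o = op
            rows = [(k, v) for k, v in rows if v.get(p) == o]
    return [k for k, _ in rows]

table = {"alice": {"knows": "bob"}, "carol": {"knows": "dave"}}
result = execute(rewrite(parse("?x knows bob")), table)
# result == ["alice"]
```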
   
  == HbaseRDF Data Materializer ==
@@ -137, +151 @@

  
  
[http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
  
- == Rationale ==
- 
-  * What is RDF
-  * Previous methods for storing RDF data and processing queries
-   * Their weak points
-  * The method in Hbase
-   * Strong points
- 
- == Food for thought ==
-  * What are the differences between Hbase and C-Store.
-  
-  * Is DSM suitable for Hbase?
- 
-  * How to translate SPARQL queries into MapReduce functions, or Hbase APIs. 
- 
  == Hbase RDF Storage Subsystems Architecture ==
- 
   * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
   * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
  
