Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
  [[TableOfContents(4)]]
  ----
- = Hbase RDF Storage Subsystems =
+ = HbaseRDF, an Hbase subsystem for RDF =
  -- ''Any comments on HbaseRDF are welcome.''
@@ -21, +21 @@
   * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development center, NHN corp.)
   * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab., KAIST)
- == Background ==
+ == Considerations ==
+ The Sawzall paper from Google says that the MapReduce framework
+ is not well suited to table joins. Joins are possible, but while we are reading one table
+ in the map or reduce functions, we have to read the other tables on the fly.
+ There is a paper on this subject written by Yang et al. from Yahoo (SIGMOD 07).
- ~-''Do explain: Why do we think about storing and retrieving RDF in Hbase? -- udanax''-~
-
- [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
-
- == Rationale ==
-
-  * What is RDF
-  * Previous methods for storing RDF data and processing queries
-  * Their weak points
-  * The method in Hbase
-  * Strong points
-
- == Considerations ==
- The Sawzall paper says that the record-at-a-time model is not good for table joins.
- We think this problem occurs for typical join operations.
-
- Let us consider how to implement general join operations with MapReduce.
- When we perform a nested loop join, for each tuple in the outer table,
- table R, we have to go through the inner table, table S, to find
- all the joining tuples. To perform nested loop joins with MapReduce,
- we divide table R into M partitions, and each
- map worker joins one partition of the table at a time with S.
- Each map worker produces the portion of the join results corresponding
- to the partition it is in charge of.
-
- For merge joins, since table R is already sorted by
- subject, each map worker only needs to read log N tuples in table S, where
- N is the number of tuples in S, which results in fast join performance.
-
- There is a paper on this subject
- written by Yang et al. from Yahoo (SIGMOD 07). The paper provides
- Map-Reduce-Merge, which is an extended version of MapReduce,
+ The paper provides Map-Reduce-Merge, which is an extended version of the MapReduce framework,
- that implements several relational operators including joins.
+ that implements several relational operators, including joins. They have extended the
+ MapReduce framework with an additional Merge phase to implement efficient data relationship processing. See the Paper section below for more information. -- Thanks, stack.
-
- The key parts are how efficiently MapReduce performs joins and
- how cleverly we materialize join results for efficient query processing of RDF path queries. In particular, the former is the key to beating
- C-Store's query performance: defeat C-Store with massively parallelized query
- processing. But the problem is that there is an initial delay in executing MapReduce jobs due to the time spent in assigning the computations to multiple machines. This
@@ -69, +38 @@
  We might consider a selective parallelism where people can decide
  whether to use MapReduce or not to process their queries, as in "select ... '''in parallel'''".
+
+ Now that we have two sets of join algorithms, non-parallel versions and parallel versions with Map-Reduce-Merge,
+ we are ready to do some massively parallel query processing on tremendous amounts of RDF data.
+ Currently, C-Store shows the best query performance on RDF data.
+ However, armed with Hbase and Map-Reduce-Merge, we can do even better.
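A minimal sketch of the partitioned nested-loop join described in the Considerations section above, with plain Java standing in for real map workers (the Tuple type, the join condition, and the partition count M are illustrative assumptions, not part of HbaseRDF):

{{{#!java
import java.util.List;

// Single-process illustration of the partitioned nested-loop join: the
// outer table R is split into M partitions, and each "map worker" joins
// only its own partition against the whole inner table S.
public class PartitionedNestedLoopJoin {

    // An RDF-style tuple: (subject, property, object). Illustrative only.
    record Tuple(String subject, String property, String object) {}

    // One path-query join step: the object of r must equal the subject of s.
    static boolean joins(Tuple r, Tuple s) {
        return r.object().equals(s.subject());
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(
                new Tuple("alice", "knows", "bob"),
                new Tuple("bob", "knows", "carol"));
        List<Tuple> s = List.of(
                new Tuple("bob", "worksAt", "kaist"),
                new Tuple("carol", "worksAt", "nhn"));

        final int m = 2; // number of partitions of R, one per map worker
        for (int worker = 0; worker < m; worker++) {
            // Each worker scans only its own slice of R ...
            for (int i = worker; i < r.size(); i += m) {
                // ... but still scans all of S for every outer tuple, which
                // is exactly the cost a merge join over subject-sorted
                // tables avoids.
                for (Tuple inner : s) {
                    if (joins(r.get(i), inner)) {
                        System.out.println(r.get(i) + " |><| " + inner);
                    }
                }
            }
        }
    }
}
}}}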
  == HbaseRDF Data Loader ==
  HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data
@@ -157, +131 @@
   * A decomposed storage model (DSM), one table for each property, sorted by the subject. Used in C-Store.
   * ''Actually, the decomposed storage model is almost the same as the storage model in Hbase (see the sketch at the end of this page).''
+ == Hbase Storage for RDF ==
+
+ ~-''Do explain: Why do we think about storing and retrieving RDF in Hbase? -- udanax''-~
+
+ [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
+
+ == Rationale ==
+
+  * What is RDF
+  * Previous methods for storing RDF data and processing queries
+  * Their weak points
+  * The method in Hbase
+  * Strong points

  == Food for thought ==
   * What are the differences between Hbase and C-Store?
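A similarly minimal sketch of the decomposed storage model mentioned above, with in-memory TreeMaps standing in for Hbase's subject-sorted tables (the properties, the path query, and the one-object-per-subject simplification are all assumptions made for illustration, not the actual HbaseRDF schema):

{{{#!java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// One subject-sorted table per property, as in the DSM used by C-Store.
public class DsmSketch {
    public static void main(String[] args) {
        // "knows" table: subject -> object, kept sorted by subject.
        NavigableMap<String, String> knows = new TreeMap<>();
        knows.put("alice", "bob");
        knows.put("bob", "carol");

        // "worksAt" table: subject -> object, kept sorted by subject.
        NavigableMap<String, String> worksAt = new TreeMap<>();
        worksAt.put("bob", "kaist");
        worksAt.put("carol", "nhn");

        // Path query ?x knows ?y . ?y worksAt ?z: join the object column
        // of "knows" against the sorted subject column of "worksAt". Each
        // probe costs O(log N), the fast-join case from the Considerations
        // section.
        for (Map.Entry<String, String> step : knows.entrySet()) {
            String z = worksAt.get(step.getValue());
            if (z != null) {
                System.out.println(step.getKey() + " -> " + step.getValue() + " -> " + z);
            }
        }
    }
}
}}}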