[Hadoop Wiki] Trivial Update of "HRDF" by Frederick Haebin Na

Apache Wiki Sun, 26 Oct 2008 21:15:37 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by Frederick Haebin Na:
http://wiki.apache.org/hadoop/HRDF

------------------------------------------------------------------------------
+ HRDF project changed its name to 
[http://wiki.apache.org/incubator/HeartProposal HEART] (Highly Extensible & 
Accumulative RDF Table).
- [[TableOfContents(4)]]
- ----
-  * I'm looking for a champion/mentor who can lead the proposal process.
-  * http://wiki.apache.org/incubator/HRdfStoreProposal
-  * http://code.google.com/p/hrdf/
  
- == HRDF, a Planet-Scale RDF Data Store ==
- 
- We have started to think about storing and querying RDF data in Hadoop + 
Hbase, but we will begin implementation only after careful investigation. 
- 
- We introduce a Hadoop subsystem for RDF, called HRDF, which uses Hbase + 
!MapReduce to store RDF data and execute queries (e.g., SPARQL) on it.
- We can store very sparse RDF data in a single Hbase table with as many 
columns as 
- it needs. For example, we might make a row for each RDF subject in the table 
and store all of its properties and their values as columns in that row. 
- This avoids costly self-joins for queries that touch several properties of the 
same subject, which makes such queries efficient to process, although we 
still need self-joins to answer RDF path queries.
- 
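The row-per-subject layout described above can be sketched in a few lines. This is only an illustration that uses an in-memory dictionary in place of an Hbase table. The sample triples and the helper names `build_subject_rows` and `properties_of` are hypothetical and do not come from the HRDF code.

```python
from collections import defaultdict

# Hypothetical sample triples: (subject, property, object).
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("bob", "worksAt", "kaist"),
]


def build_subject_rows(triples):
    """Group triples into one sparse row per subject: {subject: {property: [objects]}}."""
    rows = defaultdict(dict)
    for subject, prop, obj in triples:
        # Only properties that actually occur are stored, so rows stay sparse.
        rows[subject].setdefault(prop, []).append(obj)
    return dict(rows)


def properties_of(rows, subject, props):
    """Read several properties of one subject with a single row lookup (no self-join)."""
    row = rows.get(subject, {})
    return {p: row.get(p, []) for p in props}


rows = build_subject_rows(TRIPLES)
print(properties_of(rows, "alice", ["name", "knows", "worksAt"]))
```

A query that asks for several properties of one subject touches exactly one row here, whereas a three-column triples table would need one self-join per extra property.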
- We can further accelerate query performance by using !MapReduce for 
- parallel, distributed query processing. 
-  
- === Initial Contributors ===
- 
-  * [:udanax:Edward Yoon] (R&D center, NHN corp.)
-  * [:InchulSong: Inchul Song] (Database Lab, KAIST) 
-  * [http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423 A forum at 
Aduna/Sesame] would be interested in working with this group.
- 
- ----
- == Some Ideas ==
- When we store RDF data in a single Hbase table and process queries on them, 
an important issue we have to consider is how to efficiently perform costly 
self-joins needed to process RDF path queries. 
- 
- To speed up these costly self-joins, it is natural to think about using 
- the !MapReduce framework we already have. However, in the Sawzall paper from 
Google, the authors say that the !MapReduce framework is 
- not well suited to performing table joins. 
- Joins are possible, but while we are reading one table in the map 
- or reduce functions, we have to read the other tables on the fly, which
- results in less parallelized join processing.
- 
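To make the join problem concrete, here is a small sketch of a two-hop RDF path query written as a reduce-side (repartition) join. That is a common way to express joins in plain MapReduce, but it is not necessarily the approach HRDF takes. The map, shuffle, and reduce steps are simulated in ordinary Python, and the sample data and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample triples: (subject, property, object).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "worksAt", "kaist"),
    ("carol", "worksAt", "nhn"),
    ("dave", "worksAt", "nhn"),
]


def map_phase(triples, first_prop, second_prop):
    """Emit (join_key, tagged_value) pairs; the join key is the shared middle node."""
    for s, p, o in triples:
        if p == first_prop:
            yield o, ("L", s)  # left side of the join: keyed on the object
        if p == second_prop:
            yield s, ("R", o)  # right side of the join: keyed on the subject


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every left-side value with every right-side value that shares the key."""
    lefts = [v for tag, v in values if tag == "L"]
    rights = [v for tag, v in values if tag == "R"]
    for start in lefts:
        for end in rights:
            yield start, key, end


def path_query(triples, first_prop, second_prop):
    """Answer: ?x first_prop ?y . ?y second_prop ?z"""
    groups = shuffle(map_phase(triples, first_prop, second_prop))
    results = []
    for key in sorted(groups):
        results.extend(reduce_phase(key, groups[key]))
    return results


paths = path_query(TRIPLES, "knows", "worksAt")
print(paths)
```

Every triple is read once and routed by its join key, so no mapper has to look up another table on the fly. The cost is that all matching data moves through the shuffle.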
- There is a paper on this subject by Yang et al., from Yahoo (SIGMOD 
07). 
- The paper proposes Map-Reduce-Merge, an extended version of the 
!MapReduce framework 
- that implements several relational operators, including joins. It 
extends the 
- !MapReduce framework with an additional Merge phase to process 
relationships between data sets efficiently.
- See the Papers section below for more information. -- Thanks stack.
- (Edward is now implementing join operators using the !MapReduce framework.)
- 
- The problem, however, is that there is an initial delay in executing !MapReduce 
jobs due to 
- the time spent assigning the computations to multiple machines. This 
- delay might dominate the running time of small queries and thus hurt query response time. So the 
parallelism obtained by using !MapReduce pays off mainly for queries over 
huge amounts of RDF data, which take a long time to process. 
- We might consider selective parallelism, where 
- people can decide whether or not to use !MapReduce to process their queries, 
as in 
- "select ... '''in parallel'''".
- 
- Now that we have two sets of join algorithms, non-parallel versions and 
parallel versions with !MapReduceMerge,
- we are ready to do some massively parallel query processing on tremendous 
amounts of RDF data.
- Currently, C-Store shows the best query performance on RDF data.
- However, armed with Hbase and !MapReduceMerge, we can do even better.
- ----
- == Resources ==
-  * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a 
candidate recommendation of W3C as of 14 June 2007.
-  * A test suite for SPARQL can be found at 
http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF 
data, SPARQL queries, and expected results.
-  * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj 
SPARQL Grammar in JavaCC] - from Jena ARQ
-  * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
-  * [http://web.mit.edu/dna/www/abadirdf.pdf Scalable Semantic Web Data 
Management Using Vertical Partitioning] A good summary of techniques for storing RDF 
in an RDBMS.
-  * [http://www4.wiwiss.fu-berlin.de/benchmarks-200801/ RDF Store Benchmarks] 
with DBpedia
- 
- == Architecture Sketch ==
- 
- === HRDF Data Loader ===
- HRDF Data Loader (HDL) reads RDF data from a file and organizes the data 
- into an Hbase table in such a way that efficient query processing is possible. 
In Hbase, we can store everything in a single table.
- The sparsity of RDF data is not a problem, because Hbase, which is 
- a column-oriented store that adopts various compression techniques, 
- is very good at dealing with nulls in the table.
- 
- Figure 1. cell of space-time
- 
- === HRDF Query Processor ===
- HRDF Query Processor (HQP) executes RDF queries on RDF data stored in an Hbase 
table. 
- It translates RDF queries into API calls to Hbase or into !MapReduce jobs, 
then gathers the results and returns them
- to the user. 
- 
- Query processing steps are as follows:
- 
- {{{
- SPARQL query -> Parse tree -> Logical operator tree 
- -> Physical operator tree -> Execution
- }}}
- 
- The implementation of each step can proceed as a separate issue. 
- 
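The processing steps above can be illustrated with a toy pipeline. It handles a single triple pattern rather than real SPARQL, and a dictionary stands in for the Hbase table. Every name here (`Pattern`, `parse`, `to_logical`, `to_physical`, `execute`, `run`) is hypothetical and is meant only to show how the stages hand work to each other.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an Hbase table: {subject: {property: [objects]}}.
TABLE = {
    "alice": {"knows": ["bob"], "name": ["Alice"]},
    "bob": {"knows": ["carol"], "name": ["Bob"]},
}


@dataclass(frozen=True)
class Pattern:
    """Parse tree for one triple pattern; terms starting with '?' are variables."""
    subject: str
    prop: str
    obj: str


def parse(text):
    """Query text -> parse tree. Expects exactly 'subject property object'."""
    subject, prop, obj = text.split()
    return Pattern(subject, prop, obj)


def to_logical(pattern):
    """Parse tree -> logical operator tree (here a single match operator)."""
    return ("match", pattern)


def to_physical(logical):
    """Logical -> physical: a row get if the subject is bound, else a full scan."""
    _, pattern = logical
    kind = "scan" if pattern.subject.startswith("?") else "get"
    return (kind, pattern)


def execute(physical, table):
    """Run the physical operator against the table and collect matching triples."""
    kind, pat = physical
    subjects = list(table) if kind == "scan" else [pat.subject]
    results = []
    for s in subjects:
        for o in table.get(s, {}).get(pat.prop, []):
            if pat.obj.startswith("?") or pat.obj == o:
                results.append((s, pat.prop, o))
    return results


def run(text, table=TABLE):
    return execute(to_physical(to_logical(parse(text))), table)


print(run("alice knows ?who"))
print(run("?x knows carol"))
```

The sketch assumes the property is always bound. In a fuller design, the choice between a row get and a full scan is the kind of decision where the physical step might also decide whether to launch a MapReduce job.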
- === HRDF Data Materializer ===
- HRDF Data Materializer (HDM) pre-computes RDF path queries and stores the 
results
- in an Hbase table. Later, HQP uses the materialized data for efficient 
processing of 
- RDF path queries. 
- ----
- == Alternatives For RDF Storage ==
-  * A triples table stores RDF triples in a single table with three 
attributes: subject, property, and object.
-  * A property table puts properties that are frequently queried together into a single 
table to reduce costly self-joins. Used in Jena and Oracle. 
-  * A decomposed storage model (DSM) keeps one table for each property, sorted by 
the subject. Used in C-Store.
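
The three layouts listed above can be compared on the same handful of triples. This is a rough in-memory sketch with hypothetical data and function names. The property table assumes each property has at most one value per subject, which real RDF data does not guarantee.

```python
from collections import defaultdict

# Hypothetical sample triples: (subject, property, object).
TRIPLES = [
    ("bob", "name", "Bob"),
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
]


def triples_table(triples):
    """Triples table: one row per triple with three attributes."""
    return [{"subject": s, "property": p, "object": o} for s, p, o in triples]


def property_table(triples, props):
    """Property table: one row per subject, one column per chosen property.

    Assumes single-valued properties; missing values are stored as None.
    """
    rows = {}
    for s, p, o in triples:
        if p in props:
            rows.setdefault(s, {q: None for q in props})[p] = o
    return rows


def decomposed_storage(triples):
    """DSM: one (subject, object) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


flat = triples_table(TRIPLES)
wide = property_table(TRIPLES, ["name", "knows"])
dsm = decomposed_storage(TRIPLES)
print(flat[0])
print(wide["bob"])
print(dsm["name"])
```

The `None` entry in the property table row for `bob` shows where sparsity appears in that layout. The decomposed layout stores no nulls, but it needs a join across property tables to rebuild a subject.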
- ----
- == Papers ==
- 
-  * OSDI 2004, ''!MapReduce: Simplified Data Processing on Large Clusters'', 
proposes a very simple but powerful and highly parallelized data processing 
technique.
-  * CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf 
Column-Stores For Wide and Sparse Data]'', discusses the benefits of using 
C-Store to store RDF and XML data.
-  * VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable 
Semantic Web Data Management Using Vertical Partitioning]'', proposes an 
efficient method to store RDF data in table projections (i.e., columns) and 
execute queries on them.
-  * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on 
Large Clusters'', extends !MapReduce to implement several relational operators.
-
