Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by udanax: http://wiki.apache.org/lucene-hadoop/Hbase/RDF

New page:

[[TableOfContents(4)]]
----
= Hbase RDF Storage Subsystems =
We have started to think about storing and querying RDF and RDF Schema in Hbase.[[BR]]However, we will take this up last, after a careful investigation and a review of the Hbase Shell's HQL and the Altools POC.

We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on it. We can store very sparse RDF data in a single Hbase table, with as many columns as the data needs. For example, we might make a row for each RDF subject and store all of its properties and their values as columns of that row. This reduces costly self-joins and therefore makes query processing more efficient, although we still need self-joins for RDF path queries. We can further accelerate query performance by using MapReduce for parallel, distributed query processing.

== Initial Contributors ==
 * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development Center, NHN Corp.)
 * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab, Division of Computer Science, KAIST)

== Background ==
~-''To do: explain why we think about storing and retrieving RDF in Hbase. -- udanax''-~

[http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]

== Rationale ==
 * What is RDF?
 * Previous methods for storing RDF data and processing queries
 * Their weak points
 * The method in Hbase
 * Its strong points

== Considerations ==
The Sawzall paper says that the record-at-a-time model is not good for table joins. This problem shows up in typical join operations, so we should think about which kinds of joins Sawzall or MapReduce can perform efficiently in parallel, or can process at all.

When we perform a nested loop join, for each tuple in the outer table R we have to go through the inner table S to find all the joining tuples. To perform nested loop joins with MapReduce, we divide table R into M partitions, and each map worker joins one partition at a time with S, producing the part of the join result that corresponds to its partition. In a merge join, however, since table R is already sorted by subject, each map worker only needs to read about log N tuples of table S to find its joining tuples, where N is the number of tuples in S, which results in fast join performance. C-Store likewise reports that join operations can be performed efficiently in DSM because tables in DSM are sorted by subject, so merge joins on the sorted attribute can be used.

The key questions are how fast Sawzall or MapReduce can perform merge joins and how cleverly we can materialize join results for efficient processing of RDF path queries. The former is the key to beating C-Store's query performance: defeat C-Store with massively parallelized query processing. The problem is the initial delay in executing MapReduce jobs, caused by the time spent assigning the computation to multiple machines. This can take far more time than necessary and thus hurt query response time. So the parallelism obtained by using MapReduce pays off mainly for queries over huge amounts of RDF data, which take a long time to process anyway. We might consider a selective parallelism, where users decide whether or not to use MapReduce for their queries, as in "select ... '''in parallel'''". A minimal sketch of the partitioned merge join idea is given below.
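The following is a minimal, self-contained Python sketch of that partitioned merge join, intended only to make the idea above concrete. It is not HbaseRDF code: the toy tables, the helper names (`merge_join_partition`, `partition`), and the sequential loop standing in for map workers are invented for the example; in a real system each partition would be handled by one map worker and the tables would be read from Hbase.

{{{#!python numbering=off
# Sketch only: one "map worker" merge-joins a sorted partition of table R
# with the sorted table S on the subject attribute.
def merge_join_partition(r_part, s):
    """r_part and s are lists of (subject, value) pairs, both sorted by subject."""
    results, i, j = [], 0, 0
    while i < len(r_part) and j < len(s):
        (rs, rv), (ss, sv) = r_part[i], s[j]
        if rs < ss:
            i += 1
        elif rs > ss:
            j += 1
        else:
            # Pair this R tuple with every S tuple that shares the subject.
            k = j
            while k < len(s) and s[k][0] == rs:
                results.append((rs, rv, s[k][1]))
                k += 1
            i += 1
    return results

def partition(r, m):
    """Split table R into m contiguous partitions, preserving sort order."""
    size = (len(r) + m - 1) // m
    return [r[i:i + size] for i in range(0, len(r), size)]

if __name__ == '__main__':
    # Toy property tables, already sorted by subject (as DSM columns or Hbase rows are).
    author = [('ID1', 'Fox, Joe'), ('ID2', 'Orr, Tim')]
    copyright_ = [('ID1', '2001'), ('ID2', '1985'), ('ID5', '1995'), ('ID6', '2004')]
    joined = []
    for part in partition(author, 2):  # each partition = one map worker's input
        joined.extend(merge_join_partition(part, copyright_))
    print(joined)  # [('ID1', 'Fox, Joe', '2001'), ('ID2', 'Orr, Tim', '1985')]
}}}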
== HbaseRDF Data Loader ==
HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes it into an Hbase table in such a way that efficient query processing is possible. It reads one triple at a time and inserts it into the table, roughly as in the following pseudocode, where `execute()` stands in for whatever client submits HQL statements to Hbase (a client-side sketch of the same idea is given further below, after the architecture links):

{{{#!python numbering=off
value_count = 0
for s, p, o in triples:
    # Row key = subject, column = property plus a counter qualifier (so that
    # repeated properties do not overwrite each other), cell value = object.
    hql = "insert into rdf_table ('%s:%d') values ('%s') where row='%s'" % (p, value_count, o, s)
    execute(hql)  # execute() is a placeholder for whatever client submits HQL to Hbase
    value_count = value_count + 1
}}}

An example, using the sample data from the C-Store paper:

{{{#!CSV ;
Subj.; Prop.; Obj.
ID1; type; BookType
ID1; title; "XYZ"
ID1; author; "Fox, Joe"
ID1; copyright; "2001"
ID2; type; CDType
ID2; title; "ABC"
ID2; artist; "Orr, Tim"
ID2; copyright; "1985"
ID2; language; "French"
ID3; type; BookType
ID3; title; "MNO"
ID3; language; "English"
ID4; type; DVDType
ID4; title; "DEF"
ID5; type; CDType
ID5; title; "GHI"
ID5; copyright; "1995"
ID6; type; BookType
ID6; copyright; "2004"
}}}

== HbaseRDF Query Processor ==
HbaseRDF Query Processor (HQP) executes RDF queries on the data stored in an Hbase table. It translates RDF queries into Hbase API calls or MapReduce jobs, gathers the results, and returns them to the user (a hand-translated example of this step is also sketched after the architecture links). Query processing proceeds as follows:
 * Parsing, in which a parse tree representing the SPARQL query is constructed.
 * Query rewriting, in which the parse tree is converted into an initial query plan, which is in turn transformed into an equivalent plan that is expected to take less time to execute. We also have to choose which algorithm to use for each operation in the selected plan; among the candidates are MapReduce jobs for parallel algorithms.
 * Execution of the selected plan.

== HbaseRDF Data Materializer ==
HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results in an Hbase table. Later, HQP uses the materialized data for efficient processing of RDF path queries.

== Hbase Shell Extension ==
=== Hbase Shell - RDF Shell ===
{{{
Hbase > rdf;
Hbase RDF version 0.1
Type 'help;' for help.

Hbase.RDF > SELECT ?title
          > FROM rdf_table
          > WHERE { ?book author "Fox, Joe" .
          >         ?book copyright "2001" .
          >         ?book title ?title }

results here.

Hbase.RDF > exit;
Hbase >
}}}

=== Hbase SPARQL ===
 * Support for the full SPARQL syntax
 * Support for a syntax to load RDF data into an Hbase table

== Alternatives ==
 * A triples table stores RDF triples in a single table with three attributes: subject, property, and object.
 * A property table puts properties that are frequently queried together into a single table, to reduce costly self-joins. Used in Jena and Oracle.
 * A decomposed storage model (DSM) uses one table per property, sorted by subject. Used in C-Store.
 * ''Actually, the decomposed storage model is almost the same as the storage model in Hbase.''

== Food for thought ==
 * What are the differences between Hbase and C-Store?
 * Is DSM suitable for Hbase?
 * How do we translate SPARQL queries into MapReduce functions or Hbase API calls?

== Hbase RDF Storage Subsystems Architecture ==
 * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
 * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
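To make the Data Loader's per-triple insert concrete, here is a small client-side sketch. It is only an illustration under stated assumptions, not part of HbaseRDF: it assumes a Thrift-based Python client (happybase), an already-created `rdf_table` whose column families cover the property names used in the data, and a simplistic whitespace-separated triple file; a real loader would use a proper RDF parser.

{{{#!python numbering=off
import happybase  # assumed Thrift-based Hbase client, used here only for illustration

def load_triples(path, host='localhost'):
    """Store each (subject, property, object) triple as one cell:
    row key = subject, column = property:counter, cell value = object."""
    connection = happybase.Connection(host)
    table = connection.table('rdf_table')  # assumed to exist, with one column
                                           # family per property used in the data
    value_count = 0
    with open(path) as f:
        for line in f:
            # Simplistic parsing: one "subject property object" triple per line.
            s, p, o = line.strip().split(None, 2)
            column = '%s:%d' % (p, value_count)
            table.put(s.encode(), {column.encode(): o.encode()})
            value_count += 1
    connection.close()

# Example: the triple (ID1, title, "XYZ") ends up in row 'ID1',
# column 'title:0', with cell value "XYZ".
}}}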
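And here is a hand-written sketch of what the HQP translation step could produce for the example query in the RDF shell session above, when executed as plain Hbase API calls (a full scan with client-side filtering rather than a MapReduce job). The same assumed happybase client and single-table layout are used; the function name and the scan-and-filter strategy are illustrative only and not the actual HQP algorithm.

{{{#!python numbering=off
import happybase  # same assumed client as in the loader sketch

def titles_by_author_and_copyright(author, year, host='localhost'):
    """Rough translation of:
         SELECT ?title FROM rdf_table
         WHERE { ?book author "..." . ?book copyright "..." . ?book title ?title }
    Scan every row (subject), keep rows whose author and copyright cells match,
    and project out the title cells."""
    connection = happybase.Connection(host)
    table = connection.table('rdf_table')
    titles = []
    for row_key, columns in table.scan():
        values = {}
        for col, val in columns.items():
            family = col.split(b':', 1)[0]  # drop the value counter qualifier
            values.setdefault(family, set()).add(val)
        if (author.encode() in values.get(b'author', set())
                and year.encode() in values.get(b'copyright', set())):
            titles.extend(sorted(values.get(b'title', set())))
    connection.close()
    return titles

# Example: titles_by_author_and_copyright('Fox, Joe', '2001') -> [b'XYZ']
}}}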
----
= Papers =
 * ~-OSDI 2004, ''"MapReduce: Simplified Data Processing on Large Clusters"'' - proposes a very simple but powerful and highly parallelized data processing technique.-~
 * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
 * ~-VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning]'' - proposes an efficient method to store RDF data in table projections (i.e., columns) and execute queries on them.-~