Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
  [[TableOfContents(4)]]
  ----
- = Hbase RDF Storage Subsystems =
+ = HbaseRDF, an Hbase subsystem for RDF =
  -- ''Any comments on HbaseRDF are welcome.''
@@ -21, +21 @@
   * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development center, NHN corp.)
   * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab., KAIST)
- == Background ==
+ == Considerations ==
+ The Sawzall paper from Google says that the MapReduce framework
+ is not well suited to table joins. Joins are possible, but while we are reading one table
+ in the map or reduce functions, we have to read the other tables on the fly.
+ There is a paper on this subject written by Yang et al. from Yahoo (SIGMOD 07).
- ~-''Do explain: Why do we think about storing and retrieving RDF in Hbase? -- udanax''-~
-
- [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
-
- == Rationale ==
-
-  * What is RDF
-  * Previous methods for storing RDF data and processing queries
-  * Their weak points
-  * The method in Hbase
-  * Strong points
-
- == Considerations ==
- The Sawzall paper says that the record-at-a-time model is not good for table joins.
- We think this problem occurs for typical join operations.
-
- Let us consider how to implement general join operations with MapReduce.
- When we perform a nested loop join, for each tuple in the outer table,
- table R, we have to go through the inner table, table S, to find
- all the joining tuples. To perform nested loop joins with MapReduce,
- we divide table R into M partitions, and each
- map worker joins one partition of the table at a time with S.
- Each map worker produces the portion of the join results corresponding
- to the partition it is in charge of.
-
- For merge joins, since table R is already sorted by
- subject, each map worker only needs to read log N tuples in table S, where
- N is the number of tuples in S, which results in fast join performance.
-
- There is a paper on this subject
- written by Yang et al. from Yahoo (SIGMOD 07). The paper provides
- Map-Reduce-Merge, which is an extended version of MapReduce,
+ The paper provides Map-Reduce-Merge, which is an extended version of the MapReduce framework,
- that implements several relational operators including joins.
+ that implements several relational operators, including joins. They have extended the
+ MapReduce framework with an additional Merge phase to implement efficient data relationship processing. See the Paper section below for more information. -- Thanks, stack.
-
- The key parts are how efficiently MapReduce performs joins and
- how cleverly we materialize join results for efficient query processing of RDF path queries. In particular, the former is the key to beating
- C-Store's query performance: defeat C-Store with massively parallelized query
- processing. But the problem is that there is an initial delay in executing MapReduce jobs due to the time spent in assigning the computations to multiple machines. This
@@ -69, +38 @@
  We might consider a selective parallelism where people can decide
  whether to use MapReduce or not to process their queries, as in "select ... '''in parallel'''".
+
+ Now that we have two sets of join algorithms, non-parallel versions and parallel versions with Map-Reduce-Merge,
+ we are ready to do some massively parallel query processing on tremendous amounts of RDF data.
+ Currently, C-Store shows the best query performance on RDF data.
+ However, armed with Hbase and Map-Reduce-Merge, we can do even better.
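A minimal sketch of the partitioned nested-loop join described in the Considerations section above, with plain Java standing in for real map workers (the Tuple type, the join condition, and the partition count M are illustrative assumptions, not part of HbaseRDF):

{{{#!java
import java.util.List;

// Single-process illustration of the partitioned nested-loop join: the
// outer table R is split into M partitions, and each "map worker" joins
// only its own partition against the whole inner table S.
public class PartitionedNestedLoopJoin {

    // An RDF-style tuple: (subject, property, object). Illustrative only.
    record Tuple(String subject, String property, String object) {}

    // One path-query join step: the object of r must equal the subject of s.
    static boolean joins(Tuple r, Tuple s) {
        return r.object().equals(s.subject());
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(
                new Tuple("alice", "knows", "bob"),
                new Tuple("bob", "knows", "carol"));
        List<Tuple> s = List.of(
                new Tuple("bob", "worksAt", "kaist"),
                new Tuple("carol", "worksAt", "nhn"));

        final int m = 2; // number of partitions of R, one per map worker
        for (int worker = 0; worker < m; worker++) {
            // Each worker scans only its own slice of R ...
            for (int i = worker; i < r.size(); i += m) {
                // ... but still scans all of S for every outer tuple, which
                // is exactly the cost a merge join over subject-sorted
                // tables avoids.
                for (Tuple inner : s) {
                    if (joins(r.get(i), inner)) {
                        System.out.println(r.get(i) + " |><| " + inner);
                    }
                }
            }
        }
    }
}
}}}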
  == HbaseRDF Data Loader ==
  HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data
@@ -157, +131 @@
   * A decomposed storage model (DSM), one table for each property, sorted by the subject. Used in C-Store.
   * ''Actually, the decomposed storage model is almost the same as the storage model in Hbase (see the sketch at the end of this page).''
+ == Hbase Storage for RDF ==
+
+ ~-''Do explain: Why do we think about storing and retrieving RDF in Hbase? -- udanax''-~
+
+ [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
+
+ == Rationale ==
+
+  * What is RDF
+  * Previous methods for storing RDF data and processing queries
+  * Their weak points
+  * The method in Hbase
+  * Strong points

  == Food for thought ==
   * What are the differences between Hbase and C-Store?
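A similarly minimal sketch of the decomposed storage model mentioned above, with in-memory TreeMaps standing in for Hbase's subject-sorted tables (the properties, the path query, and the one-object-per-subject simplification are all assumptions made for illustration, not the actual HbaseRDF schema):

{{{#!java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// One subject-sorted table per property, as in the DSM used by C-Store.
public class DsmSketch {
    public static void main(String[] args) {
        // "knows" table: subject -> object, kept sorted by subject.
        NavigableMap<String, String> knows = new TreeMap<>();
        knows.put("alice", "bob");
        knows.put("bob", "carol");

        // "worksAt" table: subject -> object, kept sorted by subject.
        NavigableMap<String, String> worksAt = new TreeMap<>();
        worksAt.put("bob", "kaist");
        worksAt.put("carol", "nhn");

        // Path query ?x knows ?y . ?y worksAt ?z: join the object column
        // of "knows" against the sorted subject column of "worksAt". Each
        // probe costs O(log N), the fast-join case from the Considerations
        // section.
        for (Map.Entry<String, String> step : knows.entrySet()) {
            String z = worksAt.get(step.getValue());
            if (z != null) {
                System.out.println(step.getKey() + " -> " + step.getValue() + " -> " + z);
            }
        }
    }
}
}}}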