Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=11&rev2=12

--------------------------------------------------

  
  = Apache Owl Wiki =
  
- The goal of Owl is to provide a high level data management abstraction.  
!MapReduce and Pig applications interacting directly with HDFS directories and 
files must deal with low level data management issues such as storage formats, 
serialization/compression schemes, data layout, and efficient data access, 
often with different solutions. Owl aims to provide a standard way to address 
these issues and to abstract away the complexities of reading/writing huge 
amounts of data from/to HDFS.
+ == Vision ==
  
- Owl provides a tabular view of data on Hadoop and thus supports the notion of 
''Owl Tables''.  Conceptually, an Owl Table is similar to a relational database 
table.  An Owl Table has these characteristics:
+ Owl provides a more natural abstraction for Map-Reduce and Map-Reduce-based 
technologies (e.g., Pig, SQL) by allowing developers to express large datasets 
as tables, which in turn consist of rows and columns. Owl tables are similar, 
but not identical, to familiar database / data warehouse tables.
  
+ The core M/R programming interface as we know it (the mapper, reducer, output 
collector, record reader, and input format) deals with collections of abstract 
data objects, not files. However, the current set of !InputFormat 
implementations provided by the job API is relatively primitive and heavily 
coupled to file formats and HDFS paths to describe input and output locations. 
From an application programmer’s perspective, one has to think about both the 
abstract data and its physical representation and storage location, which is a 
disconnect from the abstract data API. Meanwhile, file formats and 
(de)serialization libraries have flourished in the Hadoop community, and some 
of them require certain metadata to operate or optimize. While providing 
optimization and performance enhancements, these file formats and SerDe 
libraries don’t make it any easier to develop applications on, or to manage, 
very big data sets. 
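+ 
+ To make the disconnect concrete, the sketch below contrasts today's 
path-and-format job wiring with the table-oriented wiring Owl aims for. The 
Hadoop calls are standard Hadoop 0.20 API; the commented-out 
OwlInputFormat.setInput() call and the table name are illustrative 
assumptions, not the published Owl API.
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+ import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
+ 
+ public class InputWiring {
+   public static void main(String[] args) throws Exception {
+     Configuration conf = new Configuration();
+ 
+     // Today: the job names a concrete file format and HDFS path,
+     // coupling the abstract data API to the physical layout.
+     Job fileJob = new Job(conf, "path-based");
+     fileJob.setInputFormatClass(TextInputFormat.class);
+     FileInputFormat.addInputPath(fileJob, new Path("/data/events/20091201"));
+ 
+     // With Owl: the job would name only a logical table; format, layout,
+     // and location come from the metadata store.  (Hypothetical call:)
+     Job owlJob = new Job(conf, "table-based");
+     // OwlInputFormat.setInput(owlJob, "mydb.events");
+   }
+ }
+ }}}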
-    * lives in an Owl database namespace and can contain multiple partitions
-    * has columns and rows and supports a unified table-level schema
-    * provides interfaces for !MapReduce and Pig Latin and can easily work 
with other languages
-    * designed for efficient batch read/write operations; partitions can be 
added to or removed from a table
-    * supports external tables (data already exists on the file system)
-    * pluggable architecture for different storage formats such as Zebra
-    * presents a logically partitioned view of data and supports very large 
data sets via its multi-level flexible partitioning scheme
-    * efficient data access mechanisms over very large data sets via partition 
and projection pruning
  
  
- Owl has two major public APIs.  ''Owl Driver'' provides management APIs 
against three core Owl abstractions: "Owl Table", "Owl Database", and 
"Partition".  This API is backed by an internal Owl metadata store that runs 
on Tomcat and a relational database.  ''!OwlInputFormat'' provides a data 
access API and is modeled after the traditional Hadoop !InputFormat.  In the 
future, we plan to support ''!OwlOutputFormat'' and thus the notion of an "Owl 
Managed Table", where Owl controls the data flow into and out of "Owl Tables".  
Owl also supports Pig integration via the !OwlPigLoader/Storer module.
+ == High Level Diagram == 
  
- Initially, we would like to open source Owl as a Pig contrib project.  In the 
long term, Owl could become a separate Hadoop subproject, as it provides a 
platform service to all Hadoop applications.
+ As one can see, Owl gives Hadoop users a uniform interface for organizing, 
discovering, and managing data stored in many different formats, and promotes 
interoperability among different programming frameworks. Owl presents a single 
logical view of data organization and hides the complexity of, and evolution 
in, the underlying physical data layout schemes. It gives Hadoop applications 
a stable foundation to build upon. 
+ 
+ == Main Properties and Features ==
+ 
+ 
+ || Feature || Status ||
|| Owl is a stand-alone table store, not tied to any particular data query or 
processing language; it currently supports MR, Pig Latin, and Pig SQL || 
current ||
+ || Owl has a flexible data partitioning model, with multiple levels of 
partitioning, physical and logical partitioning, and partition pruning for 
query optimization || current ||
+ || Owl has a flexible interface for pushing projections and filters all the 
way down || current ||
+ || Owl has a framework for storing data in many storage formats, and 
different storage formats can co-exist within the same table || current ||
|| Owl provides a capability discovery mechanism to allow applications to 
take advantage of unique features of storage drivers || current ||
|| Owl supports both managed tables (completely managed by Owl) and unmanaged 
tables (called "external tables" in many databases) || current (external 
tables only) ||
+ || Owl manages a unified schema for each table (by unifying the schemas of 
its partitions) || current ||
+ || Owl has support for storing custom metadata associated with tables and 
partitions || current ||
+ || Owl has support for automatic data retention management || future ||
|| Owl has support for notifications on data change (new data added, data 
restated, etc.) || future ||
+ || Owl has support for converting data between write-friendly and 
read-friendly formats || future ||
+ || Owl has support for addressing HDFS NameNode limitations by decreasing the 
number of files needed to store very large data sets || future ||
+ || Owl provides a security model for secure data access || future ||
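+ 
+ Of the rows above, partition pruning and projection pushdown are the ones a 
job author exercises most directly. A minimal sketch follows; the setter names 
(setInput, setPartitionFilter, setProjection) and the filter/projection string 
syntax are assumptions for illustration, not the published !OwlInputFormat 
API.
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.mapreduce.Job;
+ 
+ public class PruningSketch {
+   public static void main(String[] args) throws Exception {
+     Job job = new Job(new Configuration(), "owl-pruning");
+ 
+     // Hypothetical Owl calls, commented out so the class compiles
+     // without Owl on the classpath:
+ 
+     // Name the logical table; Owl resolves layout and location.
+     // OwlInputFormat.setInput(job, "mydb.events");
+ 
+     // Partition pruning: scan only partitions matching the predicate,
+     // e.g. a single day of a date-partitioned table.
+     // OwlInputFormat.setPartitionFilter(job, "datestamp = '20091201'");
+ 
+     // Projection pushdown: the storage driver (e.g. Zebra) deserializes
+     // only these columns; all others are never read.
+     // OwlInputFormat.setProjection(job, "userid, url");
+   }
+ }
+ }}}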
  
  
  == Prerequisite ==
@@ -30, +41 @@

  Owl depends on Pig for its tuple classes, which serve as its basic unit of 
data, and on Hadoop 0.20 for !OwlInputFormat.  Its first release will require 
Pig 0.7 and Hadoop 0.20.2.  Owl also requires a storage driver; Owl integrates 
with Zebra 0.7 out-of-the-box.
  
  == Getting Owl ==
+ 
+ Initially, we would like to open source Owl as a Pig contrib project.  In the 
long term, Owl could become a separate Hadoop subproject, as it provides a 
platform service to all Hadoop applications.
  
  Owl would live as a Pig contrib project at:
  
@@ -89, +102 @@

      * deploy owl war file to Tomcat
      * set up -Dorg.apache.hadoop.owl.xmlconfig=<full path to 
owlServerConfig.xml> for the Tomcat deployment
  
- == Sample Code == 
+ == Developing on Owl == 
  
- Owl comes with a Java-based client.  Client API Javadoc is at 
[[attachment:owlJavaDoc.jar]]
+ Owl has two major public APIs.  ''Owl Driver'' provides management APIs 
against three core Owl abstractions: "Owl Table", "Owl Database", and 
"Partition".  This API is backed by an internal Owl metadata store that runs 
on Tomcat and a relational database.  ''!OwlInputFormat'' provides a data 
access API and is modeled after the traditional Hadoop !InputFormat.  In the 
future, we plan to support ''!OwlOutputFormat'' and thus the notion of an "Owl 
Managed Table", where Owl controls the data flow into and out of "Owl Tables".  
Owl also supports Pig integration via the !OwlPigLoader/Storer module.
+ 
+ Client API Javadoc is at [[attachment:owlJavaDoc.jar]]
      * Owl driver API - org.apache.hadoop.owl.client
      * !OwlInputFormat API - org.apache.hadoop.owl.mapreduce
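+ 
+ The sketch below shows the shape of a !MapReduce job that reads through 
!OwlInputFormat. Since Owl uses Pig's tuple classes as its unit of data, the 
mapper plausibly receives org.apache.pig.data.Tuple values; the key type and 
the two commented Owl calls are assumptions, so consult the attached Javadoc 
for the real signatures.
+ 
+ {{{
+ import java.io.IOException;
+ import org.apache.hadoop.io.NullWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.Mapper;
+ import org.apache.pig.data.Tuple;
+ 
+ public class ReadOwlTable {
+ 
+   // Mapper over rows of an Owl table; each row arrives as a Pig Tuple.
+   public static class TupleMapper
+       extends Mapper<NullWritable, Tuple, Text, NullWritable> {
+     @Override
+     protected void map(NullWritable key, Tuple row, Context ctx)
+         throws IOException, InterruptedException {
+       // Tuple.get(int) is the real Pig accessor for a column value.
+       ctx.write(new Text(String.valueOf(row.get(0))), NullWritable.get());
+     }
+   }
+ 
+   public static void main(String[] args) throws Exception {
+     Job job = new Job();
+     job.setJarByClass(ReadOwlTable.class);
+     job.setMapperClass(TupleMapper.class);
+     job.setMapOutputKeyClass(Text.class);
+     job.setMapOutputValueClass(NullWritable.class);
+     // Hypothetical: point the job at an Owl table instead of a path.
+     // job.setInputFormatClass(OwlInputFormat.class);
+     // OwlInputFormat.setInput(job, "mydb.events");
+   }
+ }
+ }}}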
  
