Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by AlanGates:

New page:
= Owl, Metadata for Hadoop =

A metadata system for the grid will allow users and applications to register 
and search for data on the grid.  Metadata registration will not be required, 
but when available it can greatly improve locating and managing data on the 
grid.  Certain metadata, such as schemas and statistics, can improve the 
robustness and performance of applications such as Pig. 

== Intended Users ==
The metadata system will provide an API for use by Map Reduce jobs, Pig, and 
grid data maintenance tools (such as cleaning tools, archive tools, etc.).  It 
will provide a GUI for end users and grid administrators.

== Use Cases ==

=== Browsing Available Data ===
Actor:  User
 1. User clicks browse button on GUI
 1. User is able to navigate through different Hadoop grids, down the hierarchy 
to individual files and directories that represent a data set.
 1. At any point the user is able to click on an individual data set to get 
more information about it.

=== Search for Data ===
Actor:  User
 1. User clicks on search button on GUI.
 1. User enters attribute data he wishes to search on, such as feed name, date 
of creation, or other attributes the data has been tagged with.
 1. System returns a list of files and directories that have the specified tags 
and for which the user has permissions.

Actor:  Data Reader, such as Pig script or Map Reduce job.
 1. Data reader searches for data via API.  Search may be done via pathname, or 
attributes of the data.
 1. System returns a list of files and directories that match the search 
criteria and for which the user has permissions.
Note that searches can be restricted to a portion of the data hierarchy, such 
as to a given set of data or a given administrative domain.
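
The attribute search step above can be sketched as a query-building helper.  The endpoint name and parameter keys here are assumptions for illustration only; the actual REST API is not yet specified.

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- Owl's actual REST API is not yet specified.
BASE_URL = "http://owl.example.com/api/search"

def build_search_url(base_url, **attributes):
    # Encode attribute key/value pairs (feed name, creation date, user tags)
    # into a query string; sorting keeps the URL deterministic.
    return base_url + "?" + urlencode(sorted(attributes.items()))

url = build_search_url(BASE_URL, feed="news", datestamp="20090701")
```

The server side would then apply the caller's permissions before returning matching files and directories.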

=== Data Creation ===
Actor:  Data Creator, may be a Pig script, Map Reduce job, or external system 
loading files onto HDFS.
 1. Data creator creates data, possibly all in one file or directory or 
possibly in several directories under a common directory.
 1. Once data creation is completed the data creator notifies the metadata 
system that the data is available.
 1. Metadata system makes data available to other users via notification (see 
below), browsing, and search.
 1. In the case where several files or directories all under one directory are 
being created, the data creator can choose to register each individual file or 
directory as it becomes available.  It can also choose to register only the top 
level directory when finished, at which point all data under that directory 
will be available.
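
The registration behavior described in the steps above can be sketched with a minimal in-memory registry: registering a top level directory makes everything beneath it available.  The class and method names are illustrative, not Owl's actual API.

```python
# Minimal in-memory sketch of data registration.  Registering a directory
# makes it and everything under it available.
class MetadataRegistry:
    def __init__(self):
        self.registered = set()

    def register(self, path):
        self.registered.add(path.rstrip("/"))

    def is_available(self, path):
        # A path is available if it, or any ancestor directory, was registered.
        parts = path.rstrip("/").split("/")
        return any("/".join(parts[:i]) in self.registered
                   for i in range(1, len(parts) + 1))

reg = MetadataRegistry()
reg.register("/data/feeds/news/20090701")
reg.is_available("/data/feeds/news/20090701/part-00000")  # True
reg.is_available("/data/feeds/news/20090630")             # False
```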

=== Data Notification ===
Actor:  Metadata system
 1. Data is made available by the data creator.
 1. Metadata system logs data availability to a publicly available feed.
 1. Interested users can subscribe to feed and discover new data sets.

=== Processing data via Pig ===
Actor:  Pig
 1. As part of load statement, Pig contacts the metadata system to find schema, 
loader, and statistics associated with the data.  These can be used to do 
compile time checking (e.g. type checking), find the correct load function for 
a file, and perform optimizations.
 1. When Pig stores data, the user can choose to have it record metadata 
associated with the stored data.
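
The lookup Pig performs at load time might return something shaped like the following.  The dictionary layout is purely illustrative (only `PigStorage` is a real Pig load function); it shows how schema, loader, and statistics each feed a different use named above.

```python
# Illustrative sketch only: the shape of the answer Pig's load step could
# fetch from the metadata system for a registered path.
METADATA = {
    "/data/feeds/news": {
        "schema": [("url", "chararray"), ("clicks", "long")],  # type checking
        "loader": "PigStorage",                                # load function
        "stats": {"rows": 1000000},                            # optimization
    }
}

def load_info(path):
    """Return schema, loader, and statistics for a path, or None if unregistered."""
    return METADATA.get(path)

info = load_info("/data/feeds/news")
```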

== Requirements ==
 1. It will be possible to browse, search, and create metadata from Map Reduce 
jobs.
 1. It will be possible to browse, search, and create metadata from Pig Latin 
scripts.
 1. It will be possible to browse, search, and administer metadata from a GUI.
 1. It will be possible for data maintenance tools (such as cleaning tools, 
archive tools, replication tools, etc.) to use the metadata system to track 
files and directories that they need to maintain.
 1. The metadata system will support multiple administrative domains.  These 
domains will allow groups of users to tag their data using a common set of 
attributes and store their data in a common set of directories.
 1. The metadata system will allow administrators to control which users can 
read and write metadata.  This control may be done at the administrative domain 
level rather than at the individual data set level.  For example, users would 
be able to read all metadata included in an administrative domain they have 
access to.
 1. The metadata system will support notifying users when new data is 
available.  Note that this notification is not intended to replace a workflow 
system, but rather to provide necessary information to a workflow system, which 
can offer much more sophisticated features.
 1. The metadata system will remain optional; Pig, Map Reduce, and HDFS will 
continue to work with or without it.
 1. Users will be able to tag their data with key value pairs they define.

== Overview of Architecture ==
A persistent storage mechanism will be needed to store the metadata.  Storing 
the metadata in HDFS was considered, but this would require an indexing system 
to facilitate fast search and a locking mechanism to avoid read and write 
conflicts.  For these reasons, an RDBMS will be used to store the persistent 
data.  The system will be designed in such a way that any SQL-92 compliant 
RDBMS can be used.  Some large metadata items (e.g. a histogram of the keys in 
a file) may be stored in HDFS to avoid overloading the RDBMS.

A REST based web services API will be used for communication between clients 
and the metadata service.  Web services were chosen because they free the 
metadata system from needing to provide bindings for the various languages that 
users will want to use to communicate with the system.  REST was chosen as the 
web services protocol because of its ease of use and ubiquitous support.
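
As an illustration of the REST approach, a client registering metadata might assemble a request like the following.  The URI layout and body fields are invented for this sketch; Owl's resource model is not yet defined.

```python
import json

def registration_request(catalog, path, facets):
    # Assemble a REST-style request.  The URI layout and body fields are
    # hypothetical, not Owl's defined API.
    return {
        "method": "PUT",
        "uri": "/catalogs/%s/dataunits" % catalog,
        "body": json.dumps({"path": path, "facets": facets}, sort_keys=True),
    }

req = registration_request("searchdata", "/data/feeds/news", {"priority": "high"})
```

Because the payload is plain JSON over HTTP, any language with an HTTP client can talk to the service without a dedicated binding.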

== Data Model ==
Metadata will be modeled using the following concepts:

Facets:  A Facet is a key value pair that can be associated with data.  It can 
be used by users to tag their data, by tools to record information about the 
data, etc. 
For example, a user could assign a Facet of `priority: high` and a cleaning 
tool could assign a Facet of `expirationdate:20090701`.  Certain Facets can be 
required (see below).  Users can also define their own Facets and associate 
them with Data Collections or Data Units.

Catalog:  A Catalog is an administrative domain.  A group of users working with 
similar data can create a Catalog.  Within a Catalog it will be possible to 
mandate the
use of certain Facets.  For example, it could be required that all data in a 
given Catalog must have a priority Facet or a datestamp Facet.  A Catalog is 
associated with one or
more directories in HDFS.  It will be possible to define which users can read 
metadata in a Catalog, and which users can write metadata in a Catalog.
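
The mandated-Facet rule can be sketched as a simple validation check.  This is a sketch only, assuming validation happens when metadata is written; the class and method names are illustrative.

```python
# Sketch of a Catalog mandating certain Facet keys, as described above.
class Catalog:
    def __init__(self, name, required_facets=()):
        self.name = name
        self.required_facets = set(required_facets)

    def validate(self, facets):
        """Return the mandated Facet keys missing from a data set's Facets."""
        return self.required_facets - facets.keys()

cat = Catalog("searchdata", required_facets={"priority", "datestamp"})
missing = cat.validate({"priority": "high"})  # {'datestamp'}
```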

Data Collection:  A Data Collection is a logical collection of data.  It is 
contained within a Catalog.  It is associated with a directory in HDFS.  It 
defines which
Facets are used to partition the data in it into separate Data Units (see 
below).  Facets can be attached to a Data Collection.

Data Unit:  A Data Unit is a unit of data (either a file or a directory of part 
or map files) that Pig or Map Reduce can operate on.  Data Units are contained 
within a Data Collection or other Data Units.  Users can associate Facets with 
Data Units.  Schemas and statistics can be associated with Data Units.  Facets 
attached to a Data Collection or Data Unit that contains a given Data Unit are 
inherited by that Data Unit.  Data Units will not usually be individual map or 
part files, but rather a collection of map or part files making up a unit of 
data to be processed by Pig or Map Reduce.
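
Facet inheritance down the containment hierarchy can be sketched as a merge from the outermost level inward.  The document does not specify how conflicts are resolved, so this sketch assumes a nearer level overrides an enclosing one.

```python
# Sketch of Facet inheritance.  `levels` holds Facet dicts ordered from the
# enclosing Data Collection down to the Data Unit itself; a nearer level
# overrides an enclosing one (an assumption -- conflict handling is unspecified).
def effective_facets(levels):
    merged = {}
    for facets in levels:
        merged.update(facets)
    return merged

result = effective_facets([
    {"priority": "high"},            # set on the Data Collection
    {"expirationdate": "20090701"},  # set on an enclosing Data Unit
    {"priority": "low"},             # set on the Data Unit itself
])
# result == {'priority': 'low', 'expirationdate': '20090701'}
```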


Figure 1:  Data model.


Figure 2:  An example of storing data in Owl.

== Alternatives Considered ==
Hive, a subproject of Hadoop, currently has a metadata management system.  It 
presents a relational model to users as part of Hive's SQL interface.  This is 
a good fit for Hive, but does not fit well with Map Reduce, Pig, and grid data 
maintenance tools that view the grid as a large file system.
