The following page has been changed by AlanGates:
= Owl, Metadata for Hadoop =
A metadata system for the grid will allow users and applications to register
and search for data on the grid. Metadata registration will not be required,
but when available it can greatly improve locating and managing data on the grid.
Certain metadata, such as schema and statistics, can improve robustness and
performance of applications such as Pig.
== Intended Users ==
The metadata system will provide an API for use by Map Reduce jobs, Pig, and
grid data maintenance tools (such as cleaning tools, archive tools, etc.). It
will provide a GUI for end users and grid administrators.
== Use Cases ==
=== Browsing Available Data ===
1. User clicks browse button on GUI
1. User is able to navigate through different Hadoop grids, down the hierarchy
to individual files and directories that represent a data set.
1. At any point the user is able to click on an individual data set to get more
information about it.
=== Search for Data ===
1. User clicks on search button on GUI.
1. User enters attribute data he wishes to search on, such as feed name, date
of creation, or other attributes the data has been tagged with.
1. System returns a list of files and directories that have the specified tags
and for which the user has permissions.
Actor: Data Reader, such as Pig script or Map Reduce job.
1. Data reader searches for data via API. Search may be done via pathname, or
attributes of the data.
1. System returns a list of files and directories that match the search
criteria and for which the user has permissions.
Note that searches can be restricted to a portion of the data hierarchy, such
as only within a given set of data or within a given administrative domain.
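The search semantics described above can be sketched in a few lines of Python. This is an illustrative model only, not Owl's actual API: the `search` function, the dictionary layout, and the permission check are all assumptions made for the example. A data set matches when all requested attributes are present with the requested values and the caller has read permission.

```python
# Hypothetical sketch of the search use case: return only data sets whose
# tags match every requested attribute AND that the caller may read.
# All names here are illustrative, not part of Owl's API.

def search(datasets, query, user):
    """datasets: list of dicts with 'path', 'tags', and 'readers' keys."""
    results = []
    for ds in datasets:
        tags_match = all(ds["tags"].get(k) == v for k, v in query.items())
        if tags_match and user in ds["readers"]:
            results.append(ds["path"])
    return results

datasets = [
    {"path": "/data/feed1/20090701",
     "tags": {"feed": "feed1", "date": "20090701"},
     "readers": {"alice", "bob"}},
    {"path": "/data/feed2/20090701",
     "tags": {"feed": "feed2", "date": "20090701"},
     "readers": {"bob"}},
]

# alice only sees feed1; bob, with wider permissions, sees both.
print(search(datasets, {"date": "20090701"}, "alice"))
```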
=== Data Creation ===
Actor: Data Creator, may be a Pig script, Map Reduce job, or external system
loading files onto HDFS.
1. Data creator creates data, possibly all in one file or directory or
possibly in several directories under a common directory.
1. Once data creation is completed the data creator notifies the metadata
system that the data is available.
1. Metadata system makes data available to other users via notification (see
below), browsing, and search.
1. In the case where several files or directories all under one directory are
being created, the data creator can choose to register each individual file or
directory as it becomes available. It can also choose to register only the top
level directory when finished, at which point all data under that directory
will be available.
=== Data Notification ===
1. Data made available by data creator.
1. Metadata system logs data availability to publicly available feed (such as
an RSS feed).
1. Interested users can subscribe to feed and discover new data sets.
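A subscriber's side of this use case might look like the sketch below. The feed format shown (a bare-bones RSS document with one `item` per data set) is an assumption for the example; the proposal only states that availability is logged to a publicly readable feed.

```python
# Sketch of a subscriber discovering new data sets from the availability
# feed. The RSS-like format is an assumption made for illustration.
import xml.etree.ElementTree as ET

feed = """<rss><channel>
  <item><title>/data/feed1/20090701</title></item>
  <item><title>/data/feed1/20090702</title></item>
</channel></rss>"""

def new_datasets(feed_xml, seen):
    """Return data set paths in the feed that the subscriber has not seen."""
    root = ET.fromstring(feed_xml)
    titles = [item.findtext("title") for item in root.iter("item")]
    return [t for t in titles if t not in seen]

print(new_datasets(feed, seen={"/data/feed1/20090701"}))
```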
=== Processing data via Pig ===
1. As part of load statement, Pig contacts the metadata system to find schema,
loader, and statistics associated with the data. These can be used to do
compile time checking (e.g. type checking), find the correct load function for
a file, and perform optimizations.
1. When Pig stores data, the user can choose to have it record metadata
associated with the stored data.
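One concrete benefit of the stored schema is the compile-time checking mentioned above. The sketch below shows the idea with a hypothetical schema lookup; the schema representation and the `check_projection` helper are illustrative assumptions, not Pig internals.

```python
# Sketch of compile-time checking: given a schema fetched from the metadata
# system, reject references to fields that do not exist before running the
# job. The schema format and helper are assumptions for illustration.

schema = {"user": "chararray", "age": "int", "clicks": "long"}  # from metadata

def check_projection(schema, fields):
    """Validate that every projected field exists; return its typed subset."""
    missing = [f for f in fields if f not in schema]
    if missing:
        raise ValueError("unknown fields: %s" % ", ".join(missing))
    return {f: schema[f] for f in fields}

print(check_projection(schema, ["user", "age"]))
```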
== Requirements ==
1. It will be possible to browse, search, and create metadata from Map Reduce
jobs.
1. It will be possible to browse, search, and create metadata from Pig Latin
scripts.
1. It will be possible to browse, search, and administrate metadata from a GUI.
1. It will be possible for data maintenance tools (such as cleaning tools,
archive tools, replication tools, etc.) to use the metadata system to track
files and directories that they need to maintain.
1. The metadata system will support multiple administrative domains. These
domains will allow groups of users to tag their data using a common set of
attributes and store their data in a common set of directories.
1. The metadata system will allow administrators to control which users can
read and write metadata. This control may be done at the administrative domain
level rather than at the individual data set level. For example, users would
be able to read all metadata included in an administrative domain they have
access to.
1. The metadata system will support notifying users when new data is
available. Note that this notification is not intended to replace or supplant a
workflow system, but rather to provide the necessary information to a workflow
system that can offer much more sophisticated features.
1. The metadata system will remain optional; Pig, Map Reduce, and HDFS will
continue to work with or without it.
1. Users will be able to tag their data with key value pairs they define.
== Overview of Architecture ==
A persistent storage mechanism will be needed to store the metadata. Storing
the metadata in HDFS was considered, but this would require the use of an
indexing mechanism in order to facilitate fast search, and the use of a locking
mechanism to avoid read and write conflicts. For these reasons, an RDBMS will
be used to store the metadata. The system will be designed in such a way that
any SQL-92 compliant RDBMS can be used. Some large metadata items (e.g. a
histogram of the keys in a file) may be stored in HDFS to avoid overloading
the RDBMS.
A REST based web services API will be used for communications between clients
and the metadata service. Web services were chosen because they free the
metadata system from needing to provide bindings for the various languages
that users will want to use to communicate with the system. REST was chosen as
the web services protocol for its ease of use and ubiquitous support.
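To make the REST choice concrete, a client search request might be assembled as below. The endpoint path (`/search`) and parameter names are assumptions made for this sketch; the proposal fixes only that the protocol is REST over HTTP.

```python
# Sketch of constructing a REST search request against the metadata service.
# The base URL, endpoint path, and parameter names are hypothetical.
from urllib.parse import urlencode

def search_url(base, attributes):
    """Encode search attributes as query parameters (sorted for stability)."""
    return base + "/search?" + urlencode(sorted(attributes.items()))

url = search_url("http://owl.example.com/api",
                 {"feed": "feed1", "date": "20090701"})
print(url)  # http://owl.example.com/api/search?date=20090701&feed=feed1
```

Because any HTTP client can issue such a request, no per-language binding library is required, which is exactly the property the paragraph above calls out.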
== Data Model ==
Metadata will be modeled using the following concepts:
Facets: A Facet is a key value pair that can be associated with data. It can
be used by users to tag their data, and by tools to record information about
the data. For example, a user could assign a Facet of `priority: high` and a
cleaning tool could assign a Facet of `expirationdate:20090701`. Certain
Facets can be required (see below). Users can also define their own Facets and
associate them with Data Collections or Data Units.
Catalog: A Catalog is an administrative domain. A group of users working with
similar data can create a Catalog. Within a Catalog it will be possible to
require the use of certain Facets. For example, it could be required that all
data in a given Catalog must have a priority Facet or a datestamp Facet. A
Catalog is associated with one or more directories in HDFS. It will be
possible to define which users can read metadata in a Catalog, and which users
can write metadata in a Catalog.
Data Collection: A Data Collection is a logical collection of data. It is
contained within a Catalog. It is associated with a directory in HDFS. Its
Facets are used to partition the data in it into separate Data Units (see
below). Facets can be attached to a Data Collection.
Data Unit: A unit of data (either a file or a directory of part or map files)
that Pig or Map Reduce can operate on. Data Units are contained within a Data
Collection or within other Data Units. Users can associate Facets with Data
Units. Schemas and statistics can be associated with Data Units. Facets
attached to a Data Collection or Data Unit that contains a given Data Unit are
inherited by that Data Unit. Data Units will not usually be individual map or
part files, but a collection of map or part files making up a unit of data to
be processed by Pig or Map Reduce.
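The inheritance rule in the Data Unit description (Facets flow down from a containing Data Collection or Data Unit, with nearer Facets taking precedence) can be encoded in a short sketch. The class and field names are illustrative, not Owl's schema.

```python
# Illustrative encoding of Facet inheritance in the data model: a Data Unit's
# effective Facets merge those of its enclosing Data Collection (and any
# containing Data Units), with Facets set nearer the unit overriding
# inherited ones. Names here are a sketch, not Owl's actual schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    facets: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def effective_facets(self):
        merged = dict(self.parent.effective_facets()) if self.parent else {}
        merged.update(self.facets)  # nearer Facets override inherited ones
        return merged

collection = Node("logs", {"priority": "high"})
unit = Node("20090701", {"expirationdate": "20090701"}, parent=collection)
print(unit.effective_facets())
# {'priority': 'high', 'expirationdate': '20090701'}
```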
Figure 1: Data model.
Figure 2: An example of storing data in owl.
== Alternatives Considered ==
Hive, a subproject of Hadoop, currently has a metadata management system. It
presents a relational model to users as part of Hive's SQL interface. This is
a good fit for Hive, but does not fit well with Map Reduce, Pig, and grid data
maintenance tools that view the grid as a large file system.