The following page has been changed by AntonioMagnaghi:
http://wiki.apache.org/pig/PigAbstractionLayer

##master-page:FrontPage
#format wiki
#language en
#pragma section-numbers off

= Pig Abstraction Layer =

== Introduction and Rationale ==

Many of the activities that Pig carries out during the compilation and 
execution stages of Pig Latin queries are currently deeply tied to the 
Hadoop file system and the Hadoop Map-Reduce paradigm.

For instance, file management tasks, job submission and job tracking in the 
Pig client explicitly assume the availability of a Hadoop cluster to which 
the client connects.

It is possible, however, to envision an architecture where the front-end part 
of the system (i.e. Pig client) may have a more abstract notion of the back-end 
portion. In this context, a Hadoop cluster could be regarded as a particular 
instance amongst a family of different back-ends, all of which provide similar 
functionalities that can be accessed via the same API.

The main motivations behind this proposal can be summarized as follows:

- The availability of well-defined APIs that a back-end needs to support in 
order to run Pig Latin queries makes it easier to port Pig to different 
platforms. Hence, this could foster wider adoption of Pig.

- Changes in the various back-ends can be encapsulated within the actual 
implementations of the generic APIs. Hence, the front-end requires fewer 
modifications, resulting in a more stable code-base.

A proper API design should be general enough to easily support the various 
back-ends that Pig currently targets: Hadoop, Galago (see the section below) 
and the local back-end (i.e. the local file system and the local execution 
type).

== Relevant links ==
[http://www.galagosearch.org/ Galago] is a research project started by Trevor 
Strohman at the University of Massachusetts, Amherst. Galago is a search 
engine with its own execution back-end.

Galago is able to execute Pig Latin queries by translating them into its own 
representation language (TupleFlow jobs).

== API Specification ==
The basic functionalities that a back-end may need to export to the Pig client 
could be categorized into two main abstractions:

- '''Data Storage''': provides functionalities that pertain to storing and 
retrieving data. It encapsulates the typical operations supported by file 
systems, such as creating a data object and opening it for reading or writing.

- '''Query Execution/Tracking''': provides functionalities to parse a Pig Latin 
program and submit a compiled Pig job to a back-end. This API should enable the 
front-end to track the current status of a job, its progress, diagnostic 
information and possibly to terminate it.

The sections below provide some initial suggestions for possible APIs for the 
Data Storage and Query Execution abstractions.
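
Purely as a hypothetical starting point for the Query Execution abstraction, 
an interface along the following lines could be imagined. None of these names 
exist in Pig today; they are invented here only to illustrate the intended 
contract (submit, track, diagnose, terminate):

```java
// Hypothetical sketch of a Query Execution abstraction; all names are
// invented for illustration and are not part of any existing Pig API.
interface QueryExecution {
    enum JobStatus { QUEUED, RUNNING, COMPLETED, FAILED, KILLED }

    /** Compiles a Pig Latin program, submits it, and returns a job handle. */
    String submit(String pigLatinProgram) throws Exception;

    /** Current status of a previously submitted job. */
    JobStatus getStatus(String jobId);

    /** Progress of the job in the range [0.0, 1.0]. */
    double getProgress(String jobId);

    /** Back-end-specific diagnostic messages for the job. */
    String getDiagnostics(String jobId);

    /** Best-effort termination of a running job. */
    void kill(String jobId);
}
```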

=== Back-End Configuration ===
This interface abstracts functionalities for management of configuration 
information for both the Data Storage and Query Execution portions of a 
back-end.

{{{
package org.apache.pig.backend;

import java.io.Serializable;
import java.util.Map;
import java.net.URI;

 /** Abstraction for a generic property object that can be
  * used to specify configuration information, stats...
  * Information is represented in the form of (key, value)
  * pairs.
  */
public interface PigBackEndProperties extends Serializable, 
                                              Iterable<String> {
        /**
         * Introduces a new (key, value) pair or updates one already 
         * associated to key.
         * 
         * @param key - the key to insert/update
         * @param value -the value for the given key
         * @return - the value of the old key, if it exists, null otherwise
         */
        public Object setValue(String key, Object value);
                
        /**
         * Given a resource, update configuration information. 
         * 
         * @param resource - the resource from which property values come
         * @return the set of keys and corresponding values that have been 
         *         updated. If resource contains/updates the same key 
         *         multiple times, only the initial value of key is returned.
         */
        public Map<String, Object> addFromResource(URI resource);
                
        /**
         * Creates or updates (key, value) pairs with information 
         * from other.
         * 
         * @param other - source of properties
         * @return - keys that have been updated, if any, and the 
         *           corresponding old values
         */
        public Map<String, Object> merge(PigBackEndProperties other);
                
        /**
         * Removes (key, value) pair if present
         * @param key - key to remove
         * @return - value of key, if key was present, null otherwise
         */
        public Object delete(String key);
                
        /**
         * Returns value of a key
         * @param key
         * @return value of key if present, null otherwise.
         */
        public Object getValue(String key);
        
        /**
         * @return number of (key, value) pairs stored
         */
        public long getCount();
}
}}}
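
To illustrate the contract above, here is a minimal in-memory sketch backed 
by a `HashMap`. The interface is inlined in trimmed form (the URI-based 
`addFromResource` is omitted) so the example is self-contained; the class 
name `InMemoryProps` is hypothetical:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Trimmed copy of the interface above, inlined for self-containment.
interface PigBackEndProperties extends Serializable, Iterable<String> {
    Object setValue(String key, Object value);
    Object getValue(String key);
    Object delete(String key);
    long getCount();
    Map<String, Object> merge(PigBackEndProperties other);
}

// Hypothetical in-memory implementation; not part of any existing back-end.
class InMemoryProps implements PigBackEndProperties {
    private final Map<String, Object> props = new HashMap<String, Object>();

    public Object setValue(String key, Object value) {
        return props.put(key, value);   // old value, or null if key was new
    }
    public Object getValue(String key) { return props.get(key); }
    public Object delete(String key)   { return props.remove(key); }
    public long getCount()             { return props.size(); }
    public Iterator<String> iterator() { return props.keySet().iterator(); }

    // Copies all pairs from other; reports the old values of overwritten keys.
    public Map<String, Object> merge(PigBackEndProperties other) {
        Map<String, Object> old = new HashMap<String, Object>();
        for (String key : other) {
            Object prev = props.put(key, other.getValue(key));
            if (prev != null) old.put(key, prev);
        }
        return old;
    }
}
```

Note how `setValue` and `merge` honor the "return the old values" contract of 
the Javadoc, which lets a caller detect which settings were overridden.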

=== Data Storage ===
This is a possible API for a generic interface that abstracts on the actual 
details used to store/persist collections of objects.

{{{
package org.apache.pig.datastorage;

import org.apache.pig.backend.PigBackEndProperties;

import java.io.Serializable;
import java.util.Map;
import java.net.URI;

/**
 * Abstraction for a generic property object that can be
 * used to specify Data Storage configuration information, stats...
 */
public interface DataStorageProperties extends PigBackEndProperties {
   ... 
}
}}}

{{{
package org.apache.pig.datastorage;

public interface DataStorage {
        
        /**
         * Place holder for possible initialization activities.
         */
        public void init();

        /**
         * Clean-up and releasing of resources.
         */
        public void close();
        
        /**
         * Provides configuration information about the storage itself.
         * For instance global data-replication policies if any, default
         * values, ... Some of such values could be overridden at a finer 
         * granularity (e.g. on a specific object in the Data Storage)
         * 
         * @return - configuration information
         */
        public DataStorageProperties getConfiguration();
        
        /**
         * Provides a way to change configuration parameters
         * at the Data Storage level. For instance, change the 
         * data replication policy.
         * 
         * @param newConfiguration - the new configuration settings
         * @throws DataStorageConfigurationException when configuration 
         *         conflicts are detected
         * 
         */
        public void updateConfiguration(DataStorageProperties 
                                      newConfiguration) 
             throws DataStorageConfigurationException;
        
        /**
         * Provides statistics on the Storage: capacity values, how much 
         * storage is in use...
         * @return statistics on the Data Storage
         */
        public DataStorageProperties getStatistics();
                
        /**
         * Creates an entity handle for an object (no containment
         * relation)
         *
         * @param name of the object
         * @return an object descriptor
         * @throws DataStorageException if name does not conform to naming 
         *         convention enforced by the Data Storage.
         */
        public DataStorageElementDescriptor asElement(String name) 
             throws DataStorageException;
        
        /**
         * Creates an entity handle for a container.
         * 
         * @param name of the container
         * @return a container descriptor
         * @throws DataStorageException if name does not conform to naming 
         *         convention enforced by the Data Storage.
         */
        public DataStorageContainerDescriptor asContainer(String name) 
             throws DataStorageException;

}
}}}
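
The distinction between `asElement` and `asContainer` can be illustrated with 
the local file system, where elements map naturally to files and containers 
to directories. The helper class below is purely hypothetical and does not 
implement the actual interfaces, so that the sketch stays self-contained:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical local-filesystem illustration of the asElement/asContainer
// naming checks: a name that denotes a directory cannot be an element,
// and a name that denotes a regular file cannot be a container.
class LocalStorageNames {
    // cf. DataStorage.asElement(String)
    static Path asElement(String name) {
        Path p = Paths.get(name);
        if (Files.isDirectory(p))
            throw new IllegalArgumentException(name + " names a container");
        return p;
    }
    // cf. DataStorage.asContainer(String)
    static Path asContainer(String name) {
        Path p = Paths.get(name);
        if (Files.isRegularFile(p))
            throw new IllegalArgumentException(name + " names an element");
        return p;
    }
}
```

A real Data Storage implementation would return descriptor objects rather 
than raw paths, and could enforce additional back-end-specific naming rules.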

=== Data Storage Descriptors ===

{{{
package org.apache.pig.datastorage;


public interface DataStorageElementDescriptor 
                 extends Comparable<DataStorageElementDescriptor> {
        /**
         * Opens a stream to which an entity can be written.
         * 
         * @param configuration information at the object level
         * @return stream where to write
         * @throws DataStorageException
         */
        public DataStorageOutputStream create(
                     DataStorageProperties configuration) 
             throws DataStorageException;

        /**
         * Copy entity from an existing one, possibly residing in a 
         * different Data Storage.
         * 
         * @param dstName name of entity to create
         * @param dstConfiguration configuration for the new entity
         * @param removeSrc if src entity needs to be removed after copying it
         * @throws DataStorageException for instance, configuration 
         *         information for new entity is not compatible with 
         *         configuration information at the Data
         *         Storage level, user does not have privileges to read from
         *         source entity or write to destination storage...
         */
        public void copy(DataStorageElementDescriptor dstName,
                       DataStorageProperties dstConfiguration,
                       boolean removeSrc) 
             throws DataStorageException;
        
        /**
         * Opens a given entity for reading.
         * 
         * @return entity to read from
         * @throws DataStorageException e.g. entity does not exist...
         */
        public DataStorageInputStream open() throws DataStorageException;

        /**
         * Open an element in the Data Storage with support for random access 
         * (seek operations).
         * 
         * @return a seekable input stream
         * @throws DataStorageException
         */
        public DataStorageSeekableInputStream sopen() 
             throws DataStorageException;
        
        /**
         * Checks whether the entity exists or not
         * 
         * @return true if entity exists, false otherwise.
         */
        public boolean exists();
        
        /**
         * Changes the name of an entity in the Data Storage
         * 
         * @param newName new name of entity 
         * @throws DataStorageException 
         */
        public void rename(DataStorageElementDescriptor newName) 
             throws DataStorageException;

        /**
         * Remove entity from the Data Storage.
         * 
         * @throws DataStorageException
         */
        public void delete() throws DataStorageException;

        /**
         * Retrieve configuration information for entity
         * @return configuration
         */
        public DataStorageProperties getConfiguration();

        /**
         * Update configuration information for this entity
         *
         * @param newConfig configuration
         * @throws DataStorageException
         */
        public void updateConfiguration(DataStorageProperties newConfig) 
             throws DataStorageException;
        
        /**
         * List entity statistics
         * @return DataStorageProperties
         */
        public DataStorageProperties getStatistics();
}
}}}
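
To make the element contract concrete, here is a minimal local-file analogue 
of the create/open/exists/rename/delete operations. The class `LocalElement` 
is hypothetical and mirrors the semantics above using `java.nio.file` without 
implementing the actual interface, so the sketch stays self-contained:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical local-filesystem analogue of DataStorageElementDescriptor.
class LocalElement {
    private Path path;

    LocalElement(String name) { this.path = Paths.get(name); }

    OutputStream create() throws IOException {   // cf. create(configuration)
        return Files.newOutputStream(path);
    }
    InputStream open() throws IOException {      // cf. open()
        return Files.newInputStream(path);
    }
    boolean exists() {                           // cf. exists()
        return Files.exists(path);
    }
    void rename(LocalElement newName) throws IOException {  // cf. rename(...)
        Files.move(path, newName.path, StandardCopyOption.REPLACE_EXISTING);
        path = newName.path;
    }
    void delete() throws IOException {           // cf. delete()
        Files.deleteIfExists(path);
    }
}
```

An HDFS-backed implementation would route the same calls through the Hadoop 
file system client instead of `java.nio.file`.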

{{{
package org.apache.pig.datastorage;

import org.apache.pig.datastorage.DataStorageElementDescriptor;

public interface DataStorageContainerDescriptor 
                 extends DataStorageElementDescriptor, 
                 Iterable<DataStorageElementDescriptor> {
}
}}}
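
A container, then, is simply an element that can also be iterated to yield 
the elements it contains. On a local file system this corresponds to listing 
a directory; the class below is a hypothetical analogue, not an actual 
implementation of the interface:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical local-filesystem analogue of DataStorageContainerDescriptor:
// a directory whose iterator yields the names of the entries it contains.
class LocalContainer implements Iterable<String> {
    private final File dir;

    LocalContainer(String name) { this.dir = new File(name); }

    public Iterator<String> iterator() {
        List<String> children = new ArrayList<String>();
        String[] names = dir.list();             // null if not a directory
        if (names != null)
            for (String n : names) children.add(n);
        return children.iterator();
    }
}
```

In a real back-end the iterator would yield `DataStorageElementDescriptor` 
instances rather than plain names, as the interface declaration requires.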
