1. APIs that are deprecated in an x.y release can be removed in the
(x+1).0 release (a sketch of this deprecation cycle follows this list).

2. Old 1.x clients can connect to new 1.y servers, where x <= y, but
the old clients might get reduced functionality or performance. 1.x
clients might not be able to connect to 2.z servers.

3. HDFS disk format can change from a 1.x to a 1.y release and is
transparent to user applications. A cluster rolling back from 1.y to
1.x will revert to the old disk format.
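
To make item 1 concrete, here is a minimal sketch of how such a
deprecation cycle might look in code. The ClusterClient class, its
open() methods, and the port constant are made up for illustration;
they are not any real Hadoop API.

    /** Hypothetical API class, only to illustrate the deprecation cycle in item 1. */
    public class ClusterClient {

      private static final int DEFAULT_PORT = 8020;

      /**
       * Old entry point, deprecated in some 1.x release. Per item 1 it must
       * keep working for the rest of the 1.* line and may only be removed in 2.0.
       *
       * @deprecated use {@link #open(String, int)} instead
       */
      @Deprecated
      public Connection open(String host) {
        return open(host, DEFAULT_PORT);   // delegate to the replacement method
      }

      /** Replacement entry point introduced in the same 1.x release. */
      public Connection open(String host, int port) {
        return new Connection(host, port);
      }

      /** Minimal placeholder so the sketch compiles on its own. */
      public static class Connection {
        final String host;
        final int port;
        Connection(String host, int port) { this.host = host; this.port = port; }
      }
    }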

>  * In a major release transition [ ie from a release x.y to a release
> (x+1).0], a user should be able to read data from the cluster running the
> old version.

I think this is a good requirement to have. It will be very useful
when we run multiple clusters, especially across data centers
(HADOOP-4058 is one use case).

thanks,
dhruba

> --------
> What does Hadoop 1.0 mean?
>    * Standard release numbering: Only bug fixes in 1.x.y releases and new
> features in 1.x.0 releases.
>    * No need for client recompilation when upgrading from 1.x to 1.y, where
> x <= y
>          o  Can't remove deprecated classes or methods until 2.0
>     * Old 1.x clients can connect to new 1.y servers, where x <= y
>    * New FileSystem clients must be able to call old methods when talking to
> old servers. This will generally be done by having old methods continue to
> use old RPC methods. However, it is legal to have new implementations of old
> methods call new RPC methods, as long as the library transparently handles
> the fallback case for old servers.
> -----------------
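
As a rough illustration of the FileSystem fallback described in the
last quoted bullet, the sketch below shows one way a client library
might try a newer RPC and transparently downgrade for old servers. The
NamenodeProtocol interface, the listStatusBatched/listStatus method
names, and the use of UnsupportedOperationException as the "old
server" signal are all invented for this sketch; they are not the real
ClientProtocol API.

    import java.io.IOException;

    /** Hypothetical client wrapper; the protocol and method names are invented. */
    public class CompatibleClient {

      /** Stand-in for the RPC proxy interface, not the real ClientProtocol. */
      interface NamenodeProtocol {
        FileStatus[] listStatusBatched(String path) throws IOException; // newer RPC
        FileStatus[] listStatus(String path) throws IOException;        // older RPC
      }

      /** Minimal placeholder type so the sketch stands alone. */
      public static class FileStatus {}

      private final NamenodeProtocol namenode;
      private volatile boolean serverHasBatchedListing = true;

      CompatibleClient(NamenodeProtocol namenode) {
        this.namenode = namenode;
      }

      /**
       * Old public method, re-implemented on top of the newer RPC. If the
       * server predates that RPC, fall back to the old RPC transparently,
       * as the quoted bullet allows.
       */
      public FileStatus[] listStatus(String path) throws IOException {
        if (serverHasBatchedListing) {
          try {
            return namenode.listStatusBatched(path);
          } catch (UnsupportedOperationException e) {
            // Older server: remember the downgrade and fall through to the old RPC.
            serverHasBatchedListing = false;
          }
        }
        return namenode.listStatus(path);
      }
    }

A real client would more likely key the fallback off the server's
advertised version or a specific RemoteException, but the shape (new
method, transparent downgrade for old servers) is what the bullet
describes.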
>
> A couple of  additional compatibility requirements:
>
> * HDFS metadata and data are preserved across release changes, both major
> and minor. That is, whenever a release is upgraded, the HDFS metadata from
> the old release will be converted automatically as needed.
>
> The above has been followed so far in Hadoop; I am just documenting it in
> the 1.0 requirements list.
>
>  * In a major release transition [ ie from a release x.y to a release
> (x+1).0], a user should be able to read data from the cluster running the
> old version.  (OR shall we generalize this to: from x.y to (x+i).z ?)
>
> The motivation: data copying across clusters is a common operation for many
> customers (for example, this is routinely done at Yahoo). Today, http (or
> hftp) provides a guaranteed-compatible way of copying data across versions.
> Clearly one cannot force a customer to simultaneously upgrade all of its
> Hadoop clusters to a new major release. The above documents this
> requirement; we can satisfy it via the http/hftp mechanism or some other
> mechanism.
>
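
To make the http/hftp path above concrete, here is a minimal sketch of
reading a file from a remote (possibly older-versioned) cluster through
the FileSystem API. The host name old-cluster-nn and the file path are
placeholders, and 50070 is assumed to be the remote NameNode's default
web port.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CrossVersionRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // hftp is a read-only FileSystem that goes through the remote
        // NameNode's HTTP interface, so it does not depend on RPC wire
        // compatibility. "old-cluster-nn" is a placeholder host; 50070 is
        // the default NameNode web port.
        FileSystem remote = FileSystem.get(
            URI.create("hftp://old-cluster-nn:50070/"), conf);

        try (InputStream in = remote.open(new Path("/data/part-00000"))) {
          IOUtils.copyBytes(in, System.out, conf, false);
        }
      }
    }

The bulk copy itself would usually be driven by distcp with an hftp://
source URI, which rides on the same mechanism.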
> Question: is one willing to break applications that operate across
> clusters (i.e., an application that accesses data across clusters that
> cross a major release boundary)? I asked the operations team at Yahoo that
> runs our Hadoop clusters. We currently do not have any applications that
> access data across clusters as part of an MR job. The reason is that Hadoop
> routinely breaks wire compatibility across releases, and so such apps would
> be very unreliable. However, copying data across clusters is crucial and
> needs to be supported.
>
> Shall we add a stronger requirement for 1.0: wire compatibility across
> major versions? This can be supported by class loading or other games. Note
> that we can wait to provide this until 2.0 happens. If Hadoop provided this
> guarantee, it would allow customers to partition their data across clusters
> without risking apps breaking across major releases due to wire
> incompatibility issues.
>
