[
https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gil Vernik updated MAPREDUCE-6854:
----------------------------------
Description:
Consider an example: a local file "/data/a.txt" needs to be copied into
swift://container.service/data/a.txt
The way distcp works is that first it will upload "/data/a.txt" into
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
Upon completion distcp will move
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
into swift://container.mil01/data/a.txt
************************************
The temporary file naming convention assumes that each map task will
sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID
and then rename them to their final names. Most Hadoop ecosystem components
make the object name part of the temporary name; distcp, however, does not
take this approach.
This JIRA proposes adding a configuration key indicating that temporary
objects should also include the object name as part of their temporary file
name. For example, "/data/a.txt" will be uploaded into
"swift://container.mil01/data/a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"
The name "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" does not
affect flows in the access drivers: "a.txt" is not treated as a sub-directory,
so no special operations are performed. The benefit is that other systems can
expect a name of the form
"a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" and extract the
value preceding ".distcp.tmp".
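The naming scheme described above can be illustrated with a minimal Java
sketch. It is not the actual distcp code: the class TempNameSketch, its
methods, and the boolean includeObjectName (standing in for the proposed
configuration key, which this issue does not name) are all hypothetical. Only
the ".distcp.tmp." marker and the attempt ID come from the example above.

```java
// Illustrative sketch only; not the actual distcp implementation.
final class TempNameSketch {
    // Marker taken from the example object names in this issue.
    static final String TMP_MARKER = ".distcp.tmp.";

    /**
     * Builds the temporary object name for one map-task attempt.
     * includeObjectName is a hypothetical stand-in for the proposed
     * configuration key.
     */
    static String tempName(String objectName, String attemptId,
                           boolean includeObjectName) {
        String prefix = includeObjectName ? objectName : "";
        return prefix + TMP_MARKER + attemptId;
    }

    /**
     * Returns the object name that precedes the marker, or null when the
     * temporary name carries no object name (the current behaviour).
     */
    static String extractObjectName(String tempName) {
        // Search from the end so an object name that itself contains the
        // marker is still returned whole.
        int idx = tempName.lastIndexOf(TMP_MARKER);
        return idx > 0 ? tempName.substring(0, idx) : null;
    }

    public static void main(String[] args) {
        String attempt = "attempt_local2036034928_0001_m_000000_0";
        System.out.println(tempName("a.txt", attempt, false));
        System.out.println(tempName("a.txt", attempt, true));
        System.out.println(extractObjectName(tempName("a.txt", attempt, true)));
    }
}
```

With the flag off, the name matches the current ".distcp.tmp.attempt_ID" form
and no object name can be recovered from it; with the flag on, a consumer can
recover "a.txt" from the temporary name alone.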
was:
Consider an example: a local file "/data/a.txt" needs to be copied into
swift://container.service/data/a.txt
The way distcp works is that first it will upload "/data/a.txt" into
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
Upon completion distcp will move
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
into swift://container.mil01/data/a.txt
************************************
The temporary file naming convention assumes that each map task will
sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID
and then rename them to their final names. Most Hadoop ecosystem components
make the object name part of the temporary name; distcp, however, does not
take this approach.
This JIRA proposes adding a configuration key indicating that temporary
objects should also include the object name as part of their temporary file
name. For example, "/data/a.txt" will be uploaded into
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0/a.txt"
or
"swift://container.mil01/data/a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"
The name "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" does not
affect flows in the drivers: "a.txt" is not treated as a sub-directory, so no
special operations are performed. The benefit is that other systems can
expect a name of the form
"a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" and extract the
value preceding ".distcp.tmp".
> Each map task should create a unique temporary name that includes an object
> name
> --------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6854
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: distcp
> Reporter: Gil Vernik