[ https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar updated HIVE-15093:
--------------------------------
Attachment: HIVE-15093.6.patch
> For S3-to-S3 renames, files should be moved individually rather than at a
> directory level
> -----------------------------------------------------------------------------------------
>
> Key: HIVE-15093
> URL: https://issues.apache.org/jira/browse/HIVE-15093
> Project: Hive
> Issue Type: Sub-task
> Components: Hive
> Affects Versions: 2.1.0
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Attachments: HIVE-15093.1.patch, HIVE-15093.2.patch,
> HIVE-15093.3.patch, HIVE-15093.4.patch, HIVE-15093.5.patch, HIVE-15093.6.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within
> distributed filesystems as well as blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files are
> moved one by one using a threadpool of workers.
> 2: If the source path is not a subdirectory of the destination path, a single
> rename operation is used to move the entire directory.
> The second option may not work well on blobstores such as S3, where renames
> are not metadata operations and require copying all the data. Client
> connectors to blobstores may not rename directories efficiently. In the worst
> case, the connector copies each file one by one, sequentially, rather than
> using a threadpool of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but it
> is only used in case 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying
> within a blobstore. The focus is on copies within a blobstore because
> needToCopy returns true if the source and target filesystems are different,
> in which case a different code path is triggered.
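
For illustration only, the idea described above can be sketched roughly as
follows. This is not the code in the attached patch and it does not use Hive's
or Hadoop's APIs: java.nio.file.Files.move stands in for a filesystem rename,
and the class name, method names and the list of blobstore schemes are all
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PerFileMoveSketch {

    // Hypothetical scheme list; not taken from Hive's configuration.
    static final Set<String> BLOBSTORE_SCHEMES = Set.of("s3", "s3a", "s3n");

    static boolean isBlobstoreScheme(String scheme) {
        return scheme != null && BLOBSTORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // Case 1 applies when the source is under the destination or,
    // with the proposed change, when the move stays within a blobstore.
    static boolean moveFilesIndividually(boolean srcIsSubDirOfDest, String scheme) {
        return srcIsSubDirOfDest || isBlobstoreScheme(scheme);
    }

    // Moves each regular file directly under srcDir into destDir using a
    // fixed-size pool of workers, and returns the number of files moved.
    static int moveFilesInParallel(Path srcDir, Path destDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(destDir);
        List<Path> files;
        try (Stream<Path> listing = Files.list(srcDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path file : files) {
                // One task per file; Files.move stands in for a rename call.
                pending.add(pool.submit(() -> Files.move(file, destDir.resolve(file.getFileName()))));
            }
            for (Future<Path> result : pending) {
                result.get(); // rethrows any per-file failure
            }
        } finally {
            pool.shutdown();
        }
        return files.size();
    }
}
```

On a store where a rename implies a full copy, issuing one task per file lets
the copies overlap instead of running back to back inside a single directory
rename.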