Mark Christiaens created HDFS-13604:
---------------------------------------
Summary: DistCp filtering conflicts with snapshotting
Key: HDFS-13604
URL: https://issues.apache.org/jira/browse/HDFS-13604
Project: Hadoop HDFS
Issue Type: Bug
Components: distcp
Reporter: Mark Christiaens
DistCp has an option ({{-filters}}) to exclude from the copy any file whose
path matches one of the patterns listed in a filters file. DistCp also has
options ({{-update -diff}}) that optimize incremental copying based on
snapshots present at the source and target locations. When both are enabled,
files that should be copied from source to target are missing on the target.
To reproduce the issue:
* Create two directories, {{source}} and {{target}}.
* In {{source}}, put two files, {{A}} and {{B}}, with some random content.
* Create a filter file whose pattern matches {{A}}, so that {{A}} is excluded
from the copy.
* Create a snapshot, {{snapshot_old}}, of the {{source}} directory.
* Use {{distcp}} to copy the content of {{source}} to {{target}}.
* As expected, the {{target}} directory will contain only file {{B}}. {{A}}
is filtered.
* Take a snapshot of the {{target}} directory, also named {{snapshot_old}}.
* In the {{source}} directory, rename {{A}} to {{C}}.
* Take a new snapshot of the source directory, {{snapshot_new}}.
* Now, perform an incremental {{distcp}} copy using the created snapshots so
as to optimize the incremental copy process: {{distcp -update -filters
filters.txt -diff snapshot_old snapshot_new ... ...}}
* You will find that the renamed file {{C}} is not copied to the {{target}}
directory.
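The steps above can be sketched as a command sequence. The snippet below only builds and prints the commands; it does not run them. The base path, the local files {{A}} and {{B}}, the {{allowSnapshot}} calls (needed before any snapshot can be taken), and the use of {{-update}} on the first copy are assumptions made for illustration, since the steps above do not spell them out. {{filters.txt}} is assumed to be a local file holding a single regular expression such as {{.*/A}}.

```python
# Builds (does not execute) the reproduction steps as shell commands.
# Assumptions: the base path, local files A and B, the allowSnapshot calls,
# and -update on the first copy are illustrative, not taken from the report.

def repro_commands(base="/tmp/HDFS-13604", filters="filters.txt"):
    src, dst = f"{base}/source", f"{base}/target"
    return [
        f"hdfs dfs -mkdir -p {src} {dst}",
        f"hdfs dfs -put A B {src}",
        f"hdfs dfsadmin -allowSnapshot {src}",
        f"hdfs dfsadmin -allowSnapshot {dst}",
        f"hdfs dfs -createSnapshot {src} snapshot_old",
        # Initial filtered copy: only B should arrive on the target.
        f"hadoop distcp -update -filters {filters} {src} {dst}",
        f"hdfs dfs -createSnapshot {dst} snapshot_old",
        f"hdfs dfs -mv {src}/A {src}/C",
        f"hdfs dfs -createSnapshot {src} snapshot_new",
        # Incremental copy that combines -diff with -filters.
        f"hadoop distcp -update -filters {filters} "
        f"-diff snapshot_old snapshot_new {src} {dst}",
        # Inspect the result: per this report, C is missing here.
        f"hdfs dfs -ls {dst}",
    ]

if __name__ == "__main__":
    print("\n".join(repro_commands()))
```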
I suspect that the reason for this is that {{distcp}} concludes from analyzing
the difference between {{snapshot_old}} and {{snapshot_new}} that {{A}} was
renamed to {{C}}. This can be confirmed by using {{snapshotDiff}} to compare
the two snapshots: it reports that {{A}} has been renamed to {{C}}.
{{distcp}} then seems to assume that the data for {{C}} is already present in
the {{target}} directory and only needs to be renamed. However, due to the
filtering, {{A}} is _not_ present on the target and cannot be renamed to
{{C}}.
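To make the suspected mechanism concrete, here is a toy in-memory model of it. It is not DistCp code: the function names, the dictionary-based "file system", and the tuple format of the diff are all hypothetical, and the model encodes only the hypothesis described above (renames are replayed on the target, and only created or modified files are copied).

```python
import re

def is_filtered(name, filters):
    """True if the name fully matches any filter pattern (toy stand-in)."""
    return any(re.fullmatch(p, name) for p in filters)

def initial_copy(source, filters):
    """Full copy that skips filtered files."""
    return {n: d for n, d in source.items() if not is_filtered(n, filters)}

def diff_sync(target, new_source, diff, filters):
    """Hypothesized -diff behavior: replay renames, then copy new/changed files."""
    target = dict(target)
    for op, old, new in diff:
        if op == "rename" and old in target:
            target[new] = target.pop(old)
        # If `old` is absent on the target, nothing happens and no error is raised.
    for op, name, _ in diff:
        if op in ("create", "modify") and not is_filtered(name, filters):
            target[name] = new_source[name]
    return target

filters = ["A"]
source_old = {"A": "a-data", "B": "b-data"}
target = initial_copy(source_old, filters)      # only B is copied

source_new = {"C": "a-data", "B": "b-data"}     # A was renamed to C
diff = [("rename", "A", "C")]                   # what the snapshot diff reports
result = diff_sync(target, source_new, diff, filters)

print(sorted(result))  # prints ['B']; C never reaches the target
```

In this model the rename is the only diff entry, so the copy phase has nothing to do, and the rename itself is silently skipped because {{A}} was never on the target.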
Although the final {{distcp}} fails to create a copy of the {{C}} file in the
{{target}} directory, {{distcp}} does not report any failure, nor can I find
any trace of errors in the job logs of the jobs created by {{distcp}} to
execute the actual copy.
So, some options:
* Combining {{-diff}} and {{-filters}} could be disallowed.
* {{distcp}} could assume that files that have been filtered are _not_ present
and should be replicated in ordinary fashion.
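Both options can be sketched roughly as follows. This is a hypothetical illustration, not a patch against DistCp: the function names, the error message, and the tuple format of the diff are invented for the example.

```python
import re

def is_filtered(name, filters):
    """True if the name fully matches any filter pattern (toy stand-in)."""
    return any(re.fullmatch(p, name) for p in filters)

def check_options(use_diff, filters):
    """Option 1: refuse the combination outright (hypothetical message)."""
    if use_diff and filters:
        raise ValueError("-diff cannot be combined with -filters")

def plan_sync(diff, filters):
    """Option 2: treat a rename of a filtered file as an ordinary copy.

    Returns (renames to replay on the target, files to copy from the source).
    The reverse case, an unfiltered file renamed to a filtered name, is not
    handled in this sketch.
    """
    renames, copies = [], []
    for op, old, new in diff:
        if op == "rename":
            if is_filtered(old, filters):
                # The old name was never copied, so the data is not on the target.
                if not is_filtered(new, filters):
                    copies.append(new)
            else:
                renames.append((old, new))
        elif op in ("create", "modify") and not is_filtered(old, filters):
            copies.append(old)
    return renames, copies

print(plan_sync([("rename", "A", "C")], ["A"]))  # prints ([], ['C'])
```

With the second option, the rename of {{A}} to {{C}} from the scenario above becomes a plain copy of {{C}}, while renames of unfiltered files are still replayed on the target.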