Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137480178
  
    Hi Max,
    
    Take a look at the distcp utility 
(http://hadoop.apache.org/docs/r1.2.1/distcp.html). Its purpose is to copy 
large numbers of files within one cluster or between clusters. In local mode 
the tool also works with the local FS, whereas in distributed mode only HDFS 
paths are supposed to be used. I ran a simple benchmark copying 800GB of 
data within one cluster, running Hadoop distcp (using the default distcp input 
format) and Flink distcp in parallel. The Flink job was 1.5 minutes faster (it 
took approximately 35 minutes in our setup).
    
    Slava
    
    > On 03 Sep 2015, at 17:00, Max <notificati...@github.com> wrote:
    > 
    > Thanks for your pull request! I'm assuming you would use this utility to 
copy files from your local to a remote file system, right? Your utility starts 
a Flink job to copy the files to the remote file systems. This only works if 
you execute it locally because otherwise the task managers need to have the 
files available and that might defeat the utility's purpose. Also, imagine 
someone embedding the tool in a Flink program. The person might wonder why 
his/her program actually executes two jobs (one for the utility, one for the 
actual job).
    > 
    > I think this would be more useful as a utility function, e.g. in a 
FileUtils class in flink-core. The method there would receive a list of files 
and then upload the files like you did using Flink's FileSystem abstraction. We 
could still parallelize the method by starting multiple threads to upload the 
files.
    > 
    > Correct me if I'm wrong or have misunderstood your pull request :)
    > 
    > —
    > Reply to this email directly or view it on GitHub 
<https://github.com/apache/flink/pull/1090#issuecomment-137477152>.
    > 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
