[
https://issues.apache.org/jira/browse/CAMEL-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244156#comment-14244156
]
Willem Jiang commented on CAMEL-8040:
-------------------------------------
Hi Josef,
I double-checked the Camel routes you have: you are using the default settings
of the hdfs and file endpoints. Unfortunately, the default setting of the file
endpoint does not work the way you expect.
If you take a look at the [HDFS2|https://camel.apache.org/hdfs2] documentation,
you can see that the chunkSize option says "When reading a normal file, this is
split into chunks producing a message per chunk." Because the file endpoint is
in its default Override mode, it treats each new chunk in the message body as a
complete file body, so it keeps overwriting the target file. If you want to get
the whole file, you need to set the file endpoint's fileExist option to Append.
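A minimal sketch of what the corrected consumer route could look like, assuming
the Java DSL and the endpoint URIs described in the reproducer (the host, port
and directory names are assumptions, not your actual code):
{code}
// inside a RouteBuilder.configure() method
from("hdfs2://localhost:8020/tmp/camel-test")
    // with the default chunkSize (4096 bytes) each chunk of the HDFS
    // file arrives as a separate Exchange; fileExist=Append makes the
    // file producer append it to the end of the target file instead of
    // replacing the file contents (the Override default)
    .to("file:test-dest?fileExist=Append");
{code}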
> camel-hdfs2 consumer overwriting data instead of appending them
> ---------------------------------------------------------------
>
> Key: CAMEL-8040
> URL: https://issues.apache.org/jira/browse/CAMEL-8040
> Project: Camel
> Issue Type: Bug
> Components: camel-hdfs
> Affects Versions: 2.13.0, 2.14.0
> Reporter: Josef Ludvíček
> Assignee: Willem Jiang
> Attachments: hdfs-reproducer.zip
>
>
> h1. camel-hdfs2 consumer overwriting data instead of appending them
> There is probably a bug in the camel-hdfs2 consumer.
> This project contains two Camel routes: one takes files from `test-source` and
> uploads them to Hadoop HDFS, the other watches a folder in Hadoop HDFS and
> downloads the files to the `test-dest` folder of this project (both routes are
> sketched below).
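> A minimal sketch of the two routes as described above, assuming the Java DSL
> (the actual code is in the attached reproducer, so the URIs, route id and log
> messages here are assumptions):
> {code}
> // Route 1: upload files dropped into test-source to HDFS
> from("file:test-source")
>     .to("hdfs2://localhost:8020/tmp/camel-test");
>
> // Route 2: watch the HDFS folder and download files to test-dest
> from("hdfs2://localhost:8020/tmp/camel-test").routeId("toFile")
>     .log("picked up file from hdfs with name ${header.CamelFileName}")
>     .to("file:test-dest")
>     .log("file downloaded from hadoop");
> {code}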
> It seems that when downloading a file from HDFS to the local filesystem, the
> consumer keeps writing chunks of data to the beginning of the target file in
> `test-dest`, instead of simply appending the chunks, as I would expect.
> From the Camel log I suppose that each chunk of data from the Hadoop file is
> treated as if it were a whole file.
> The Ruby script `generate_textfile.rb` generates the file `test.txt` with content:
> {code}
> 0 - line
> 1 - line
> 2 - line
> 3 - line
> 4 - line
> 5 - line
> ...
> ...
> 99999 - line
> {code}
> h2. Scenario
> - _expects a running Hadoop instance on localhost:8020_
> - run `mvn camel:run`
> - copy `test.txt` into `test-source`
> - watch the log and the file `test.txt` in `test-dest`
> - `test.txt` in the `test-dest` folder ends up containing only the last x lines
> of the original one.
>
>
> Camel log:
> {code}
> [localhost:8020/tmp/camel-test/] toFile INFO picked up file from hdfs with name test.txt
> [localhost:8020/tmp/camel-test/] toFile INFO file downloaded from hadoop
> [localhost:8020/tmp/camel-test/] toFile INFO picked up file from hdfs with name test.txt
> [localhost:8020/tmp/camel-test/] toFile INFO file downloaded from hadoop
> [localhost:8020/tmp/camel-test/] toFile INFO picked up file from hdfs with name test.txt
> [localhost:8020/tmp/camel-test/] toFile INFO file downloaded from hadoop
> [localhost:8020/tmp/camel-test/] toFile INFO picked up file from hdfs with name test.txt
> [localhost:8020/tmp/camel-test/] toFile INFO file downloaded from hadoop
> {code}
>
> h2. Environment
> * Camel 2.14 and 2.13
> * Hadoop VirtualBox VM
> * * downloaded from
> http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-2-x.html
> * * tested with version 2.3.0-cdh5.1.0,
> r8e266e052e423af592871e2dfe09d54c03f6a0e8, which I couldn't find on the
> download page
> * Hadoop Docker image
> * * https://github.com/sequenceiq/hadoop-docker
> * * results were the same as with the VirtualBox VM
> In the case of the VirtualBox VM, HDFS binds to
> `hdfs://quickstart.cloudera:8020` by default, and this needs to be changed in
> `/etc/hadoop/conf/core-site.xml`. It works when `fs.defaultFS` is set to
> `hdfs://0.0.0.0:8020`, as sketched below.
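> A sketch of the relevant property in `/etc/hadoop/conf/core-site.xml` (only the
> `fs.defaultFS` entry is shown; the rest of the file stays as shipped):
> {code}
> <property>
>   <name>fs.defaultFS</name>
>   <value>hdfs://0.0.0.0:8020</value>
> </property>
> {code}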
> In the case of the Docker Hadoop image, first start the Docker container,
> figure out its IP address, and use it for the Camel HDFS component. Here the
> Camel URI would be `hdfs:172.17.0.2:9000/tmp/camel-test`.
> {code}
> docker run -i -t sequenceiq/hadoop-docker:2.5.1 /etc/bootstrap.sh -bash
> Starting sshd: [ OK ]
> Starting namenodes on [966476255fc2]
> 966476255fc2: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-966476255fc2.out
> localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-966476255fc2.out
> Starting secondary namenodes [0.0.0.0]
> 0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-966476255fc2.out
> starting yarn daemons
> starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-966476255fc2.out
> localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-966476255fc2.out
> {code}
> See which IP the HDFS filesystem API is bound to inside the Docker container:
> {code}
> bash-4.1# netstat -tulnp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address      Foreign Address     State    PID/Program name
> ...
> tcp        0      0 172.17.0.2:9000    0.0.0.0:*           LISTEN   -
> ...
> {code}
> There might be an exception because of HDFS permissions. It can be solved by
> setting the HDFS filesystem permissions:
> {code}
> bash-4.1# /usr/local/hadoop/bin/hdfs dfs -chmod 777 /
> {code}