[ https://issues.apache.org/jira/browse/CHUKWA-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmed Fathalla updated CHUKWA-4:
--------------------------------
Attachment: CHUKWA-4.patch
This patch contains a fix for corrupt sink files created locally. I've created
a new class, CopySequenceFile, which copies the corrupt .chukwa file into a
valid .done file.
The code for recovering from a failed copy attempt is included in the cleanup()
method of LocalToRemoteHdfsMover and follows Jerome's suggestions. I have also
created a unit test that creates a sink file, converts it into a .done file,
and validates that the .done file was created and the .chukwa file removed.
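For context, here is a minimal sketch of what such a copy routine might look like against the Hadoop SequenceFile API. The class and method names mirror the ones visible in the stack trace below (CopySequenceFile.createValidSequenceFile), but the body is an illustrative reconstruction, not the patch code itself:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CopySequenceFile {
  /**
   * Re-writes a leftover .chukwa sink file as a properly closed .done
   * sequence file, then deletes the original.
   */
  public static void createValidSequenceFile(Configuration conf,
      FileSystem fs, Path chukwaFile) throws IOException {
    // foo.chukwa -> foo.done in the same directory
    String doneName = chukwaFile.getName().replace(".chukwa", ".done");
    Path doneFile = new Path(chukwaFile.getParent(), doneName);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chukwaFile, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, doneFile,
        reader.getKeyClass(), reader.getValueClass());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    // Copy record by record; next() returns false at a clean end-of-file.
    while (reader.next(key, value)) {
      writer.append(key, value);
    }

    writer.close();
    reader.close();
    fs.delete(chukwaFile, false);
  }
}
{code}
A unit test along the lines described above would write a few records into a temporary .chukwa file, call createValidSequenceFile(), and then assert that the .done file exists and the .chukwa file is gone.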
I have tested this solution several times and it appears to work. However, I
have hit a rare case where recovery fails with the following exception while
reading from the .chukwa file and writing to the .done file:
2010-04-12 07:56:47,538 WARN LocalToRemoteHdfsMover CopySequenceFile - Error during .chukwa file recovery
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
    at org.apache.hadoop.chukwa.util.CopySequenceFile.createValidSequenceFile(CopySequenceFile.java:80)
    at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.cleanup(LocalToRemoteHdfsMover.java:185)
    at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.run(LocalToRemoteHdfsMover.java:215)
This seemed to happen when recovering from a .chukwa file that had been created
just before the collector crashed (the file was only ~200KB), so my guess is
that the file contains no complete records and should simply be removed. I
would appreciate it if you could point out how we should deal with this
situation.
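One possible way to handle this (just a sketch of the idea, not code from the patch; the class name TruncatedSinkSalvage is hypothetical): catch the EOFException inside the copy loop, keep whatever complete records were already copied, and if nothing at all was copied, treat the sink as empty and delete it along with the .done stub. Something like:
{code}
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.log4j.Logger;

public class TruncatedSinkSalvage {
  static Logger log = Logger.getLogger(TruncatedSinkSalvage.class);

  /**
   * Copies complete records from a possibly truncated .chukwa file into a
   * .done file. Returns the number of records salvaged; 0 means the sink
   * held no usable data, in which case both files are removed.
   */
  public static long salvage(Configuration conf, FileSystem fs,
      Path chukwaFile, Path doneFile) throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chukwaFile, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, doneFile,
        reader.getKeyClass(), reader.getValueClass());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long copied = 0;
    try {
      while (reader.next(key, value)) {
        writer.append(key, value);
        copied++;
      }
    } catch (EOFException e) {
      // The file ends mid-record; keep whatever was copied so far.
      log.warn("Hit EOF after " + copied + " records in " + chukwaFile, e);
    } finally {
      writer.close();
      reader.close();
    }

    if (copied == 0) {
      // No complete records (e.g. the collector died right after creating
      // the file): the empty .done file is useless, so discard it too.
      fs.delete(doneFile, false);
    }
    fs.delete(chukwaFile, false);
    return copied;
  }
}
{code}
Note that if the file is truncated inside the SequenceFile header, the Reader constructor itself can fail, so a caller would likely want to treat that case as "no usable data" as well.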
> Collectors don't finish writing .done datasink from last .chukwa datasink when stopped using bin/stop-collectors
> ----------------------------------------------------------------------------------------------------------------
>
> Key: CHUKWA-4
> URL: https://issues.apache.org/jira/browse/CHUKWA-4
> Project: Hadoop Chukwa
> Issue Type: Bug
> Components: data collection
> Environment: I am running on our local cluster. This is a Linux
> machine that I also run a Hadoop cluster from.
> Reporter: Andy Konwinski
> Priority: Minor
> Attachments: CHUKWA-4.patch
>
>
> When I use start-collectors, it creates the datasink as expected and writes
> to it as per normal, i.e. writes to the .chukwa file, and rollovers work fine
> when it renames the .chukwa file to .done. However, when I use
> bin/stop-collectors to shut down the running collector, it leaves a .chukwa
> file in the HDFS file system. I am not sure whether this is a valid sink, but
> I think the collector should gracefully clean up the datasink and rename it
> to .done before exiting.