[ https://issues.apache.org/jira/browse/CHUKWA-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Fathalla updated CHUKWA-4:
--------------------------------

    Attachment: CHUKWA-4.patch

This patch contains a fix for corrupt sink files created locally. I've created 
a new class, CopySequenceFile, which copies the contents of a corrupt .chukwa 
file into a valid .done file.
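
For reference, the copy step has roughly the shape of the sketch below. This is 
only an illustrative outline, not the code in the attached patch: the class and 
method names and the generic Writable handling via ReflectionUtils are 
placeholders (the real class may well use Chukwa's own key and chunk types 
directly).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CopySequenceFileSketch {

  /** Copies every readable record from a local .chukwa file into a fresh .done file. */
  public static void copyToDone(Configuration conf, Path chukwaFile, Path doneFile)
      throws IOException {
    FileSystem fs = FileSystem.getLocal(conf);
    SequenceFile.Reader reader = null;
    SequenceFile.Writer writer = null;
    try {
      reader = new SequenceFile.Reader(fs, chukwaFile, conf);
      writer = SequenceFile.createWriter(fs, conf, doneFile,
          reader.getKeyClass(), reader.getValueClass(),
          SequenceFile.CompressionType.NONE);
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      // Copy record by record; next() returns false at a clean end of file.
      while (reader.next(key, value)) {
        writer.append(key, value);
      }
    } finally {
      if (reader != null) { reader.close(); }
      if (writer != null) { writer.close(); }
    }
    // Only drop the corrupt .chukwa file once the .done copy is safely on disk.
    fs.delete(chukwaFile, false);
  }
}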

The code for recovering a failed copy attempt is included in the cleanup() 
method of LocalToRemoteHdfsMover and follows Jerome's suggestions. I have also 
created a unit test that creates a sink file, converts it into a .done file, and 
validates that the .done file was created and the .chukwa file was removed.
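
The test is roughly shaped like the following sketch (again illustrative only; 
the class name, method name, and the Text key/value types stand in for whatever 
the actual test in the patch uses):

import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TestCopySequenceFileSketch extends TestCase {

  public void testChukwaFileIsConvertedToDone() throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    String tmp = System.getProperty("java.io.tmpdir");
    Path chukwaFile = new Path(tmp, "sink.chukwa");
    Path doneFile = new Path(tmp, "sink.done");

    // Write a tiny sink file to stand in for what the collector leaves behind.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, chukwaFile, Text.class, Text.class);
    writer.append(new Text("key"), new Text("value"));
    writer.close();

    // Run the recovery step (hypothetical entry point from the sketch above).
    CopySequenceFileSketch.copyToDone(conf, chukwaFile, doneFile);

    assertTrue(".done file should exist after recovery", fs.exists(doneFile));
    assertFalse(".chukwa file should have been removed", fs.exists(chukwaFile));
  }
}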

I have tested this solution several times and it seems to be working. However, 
I have hit a rare case where recovery fails with the following exception while 
reading from the .chukwa file and writing to the .done file:


2010-04-12 07:56:47,538 WARN LocalToRemoteHdfsMover CopySequenceFile - Error during .chukwa file recovery
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at org.apache.hadoop.chukwa.util.CopySequenceFile.createValidSequenceFile(CopySequenceFile.java:80)
        at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.cleanup(LocalToRemoteHdfsMover.java:185)
        at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.run(LocalToRemoteHdfsMover.java:215)


This seems to happen when recovering from a .chukwa file that was created just 
before the collector crashed (the .chukwa file was ~200 KB), so my guess is that 
the file contains no actual data and should simply be removed. I would 
appreciate it if you could point out how we should deal with this situation.
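
One possible way to tolerate this, sketched below under the same assumptions as 
the earlier outline (it reuses the reader/writer/fs/path variables and a log4j 
Logger named log, none of which is the patch's actual code): catch the 
EOFException around the copy loop, keep whatever complete records were read, 
and if nothing at all was readable, delete the .chukwa file instead of failing 
recovery. Whether silently dropping such a file is safe is exactly the question 
above.

// Hypothetical hardening of the copy loop in the sketch above; assumes the same
// reader, writer, fs, chukwaFile and doneFile variables plus a log4j Logger "log".
long copied = 0;
try {
  while (reader.next(key, value)) {
    writer.append(key, value);
    copied++;
  }
} catch (java.io.EOFException e) {
  // The collector died mid-record: keep whatever complete records were copied.
  log.warn("Hit EOF while recovering " + chukwaFile + " after " + copied + " records", e);
}

// ...after the reader and writer have been closed:
if (copied == 0) {
  // Nothing recoverable was read; drop the empty .done file as well.
  fs.delete(doneFile, false);
}
fs.delete(chukwaFile, false);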

> Collectors don't finish writing .done datasink from last .chukwa datasink 
> when stopped using bin/stop-collectors
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CHUKWA-4
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-4
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>         Environment: I am running on our local cluster, a Linux machine that I 
> also run a Hadoop cluster from.
>            Reporter: Andy Konwinski
>            Priority: Minor
>         Attachments: CHUKWA-4.patch
>
>
> When I use start-collectors, it creates the datasink as expected and writes to 
> it as normal (i.e. writes to the .chukwa file), and rollovers work fine when 
> it renames the .chukwa file to .done. However, when I use bin/stop-collectors 
> to shut down the running collector, it leaves a .chukwa file in the HDFS file 
> system. I'm not sure whether this is a valid sink or not, but I think the 
> collector should gracefully clean up the datasink and rename it to .done 
> before exiting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
