Mohit,

Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync API built into it (in 1.0 at least), but you can call sync on the underlying output stream yourself. This is possible in 1.0 (just create and own the output stream).
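To make that concrete, here's a rough, untested sketch against the 1.0 API. The path, the key/value types, and the every-1000-records flush interval are all just examples, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClickLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create the output stream ourselves so we keep a handle to it
    // (the path here is hypothetical).
    FSDataOutputStream out = fs.create(new Path("/clicks/part-00000.seq"));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.NONE, null);

    for (long i = 0; i < 100000; i++) {
      writer.append(new LongWritable(i), new Text("click-event-" + i));
      if (i % 1000 == 999) {
        // Persist what has been written so far; a crash after this
        // point loses at most the last, unsynced batch of records.
        out.sync(); // on 2.0, use out.hflush() or out.hsync() instead
      }
    }

    writer.close(); // close the writer, then the stream we own
    out.close();
  }
}

Note that syncing after every single record would hurt throughput; batching the syncs as above trades a small, bounded window of possible loss for speed.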
Your use case also sounds like you may simply want to use Apache Flume (Incubating) [http://incubator.apache.org/flume/], which already provides these features and the WAL-style reliability you're after.

On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> We get click data through API calls. I now need to send this data to our
> hadoop environment. I am wondering if I could open one sequence file and
> write to it until it's of a certain size. Once it's over the specified size
> I can close that file and open a new one. Is this a good approach?
>
> The only thing I worry about is what happens if the server crashes before
> I am able to cleanly close the file. Would I lose all previous data?

-- 
Harsh J