Re: Writing click stream data to hadoop

2012-05-30 Thread Mohit Anchlia
On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).

 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.


Thanks Harsh. Does Flume also provide an API on top? I am getting this data
as HTTP calls; how would I go about using Flume with HTTP calls?


 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  We get click data through API calls. I now need to send this data to our
  hadoop environment. I am wondering if I could open one sequence file and
  write to it until it's of certain size. Once it's over the specified size
  I can close that file and open a new one. Is this a good approach?
 
  Only thing I worry about is what happens if the server crashes before I am
  able to cleanly close the file. Would I lose all previous data?



 --
 Harsh J



Re: Writing click stream data to hadoop

2012-05-30 Thread alo alt
I cc'd flume-u...@incubator.apache.org because I don't know if Mohit is
subscribed there.

Mohit,

You could use Avro to serialize the data and send it to a Flume Avro source,
or you could use syslog - both are supported in Flume 1.x.
http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html
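
For the HTTP case, a rough sketch: have the HTTP handler forward each click
record to the agent's Avro source through the Flume NG client SDK
(flume-ng-sdk). The class name, host, and port below are illustrative, and
you should verify the RpcClientFactory API is present in the exact 1.x
release you deploy:

    import java.nio.charset.Charset;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class ClickEventForwarder {
        // "flume-host" and 41414 are placeholders for the Avro source's bind/port.
        private final RpcClient client =
            RpcClientFactory.getDefaultInstance("flume-host", 41414);

        // Call this from the HTTP handler for each click record.
        public void send(String clickJson) throws EventDeliveryException {
            Event event = EventBuilder.withBody(clickJson, Charset.forName("UTF-8"));
            client.append(event); // blocks until the source has accepted the event
        }

        public void close() {
            client.close();
        }
    }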

An exec source is also possible; please note that Flume will only start/run
the command you configure and does not take control over the whole process.
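
On the agent side, a minimal configuration along these lines wires an Avro
source through a channel into an HDFS sink. The agent name, component names,
and path are made up for illustration, and the memory channel is shown only
for brevity (a durable channel gives stronger guarantees):

    agent1.sources = avro-src
    agent1.channels = mem-ch
    agent1.sinks = hdfs-sink

    # Avro source listening for events sent via the client SDK
    agent1.sources.avro-src.type = avro
    agent1.sources.avro-src.bind = 0.0.0.0
    agent1.sources.avro-src.port = 41414
    agent1.sources.avro-src.channels = mem-ch

    agent1.channels.mem-ch.type = memory

    # HDFS sink writing the incoming events out as SequenceFiles
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.channel = mem-ch
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/clickstream
    agent1.sinks.hdfs-sink.hdfs.fileType = SequenceFile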

- Alex 



--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 30, 2012, at 4:56 PM, Mohit Anchlia wrote:

 On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:
 
 Mohit,
 
 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).
 
 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.
 
 
 Thanks Harsh. Does Flume also provide an API on top? I am getting this data
 as HTTP calls; how would I go about using Flume with HTTP calls?
 
 
 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 We get click data through API calls. I now need to send this data to our
 hadoop environment. I am wondering if I could open one sequence file and
 write to it until it's of certain size. Once it's over the specified size
 I can close that file and open a new one. Is this a good approach?
 
 Only thing I worry about is what happens if the server crashes before I am
 able to cleanly close the file. Would I lose all previous data?
 
 
 
 --
 Harsh J
 



Re: Writing click stream data to hadoop

2012-05-30 Thread Luke Lu
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since
0.20.205); it calls the underlying FSDataOutputStream#sync, which is
really hflush semantically (data is not durable in case of a
data-center-wide power outage). The hsync implementation is not yet
in 2.0; HDFS-744 just brought hsync into trunk.
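
So in 1.0 a periodic-flush writer can look like the sketch below (the path,
key/value types, and flush interval are placeholders, not anything from this
thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ClickStreamWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/clickstream/part-0001.seq"),
                LongWritable.class, Text.class);

            for (int i = 0; i < 1000; i++) {
                writer.append(new LongWritable(System.currentTimeMillis()),
                              new Text("click-record-" + i));
                if (i % 100 == 0) {
                    // hflush semantics: records appended so far reach the
                    // datanodes and survive a writer crash, but this is not
                    // an fsync to the datanodes' disks.
                    writer.syncFs();
                }
            }
            writer.close();
        }
    }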

__Luke

On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).

 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.

 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 We get click data through API calls. I now need to send this data to our
 hadoop environment. I am wondering if I could open one sequence file and
 write to it until it's of certain size. Once it's over the specified size I
 can close that file and open a new one. Is this a good approach?

 Only thing I worry about is what happens if the server crashes before I am
 able to cleanly close the file. Would I lose all previous data?



 --
 Harsh J


Re: Writing click stream data to hadoop

2012-05-30 Thread Harsh J
Thanks for correcting me on the syncFs call there, Luke. I seem to
have missed that method when searching the branch-1 code.

On Thu, May 31, 2012 at 6:54 AM, Luke Lu l...@apache.org wrote:

 SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since
 0.20.205); it calls the underlying FSDataOutputStream#sync, which is
 really hflush semantically (data is not durable in case of a
 data-center-wide power outage). The hsync implementation is not yet
 in 2.0; HDFS-744 just brought hsync into trunk.

 __Luke

 On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:
  Mohit,
 
  Not if you call sync (or hflush/hsync in 2.0) periodically to persist
  your changes to the file. SequenceFile doesn't currently have a
  sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
  underlying output stream instead at the moment. This is possible to do
  in 1.0 (just own the output stream).
 
  Your use case also sounds like you may want to simply use Apache Flume
  (Incubating) [http://incubator.apache.org/flume/] that already does
  provide these features and the WAL-kinda reliability you seek.
 
  On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com 
  wrote:
  We get click data through API calls. I now need to send this data to our
  hadoop environment. I am wondering if I could open one sequence file and
  write to it until it's of certain size. Once it's over the specified size I
  can close that file and open a new one. Is this a good approach?
 
  Only thing I worry about is what happens if the server crashes before I am
  able to cleanly close the file. Would I lose all previous data?
 
 
 
  --
  Harsh J




--
Harsh J


Re: Writing click stream data to hadoop

2012-05-25 Thread Harsh J
Mohit,

Not if you call sync (or hflush/hsync in 2.0) periodically to persist
your changes to the file. SequenceFile doesn't currently have a
sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
underlying output stream instead at the moment. This is possible to do
in 1.0 (just own the output stream).
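
A rough sketch of that "own the output stream" variant against the 1.0 API
(the path, record types, and class name are purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class OwnedStreamWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Create the stream ourselves so we keep a handle to sync on.
            FSDataOutputStream out =
                fs.create(new Path("/clickstream/part-0002.seq"));
            SequenceFile.Writer writer = SequenceFile.createWriter(
                conf, out, LongWritable.class, Text.class,
                SequenceFile.CompressionType.NONE, null);

            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("example click record"));
            out.sync(); // hflush semantics in 1.0: push buffered data to the datanodes

            // The writer does not own a stream it was handed, so close both.
            writer.close();
            out.close();
        }
    }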

Your use case also sounds like you may want to simply use Apache Flume
(Incubating) [http://incubator.apache.org/flume/] that already does
provide these features and the WAL-kinda reliability you seek.

On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 We get click data through API calls. I now need to send this data to our
 hadoop environment. I am wondering if I could open one sequence file and
 write to it until it's of certain size. Once it's over the specified size I
 can close that file and open a new one. Is this a good approach?

 Only thing I worry about is what happens if the server crashes before I am
 able to cleanly close the file. Would I lose all previous data?



-- 
Harsh J