Re: Ideal file size

2012-06-07 Thread M. C. Srivas
On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 To begin with, I was going to use Flume and specify a rollover file size. I
 understand the above parameters; I just want to ensure that too many small
 files don't cause problems on the NameNode. For instance, there would be
 times when we get GBs of data in an hour and at other times only a few
 hundred MB. From what Harsh, Edward and you've described, it doesn't cause
 issues with the NameNode, but rather an increase in processing time if there
 are too many small files. Looks like I need to find that balance.

 It would also be interesting to see how others solve this problem when not
 using Flume.



They use NFS with MapR.

Any and all log rotators (like the one in log4j) simply work over NFS,
and MapR does not have a NN, so the problems with small files or with the
number of files do not exist.
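
For context, here is a minimal log4j 1.x sketch of the kind of size-based
rotation being described, writing straight to an NFS mount; the
/mapr/mycluster path and the 256 MB / 10-file policy are illustrative
assumptions, not anything prescribed in this thread:

  import org.apache.log4j.Logger;
  import org.apache.log4j.PatternLayout;
  import org.apache.log4j.RollingFileAppender;

  public class NfsRollover {
      public static void main(String[] args) throws Exception {
          // Hypothetical NFS mount point exposed by the cluster.
          RollingFileAppender appender = new RollingFileAppender(
                  new PatternLayout("%d %-5p %c - %m%n"),
                  "/mapr/mycluster/logs/events.log");
          appender.setMaxFileSize("256MB");  // roll over once the file reaches ~256 MB
          appender.setMaxBackupIndex(10);    // keep only the last 10 rolled files
          Logger log = Logger.getRootLogger();
          log.addAppender(appender);
          log.info("appended exactly like a local file; the NFS mount does the rest");
      }
  }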





 
 



Re: Ideal file size

2012-06-07 Thread Abhishek Pratap Singh
Almost all the answers are already provided in this post. My 2 cents: try
to have a file size that is a multiple of the block size, so that the number
of mappers used while processing is smaller and the job performs better.
You can also merge files in HDFS later on for processing.
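
As a rough illustration of the merge idea, here is a sketch that packs a
directory of small sequence files into one large one; it assumes LongWritable
keys, Text values, and made-up /data paths, so adjust to your own layout:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class MergeSmallSeqFiles {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path inputDir = new Path("/data/incoming");     // assumption: directory of small sequence files
          Path merged = new Path("/data/merged/part-0");  // assumption: merged output file

          // Assumes every input file uses LongWritable keys and Text values.
          SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, merged, LongWritable.class, Text.class);
          LongWritable key = new LongWritable();
          Text value = new Text();
          for (FileStatus status : fs.listStatus(inputDir)) {
              SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
              while (reader.next(key, value)) {   // copy every record into the big file
                  writer.append(key, value);
              }
              reader.close();
          }
          writer.close();
      }
  }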

Regards,
Abhishek






Ideal file size

2012-06-06 Thread Mohit Anchlia
We have a continuous flow of data into a sequence file. I am wondering what
the ideal file size would be before the file gets rolled over. I know too many
small files are not good, but could someone tell me what the ideal size would
be such that it doesn't overload the NameNode?


Re: Ideal file size

2012-06-06 Thread Edward Capriolo
It does not matter what the file size is, because the file is split into
blocks, and blocks are what the NN tracks.

For larger deployments you can go with a large block size like 256MB
or even 512MB. Generally, the bigger the file the better, though split
calculation is very input-format dependent.
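
A small sketch of how a larger block size can be requested per file through
the FileSystem API; the output path, buffer size and replication factor below
are placeholders, not recommendations:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BigBlockWriter {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          long blockSize = 256L * 1024 * 1024;   // ask for 256 MB blocks for this file only
          // create(path, overwrite, bufferSize, replication, blockSize): the per-file
          // block size overrides the cluster-wide default block size setting.
          FSDataOutputStream out = fs.create(
                  new Path("/data/big-blocks/events.seq"),  // assumption: example output path
                  true, 4096, (short) 3, blockSize);
          out.writeBytes("rolled-over data goes here\n");
          out.close();
      }
  }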



Re: Ideal file size

2012-06-06 Thread Harsh J
The block size and file roll size values depend on a few things here:

- The rate at which the data is getting written.
- The frequency at which your processing layer is expected to run over
these files (sync() can help here, though).
- The way you'll be processing these files (MR/etc.).

Too many small files aren't only a problem for the NameNode (far from it in
most cases); they are rather an issue for processing: you end up
wasting cycles on opening and closing files instead of doing good
contiguous block reads (which is what HDFS directly/indirectly excels at when
combined with processing).
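
To make that processing cost concrete, here is a rough sketch that estimates
how many map tasks a FileInputFormat-style job would launch over a directory
(roughly one per small file) versus how few it would need if the same bytes
were packed into big files; the 128 MB block size and the input path are
assumptions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SplitEstimate {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          long blockSize = 128L * 1024 * 1024;   // assumption: 128 MB block size
          long files = 0, bytes = 0, splits = 0;
          for (FileStatus s : fs.listStatus(new Path("/data/incoming"))) {  // assumption: input dir
              files++;
              bytes += s.getLen();
              // A file never shares a split with another file, so each small file
              // costs at least one map task (plus an open/close per file).
              splits += Math.max(1, (s.getLen() + blockSize - 1) / blockSize);
          }
          System.out.printf("%d files, %d bytes -> ~%d map tasks; "
                  + "packed into big files: ~%d%n",
                  files, bytes, splits, Math.max(1, (bytes + blockSize - 1) / blockSize));
      }
  }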




-- 
Harsh J


Re: Ideal file size

2012-06-06 Thread M. C. Srivas
There are more factors to consider than just the size of the file. How long can
you wait before you *have to* process the data? 5 minutes? 5 hours? 5
days? If you want good timeliness, you need to roll over faster. The
longer you wait:

1. the lower the load on the NN;
2. but the poorer the timeliness;
3. and the greater the chance of lost data (i.e., the data is not saved until
the file is closed and rolled over, unless you want to sync() after every
write; see the sketch just below).
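
A minimal sketch of that sync-per-write trade-off, assuming a Hadoop 2.x-era
SequenceFile.Writer (which exposes hflush()/hsync(); older releases used
syncFs() instead) and an example output path:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class DurableAppender {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, new Path("/data/rolling/current.seq"),  // assumption: example path
                  LongWritable.class, Text.class);
          for (long i = 0; i < 1000; i++) {
              writer.append(new LongWritable(i), new Text("record " + i));
              // Push the bytes to the datanodes so they survive a client crash
              // before the file is finally closed and rolled over.
              writer.hflush();
          }
          writer.close();  // closing the file is what makes it complete
      }
  }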






Re: Ideal file size

2012-06-06 Thread Mohit Anchlia

To begin with, I was going to use Flume and specify a rollover file size. I
understand the above parameters; I just want to ensure that too many small
files don't cause problems on the NameNode. For instance, there would be
times when we get GBs of data in an hour and at other times only a few
hundred MB. From what Harsh, Edward and you've described, it doesn't cause
issues with the NameNode, but rather an increase in processing time if there
are too many small files. Looks like I need to find that balance.

It would also be interesting to see how others solve this problem when not
using Flume.



