Hi Gourav,
In our case, we process raw logs into parquet tables that downstream
applications can use for other jobs. The desired outcome is that we only
need to worry about unbalanced input data at the preprocess step so that
downstream jobs can assume balanced input data.
In our specific case, t
Hi,
Using file size is a very bad way of managing data, provided you think that
volume, variety, and veracity do not hold true. It's a poor way of thinking
about and designing data solutions: you are bound to hit bottlenecks,
optimization issues, and manual interventions.
I have found thi
The primary goal for balancing partitions would be the write to S3. We
would like to prevent unbalanced partitions (which we can do with repartition),
but also avoid partitions that are too small or too large.
So for that case, getting the cache size would work, Maropu, if it's roughly
accurate, but fo
Hi,
Since the final size depends on data types and compression, I've had to first
get a rough estimate of the data written to disk, then compute the number of
partitions.
partitions = int(ceil(size_data * conversion_ratio / block_size))
In my case the block size is 256 MB, the source is txt, and the destination is snappy par
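The formula above can be written out directly; the conversion ratio is
whatever you measure empirically for your own source/destination formats
(e.g. raw text versus snappy parquet), so the example value is an assumption.

```python
from math import ceil

MB = 1024 ** 2

def num_partitions(size_data, conversion_ratio, block_size=256 * MB):
    """The poster's formula: scale the input size by a measured
    conversion ratio, then divide by the desired block size."""
    return int(ceil(size_data * conversion_ratio / block_size))

# e.g. 100 GiB of raw text that shrinks ~5x when written as snappy parquet:
# num_partitions(100 * 1024**3, 0.2) targets ~256 MB per output file.
```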
Hi,
There is no simple way to access the size on the driver side.
Since partitions of primitive-typed data (e.g., int) are compressed by
`DataFrame#cache`, the actual cached size can differ somewhat from the size
of the partitions being processed.
// maropu
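Given that there is no direct driver-side API for this, one common workaround
(an assumption here, not something maropu prescribes) is to write a small
sample in the target format, measure it, and extrapolate. The scaling step is
trivial but worth keeping honest about its roughness:

```python
def extrapolate_size(sample_bytes, sample_fraction):
    """Estimate total on-disk size from a written sample.
    Rough by nature: compression ratios vary with the sample taken."""
    assert 0 < sample_fraction <= 1, "fraction must be in (0, 1]"
    return int(sample_bytes / sample_fraction)
```

In Spark this might pair with something like `df.sample(0.01)` written to a
temporary path, then feeding the measured bytes into the formula above.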
On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodri