Using the right email for Nezih
On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:

> I think this can be useful.
>
> The only thing is that we are slowly migrating to the Dataset/DataFrame
> API, and leave RDD mostly as is as a lower level API. Maybe we should do
> both? In either case it would be great to discuss the API on a pull
> request. Cheers.
>
> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
> nyigitb...@netflix.com.invalid> wrote:
>
>> Hi Spark devs,
>>
>> I have sent an email about my problem some time ago where I want to merge
>> a large number of small files with Spark. Currently I am using Hive with
>> the CombineHiveInputFormat and I can control the size of the output
>> files with the max split size parameter (which is used for coalescing the
>> input splits by the CombineHiveInputFormat). My first attempt was to use
>> coalesce(), but since coalesce only considers the target number of
>> partitions the output file sizes were varying wildly.
>>
>> What I think can be useful is to have an optional PartitionCoalescer
>> parameter (a new interface) in the coalesce() method (or maybe we can
>> add a new method?) that the callers can implement for custom coalescing
>> strategies. For my use case I have already implemented a
>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>> their sizes and by using a max split size parameter, similar to the
>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access to
>> the individual split sizes etc.).
>>
>> What do you guys think about such a change, can it be useful to other
>> users as well? Or do you think that there is an easier way to accomplish
>> the same merge logic? If you think it may be useful, I already have an
>> implementation and I will be happy to work with the community to contribute
>> it.
>>
>> Thanks,
>> Nezih
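
For readers following the thread, here is a minimal sketch of the size-based grouping idea described in the quoted message: pack input splits into groups whose total size stays under a max split size. It is plain Python with no Spark dependency. The names coalesce_by_size, split_sizes and max_split_size are hypothetical, and the greedy strategy is an assumption for illustration; it is not the SizeBasedPartitionCoalescer mentioned above, nor the algorithm used by CombineHiveInputFormat.

```python
from typing import List, Sequence


def coalesce_by_size(split_sizes: Sequence[int], max_split_size: int) -> List[List[int]]:
    """Greedily group consecutive split indices by size.

    Hypothetical helper for illustration only. Each returned group holds the
    indices of the input splits that would be merged into one output
    partition. A single split larger than max_split_size gets its own group.
    """
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0

    for index, size in enumerate(split_sizes):
        # Close the current group if adding this split would exceed the cap.
        if current and current_bytes + size > max_split_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size

    # Flush the last, possibly partial, group.
    if current:
        groups.append(current)
    return groups
```

With split sizes [10, 20, 30, 100, 5, 5] and a max split size of 64, this returns [[0, 1, 2], [3], [4, 5]]: three output partitions of 60, 100 and 10 units. A plain coalesce to three partitions, by contrast, only fixes the partition count and does not look at sizes.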