Drill's default behavior is to use estimates to determine the number of
files that will be written.  The equation is fairly complicated, but there
are three key variables that impact file splits:

planner.slice_target: the target number of records allowed within a single
slice before parallelization is increased (defaults to 1,000,000 in 0.4,
100,000 in 0.5)
planner.width.max_per_node: the maximum number of slices run per node
(defaults to 0.7 * core count)
store.parquet.block-size: the largest allowed row group when generating
Parquet files (defaults to 512 MB)
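You can inspect the current values of these options at runtime.  A minimal
sketch (the sys.options table is standard Drill, though its exact columns
vary by version):

```sql
-- Show the current values of the options that influence file splits
SELECT *
FROM sys.options
WHERE name IN ('planner.slice_target',
               'planner.width.max_per_node',
               'store.parquet.block-size');
```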

If you are getting more files than you would like, you can
decrease planner.width.max_per_node.
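For example, you could lower the per-node width before running your CTAS
statement.  A sketch using Drill's standard ALTER SESSION syntax (the value
2 and the table/query names are placeholders, not recommendations):

```sql
-- Limit each node to at most 2 slices for this session only,
-- which reduces the number of output files a CTAS produces
ALTER SESSION SET `planner.width.max_per_node` = 2;

CREATE TABLE dfs.tmp.my_parquet_table AS
SELECT * FROM dfs.`/path/to/source`;
```

ALTER SESSION scopes the change to the current connection; use ALTER SYSTEM
instead if you want the setting to apply cluster-wide.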

It's likely that Jim Scott's experience of getting a smaller number of files
was due to running on a machine with fewer cores, or to the optimizer
estimating a smaller amount of data in the output.  The behavior is data-
and machine-dependent.

thanks,
Jacques


On Mon, Sep 8, 2014 at 8:32 AM, Jim Scott <[email protected]> wrote:

> I have created tables with Drill in parquet format and it created 2 files.
>
>
> On Fri, Sep 5, 2014 at 3:46 PM, Jim <[email protected]> wrote:
>
> >
> > Actually, it looks like it always breaks it into 6 pieces by default. Is
> > there a way to make the partition size fixed rather than the number of
> > partitions?
> >
> >
> > On 09/05/2014 04:40 PM, Jim wrote:
> >
> >> Hello all,
> >>
> >> I've been experimenting with drill to load data into Parquet files. I
> >> noticed rather large variability in the size of each parquet chunk. Is
> >> there a way to control this?
> >>
> >> The documentation seems a little sparse on configuring some of the finer
> >> details. My apologies if I missed something obvious.
> >>
> >> Thanks
> >> Jim
> >>
> >>
> >
>
>
> --
> *Jim Scott*
> Director, Enterprise Strategy & Architecture
>
>