Don't have the numbers for kernel and user times, but let me see if I can dig out the other numbers. Kunal did a bunch of the work (I've updated the doc to reflect his contribution :) ). The purpose of Solo was, in fact, to establish how fast I could drive the file system after eliminating the decoding and decompression of data, while still reading a Parquet file the way we would want to in Drill. If we have the numbers, I'll add them to the doc.
On Tue, Jul 26, 2016 at 5:11 PM, Jacques Nadeau <[email protected]> wrote:

> Sorry my first email wasn't clearer (and had missing words).
>
> My question was: what is the maximum direct byte throughput of the
> underlying filesystem you're reading against (when not cached)? Let's call
> that the Optimal case. One way to measure this might be a parallel
> hdfs dfs -cat "file" > /dev/null.
>
> The second question is the kernel, user, and io wait time per workload, so
> we could get a snapshot something like this:
>
> | Reader    | Transfer Rate | Kernel | User | IO |
> | Drill 1.7 |               |        |      |    |
> | Other     |               |        |      |    |
> | Solo      |               |        |      |    |
> | Optimal   |               |        |      |    |
>
> If the specific kernel and user times are too difficult to get (probably
> so in the 1.7 and Other cases), maybe just io wait, cpu load, and total
> test duration for a fixed workload for each would suffice?
>
> Even if this isn't possible, there's lots of great stuff in what you put
> together. Was just trying to understand the bounding box.
>
> thanks,
> Jacques
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Jul 25, 2016 at 3:17 PM, Parth Chandra <[email protected]> wrote:
> >
> > Didn't quite catch your question there. But I do have the following
> > numbers from the file system -
> >
> > | Reader                 | Avg IOR Op Size (KB) | Estimated Ops/Disk |
> > | Drill 1.7.0 - uncached | 239                  | 103                |
> > | Solo - uncached        | 240                  | 281                |
> >
> > The numbers are approximate, as these are captured by scripts on all
> > the nodes and then averaged by another script.
> >
> > Solo is close to as fast as is possible from disk.
> >
> > Is that what you were looking for?
> >
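For reference, here is a minimal sketch of the measurement Jacques describes: drive a parallel "hdfs dfs -cat <file> > /dev/null" across several files, and capture the user/kernel/iowait split over the run from Linux /proc/stat. The file paths and degree of parallelism below are hypothetical placeholders, not values from this thread.

    #!/usr/bin/env python3
    # Sketch only: aggregate read throughput of parallel
    # "hdfs dfs -cat <file> > /dev/null" runs, plus the user/kernel/iowait
    # CPU split over the run (first line of Linux /proc/stat).
    # FILES and PARALLELISM are hypothetical placeholders.
    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    FILES = ["/data/parquet/part-%05d.parquet" % i for i in range(8)]
    PARALLELISM = 8

    def cpu_times():
        # /proc/stat line 1: "cpu user nice system idle iowait irq ..."
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:6]
        user, nice, system, idle, iowait = (int(x) for x in fields)
        return user + nice, system, idle, iowait

    def cat_file(path):
        # Stream the file through the HDFS client, discarding the bytes,
        # then ask HDFS for the file size so we can compute bytes moved.
        with open("/dev/null", "wb") as devnull:
            subprocess.run(["hdfs", "dfs", "-cat", path],
                           stdout=devnull, check=True)
        du = subprocess.run(["hdfs", "dfs", "-du", path], check=True,
                            capture_output=True, text=True).stdout
        return int(du.split()[0])  # first field is the size in bytes

    u0, k0, i0, w0 = cpu_times()
    start = time.time()
    with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
        nbytes = sum(pool.map(cat_file, FILES))
    secs = time.time() - start
    u1, k1, i1, w1 = cpu_times()

    total = (u1 - u0) + (k1 - k0) + (i1 - i0) + (w1 - w0)
    print("throughput: %.1f MB/s" % (nbytes / secs / 1e6))
    print("user %.1f%%  kernel %.1f%%  iowait %.1f%%"
          % (100.0 * (u1 - u0) / total,
             100.0 * (k1 - k0) / total,
             100.0 * (w1 - w0) / total))

As a rough sanity check on the table above: if "Estimated Ops/Disk" is per second (an assumption; the units aren't stated in the thread), Solo's 281 ops x 240 KB works out to roughly 67 MB/s per disk, which is in the neighborhood of what a single spinning disk can sustain.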
