JIRA 3269 has been opened to track this behavior.
I tried iterating over the ResultSet from a JDBC program: I only looped
over the results while there were records, and no results were
processed or printed. It still took close to nine minutes to complete
execution.

Here is a snippet of what I did from JDBC.

String query = "select count(*) over(partition by cast(columns[1] as varchar(25)) "
    + "order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";

ResultSet rs = stmt.executeQuery(query);
while (rs.next()) {
    System.out.println("1");
}
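Since Hakim asks further down the thread for a script to generate similar data, here is a minimal sketch in Java that writes a CSV of the same shape: a bigint-like first column and one of the five key strings reported below as the second column. The file name, row count default, and random distribution are assumptions, not the exact data that was used.

```java
import java.io.PrintWriter;
import java.util.Random;

public class GenDuplicates {
    // The five distinct key values reported in the thread.
    static final String[] KEYS = {
        "AAAABBBBCCCCDDDDEEEE", "FFFFGGGGHHHHIIIIJJJJ", "KKKKLLLLMMMMNNNNOOOO",
        "PPPPQQQQRRRRSSSSTTTT", "UUUUVVVVWWWWXXXXZZZZ"
    };

    // Emit rowCount lines of "<bigint>,<varchar>", each key drawn uniformly
    // at random so every key ends up with roughly rowCount / 5 duplicates.
    static void generate(PrintWriter out, int rowCount, long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < rowCount; i++) {
            out.println(rnd.nextInt(1_000_000) + "," + KEYS[rnd.nextInt(KEYS.length)]);
        }
    }

    public static void main(String[] args) throws Exception {
        try (PrintWriter out = new PrintWriter("manyDuplicates.csv")) {
            generate(out, 1_000_000, 42L);
        }
    }
}
```

The exact per-key counts will differ slightly from the ones below, but the partition structure (five keys, ~200k rows each) is the same.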

On Tue, Jun 9, 2015 at 9:56 PM, Steven Phillips <[email protected]>
wrote:

> In cases like this, where you are printing millions of records in SQLLine,
> you should pipe the output to /dev/null or to a file and measure the
> performance that way. I'm guessing that most of the time in this case is
> spent printing the output to the console, and is thus really unrelated to
> Drill performance. If piping the data to a file or /dev/null makes the
> query run much faster, then it probably isn't a real issue.
>
> Also, any time you are investigating a performance-related issue, you
> should always check the profile. In this case, I suspect you might see that
> most of the time is spent in the WAIT time of the SCREEN operator. That
> would indicate that client-side processing is slowing the query down.
>
> On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <[email protected]
> >
> wrote:
>
> > Please open a JIRA issue, and please provide the test file (compressed)
> > or a script to generate similar data.
> >
> > Thanks!
> >
> > On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]>
> > wrote:
> >
> > > A query that uses window functions takes too long to complete and
> > > return results. It returns close to a million records, which took 533.8
> > > seconds (~9 minutes).
> > > The input CSV file has two columns, one integer and one varchar. Please
> > > let me know if this needs to be investigated; I can report a JIRA to
> > > track it if required.
> > >
> > > Size of the input CSV file:
> > >
> > > [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> > >
> > > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26
> > > /tmp/manyDuplicates.csv
> > >
> > > {code}
> > >
> > > select count(*) over(partition by cast(columns[1] as varchar(25)) order
> > by
> > > cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> > >
> > > ...
> > >
> > > 1,000,007 rows selected (533.857 seconds)
> > > {code}
> > >
> > > There are five distinct values in columns[1] in the CSV file. = [FIVE
> > > PARTITIONS]
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
> > > `manyDuplicates.csv`;
> > >
> > > +-----------------------+
> > > |        EXPR$0         |
> > > +-----------------------+
> > > | FFFFGGGGHHHHIIIIJJJJ  |
> > > | PPPPQQQQRRRRSSSSTTTT  |
> > > | AAAABBBBCCCCDDDDEEEE  |
> > > | UUUUVVVVWWWWXXXXZZZZ  |
> > > | KKKKLLLLMMMMNNNNOOOO  |
> > > +-----------------------+
> > >
> > > 5 rows selected (1.906 seconds)
> > > {code}
> > >
> > > Here is the count for each of those values in columns[1]:
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> > >
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 200484  |
> > > +---------+
> > >
> > > 1 row selected (0.961 seconds)
> > >
> > > {code}
> > >
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> > >
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 199353  |
> > > +---------+
> > >
> > > 1 row selected (0.86 seconds)
> > >
> > > {code}
> > >
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> > >
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 200702  |
> > > +---------+
> > >
> > > 1 row selected (0.826 seconds)
> > >
> > > {code}
> > >
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> > >
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 199916  |
> > > +---------+
> > >
> > > 1 row selected (0.851 seconds)
> > >
> > > {code}
> > >
> > >
> > > {code}
> > >
> > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> > >
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 199552  |
> > > +---------+
> > >
> > > 1 row selected (0.827 seconds)
> > > {code}
> > >
> > > Thanks,
> > > Khurram
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>
