Great! I will re-run the query on the latest build and also verify the results against Postgres.
On Wed, Jun 10, 2015 at 9:27 AM, Abdel Hakim Deneche <[email protected]> wrote:

> I tried the query using the new implementation (DRILL-3200) and it's much
> faster: 14 seconds compared to 523 seconds using the current
> implementation. I didn't check the results though.
>
> On Tue, Jun 9, 2015 at 11:30 PM, Khurram Faraaz <[email protected]> wrote:
>
> > JIRA 3269 is opened to track this behavior.
> > I tried to iterate over the ResultSet from a JDBC program; I only
> > iterated over the results while there were records, and no results were
> > processed/printed. It still took close to nine minutes to complete
> > execution.
> >
> > Here is a snippet of what I did from JDBC.
> >
> > String query = "select count(*) over(partition by cast(columns[1] as
> >     varchar(25)) order by cast(columns[0] as bigint)) from
> >     `manyDuplicates.csv`";
> >
> > ResultSet rs = stmt.executeQuery(query);
> >
> > while (rs.next()) {
> >     System.out.println("1");
> > }
> >
> > On Tue, Jun 9, 2015 at 9:56 PM, Steven Phillips <[email protected]> wrote:
> >
> > > In cases like this, where you are printing millions of records in
> > > SQLLine, you should pipe the output to /dev/null or to a file, and
> > > measure the performance that way. I'm guessing that most of the time
> > > in this case is spent printing the output to the console, and is thus
> > > really unrelated to Drill performance. If piping the data to a file
> > > or /dev/null causes the query to run much faster, then it probably
> > > isn't a real issue.
> > >
> > > Also, any time you are investigating a performance-related issue, you
> > > should always check the profile. In this case, I suspect you might
> > > see that most of the time is spent in the WAIT time of the SCREEN
> > > operator. That would indicate that client-side processing is slowing
> > > the query down.
> > >
> > > On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <[email protected]> wrote:
> > >
> > > > Please open a JIRA issue. Please provide the test file (compressed)
> > > > or a script to generate similar data.
> > > >
> > > > Thanks!
> > > >
> > > > On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]> wrote:
> > > >
> > > > > A query that uses window functions takes too long to complete and
> > > > > return results. It returns close to a million records, for which
> > > > > it took 533.8 seconds (~9 minutes).
> > > > > The input CSV file has two columns: one integer column and one
> > > > > varchar column. Please let me know if this needs to be
> > > > > investigated, and I can report a JIRA to track it if required.
> > > > >
> > > > > Size of the input CSV file:
> > > > >
> > > > > [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> > > > > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
> > > > >
> > > > > {code}
> > > > > select count(*) over(partition by cast(columns[1] as varchar(25))
> > > > > order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> > > > >
> > > > > ...
> > > > >
> > > > > 1,000,007 rows selected (533.857 seconds)
> > > > > {code}
> > > > >
> > > > > There are five distinct values in columns[1] in the CSV file.
> > > > > [FIVE PARTITIONS]
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
> > > > > `manyDuplicates.csv`;
> > > > > +-----------------------+
> > > > > |        EXPR$0         |
> > > > > +-----------------------+
> > > > > | FFFFGGGGHHHHIIIIJJJJ  |
> > > > > | PPPPQQQQRRRRSSSSTTTT  |
> > > > > | AAAABBBBCCCCDDDDEEEE  |
> > > > > | UUUUVVVVWWWWXXXXZZZZ  |
> > > > > | KKKKLLLLMMMMNNNNOOOO  |
> > > > > +-----------------------+
> > > > > 5 rows selected (1.906 seconds)
> > > > > {code}
> > > > >
> > > > > Here is the count for each of those values in columns[1]:
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 200484  |
> > > > > +---------+
> > > > > 1 row selected (0.961 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199353  |
> > > > > +---------+
> > > > > 1 row selected (0.86 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 200702  |
> > > > > +---------+
> > > > > 1 row selected (0.826 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199916  |
> > > > > +---------+
> > > > > 1 row selected (0.851 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199552  |
> > > > > +---------+
> > > > > 1 row selected (0.827 seconds)
> > > > > {code}
> > > > >
> > > > > Thanks,
> > > > > Khurram
> > > >
> > > > --
> > > > Abdelhakim Deneche
> > > > Software Engineer
> > > > <http://www.mapr.com/>
> > >
> > > --
> > > Steven Phillips
> > > Software Engineer
> > > mapr.com
>
> --
> Abdelhakim Deneche
> Software Engineer
> <http://www.mapr.com/>
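For reference, here is a self-contained version of the JDBC test Khurram describes above; it is a sketch, not code from the original thread. It runs the same window query, drains the ResultSet without printing each row, and reports the elapsed time, so the measurement excludes console output. The connection URL (zk=localhost:2181), the schema name dfs.tmp, and the class name are assumptions, and the Drill JDBC driver must be on the classpath.

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WindowQueryTimer {
  public static void main(String[] args) throws Exception {
    // Assumed connection settings -- adjust the ZooKeeper quorum and schema
    // for your cluster.
    String url = "jdbc:drill:zk=localhost:2181;schema=dfs.tmp";
    String query = "select count(*) over(partition by cast(columns[1] as varchar(25)) "
        + "order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";

    long rows = 0;
    long start = System.currentTimeMillis();
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(query)) {
      // Count rows instead of printing them so client-side I/O stays cheap.
      while (rs.next()) {
        rows++;
      }
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(rows + " rows drained in " + elapsed + " ms");
  }
}
{code}

Comparing this elapsed time with the sqlline wall-clock time, and with the WAIT time of the SCREEN operator in the query profile, separates client-side printing cost from query execution cost, which is the distinction Steven draws above.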
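Hakim also asks above for the test file or a script to generate similar data. The following is a rough sketch, not a script from the thread, of how comparably shaped data could be generated: 1,000,007 rows, columns[0] a random integer that casts cleanly to bigint, and columns[1] one of the five 20-character values listed above. The output path, random seed, and value range are assumptions, and the original file's exact ordering and per-value counts are not reproduced.

{code}
import java.io.PrintWriter;
import java.util.Random;

public class GenerateManyDuplicates {
  public static void main(String[] args) throws Exception {
    // The five distinct partition keys reported in the thread.
    String[] keys = {
        "AAAABBBBCCCCDDDDEEEE", "FFFFGGGGHHHHIIIIJJJJ",
        "KKKKLLLLMMMMNNNNOOOO", "PPPPQQQQRRRRSSSSTTTT",
        "UUUUVVVVWWWWXXXXZZZZ"
    };
    int rows = 1_000_007;          // matches the row count reported above
    Random rnd = new Random(42);   // fixed seed for repeatability (assumed)
    try (PrintWriter out = new PrintWriter("manyDuplicates.csv", "UTF-8")) {
      for (int i = 0; i < rows; i++) {
        int value = rnd.nextInt(1_000_000);           // columns[0]: castable to bigint
        String key = keys[rnd.nextInt(keys.length)];  // columns[1]: one of five keys
        out.println(value + "," + key);
      }
    }
  }
}
{code}

With a uniform random choice among the five keys, each partition ends up with roughly 200,000 rows, in line with the per-value counts shown above.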
