Great! I will re-run the query on the latest build and also verify the results against Postgres.
On Wed, Jun 10, 2015 at 9:27 AM, Abdel Hakim Deneche <[email protected]> wrote:

> I tried the query using the new implementation (DRILL-3200) and it's much
> faster: 14 seconds compared to 523 seconds using the current
> implementation. I didn't check the results though.
>
> On Tue, Jun 9, 2015 at 11:30 PM, Khurram Faraaz <[email protected]> wrote:
>
> > JIRA 3269 is opened to track this behavior.
> > I tried to iterate over the ResultSet from a JDBC program; I only
> > iterated over the results while there were records, and no results were
> > processed/printed. It still took close to nine minutes to complete
> > execution.
> >
> > Here is a snippet of what I did from JDBC.
> >
> > String query = "select count(*) over(partition by cast(columns[1] as
> >     varchar(25)) order by cast(columns[0] as bigint)) from
> >     `manyDuplicates.csv`";
> >
> > ResultSet rs = stmt.executeQuery(query);
> >
> > while (rs.next()) {
> >     System.out.println("1");
> > }
> >
> > On Tue, Jun 9, 2015 at 9:56 PM, Steven Phillips <[email protected]> wrote:
> >
> > > In cases like this, where you are printing millions of records in
> > > SQLLine, you should pipe the output to /dev/null or to a file, and
> > > measure the performance that way. I'm guessing that most of the time
> > > in this case is spent printing the output to the console, and is thus
> > > really unrelated to Drill performance. If piping the data to a file
> > > or /dev/null causes the query to run much faster, then it probably
> > > isn't a real issue.
> > >
> > > Also, any time you are investigating a performance-related issue, you
> > > should always check the profile. In this case, I suspect you might
> > > see that most of the time is spent in the WAIT time of the SCREEN
> > > operator. That would indicate that client-side processing is slowing
> > > the query down.
> > >
> > > On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <[email protected]> wrote:
> > >
> > > > Please open a JIRA issue. Please provide the test file (compressed)
> > > > or a script to generate similar data.
> > > >
> > > > Thanks!
> > > >
> > > > On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]> wrote:
> > > >
> > > > > A query that uses window functions takes too long to complete and
> > > > > return results. It returns close to a million records, for which
> > > > > it took 533.8 seconds (~9 minutes).
> > > > > The input CSV file has two columns: one integer column and one
> > > > > varchar column. Please let me know if this needs to be
> > > > > investigated, and I can report a JIRA to track it if required.
> > > > >
> > > > > Size of the input CSV file:
> > > > >
> > > > > [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> > > > > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
> > > > >
> > > > > {code}
> > > > > select count(*) over(partition by cast(columns[1] as varchar(25))
> > > > > order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> > > > >
> > > > > ...
> > > > >
> > > > > 1,000,007 rows selected (533.857 seconds)
> > > > > {code}
> > > > >
> > > > > There are five distinct values in columns[1] in the CSV file.
> > > > > [FIVE PARTITIONS]
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
> > > > > `manyDuplicates.csv`;
> > > > > +-----------------------+
> > > > > |        EXPR$0         |
> > > > > +-----------------------+
> > > > > | FFFFGGGGHHHHIIIIJJJJ  |
> > > > > | PPPPQQQQRRRRSSSSTTTT  |
> > > > > | AAAABBBBCCCCDDDDEEEE  |
> > > > > | UUUUVVVVWWWWXXXXZZZZ  |
> > > > > | KKKKLLLLMMMMNNNNOOOO  |
> > > > > +-----------------------+
> > > > > 5 rows selected (1.906 seconds)
> > > > > {code}
> > > > >
> > > > > Here is the count for each of those values in columns[1]:
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 200484  |
> > > > > +---------+
> > > > > 1 row selected (0.961 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199353  |
> > > > > +---------+
> > > > > 1 row selected (0.86 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 200702  |
> > > > > +---------+
> > > > > 1 row selected (0.826 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199916  |
> > > > > +---------+
> > > > > 1 row selected (0.851 seconds)
> > > > > {code}
> > > > >
> > > > > {code}
> > > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > > > > `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> > > > > +---------+
> > > > > | EXPR$0  |
> > > > > +---------+
> > > > > | 199552  |
> > > > > +---------+
> > > > > 1 row selected (0.827 seconds)
> > > > > {code}
> > > > >
> > > > > Thanks,
> > > > > Khurram
> > > >
> > > > --
> > > > Abdelhakim Deneche
> > > > Software Engineer
> > > > <http://www.mapr.com/>
> > >
> > > --
> > > Steven Phillips
> > > Software Engineer
> > > mapr.com
>
> --
> Abdelhakim Deneche
> Software Engineer
> <http://www.mapr.com/>
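For reference, here is a self-contained version of the JDBC test Khurram describes above; it is a sketch, not code from the original thread. It runs the same window query, drains the ResultSet without printing each row, and reports the elapsed time, so the measurement excludes console output. The connection URL (zk=localhost:2181), the schema name dfs.tmp, and the class name are assumptions, and the Drill JDBC driver must be on the classpath.

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WindowQueryTimer {
  public static void main(String[] args) throws Exception {
    // Assumed connection settings -- adjust the ZooKeeper quorum and schema
    // for your cluster.
    String url = "jdbc:drill:zk=localhost:2181;schema=dfs.tmp";
    String query = "select count(*) over(partition by cast(columns[1] as varchar(25)) "
        + "order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";

    long rows = 0;
    long start = System.currentTimeMillis();
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(query)) {
      // Count rows instead of printing them so client-side I/O stays cheap.
      while (rs.next()) {
        rows++;
      }
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(rows + " rows drained in " + elapsed + " ms");
  }
}
{code}

Comparing this elapsed time with the sqlline wall-clock time, and with the WAIT time of the SCREEN operator in the query profile, separates client-side printing cost from query execution cost, which is the distinction Steven draws above.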
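Hakim also asks above for the test file or a script to generate similar data. The following is a rough sketch, not a script from the thread, of how comparably shaped data could be generated: 1,000,007 rows, columns[0] a random integer that casts cleanly to bigint, and columns[1] one of the five 20-character values listed above. The output path, random seed, and value range are assumptions, and the original file's exact ordering and per-value counts are not reproduced.

{code}
import java.io.PrintWriter;
import java.util.Random;

public class GenerateManyDuplicates {
  public static void main(String[] args) throws Exception {
    // The five distinct partition keys reported in the thread.
    String[] keys = {
        "AAAABBBBCCCCDDDDEEEE", "FFFFGGGGHHHHIIIIJJJJ",
        "KKKKLLLLMMMMNNNNOOOO", "PPPPQQQQRRRRSSSSTTTT",
        "UUUUVVVVWWWWXXXXZZZZ"
    };
    int rows = 1_000_007;          // matches the row count reported above
    Random rnd = new Random(42);   // fixed seed for repeatability (assumed)
    try (PrintWriter out = new PrintWriter("manyDuplicates.csv", "UTF-8")) {
      for (int i = 0; i < rows; i++) {
        int value = rnd.nextInt(1_000_000);           // columns[0]: castable to bigint
        String key = keys[rnd.nextInt(keys.length)];  // columns[1]: one of five keys
        out.println(value + "," + key);
      }
    }
  }
}
{code}

With a uniform random choice among the five keys, each partition ends up with roughly 200,000 rows, in line with the per-value counts shown above.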
