All, Can we please have functions in Drill like (TO_DATE, TO_TIME, TO_TIMESTAMP etc.) handle empty fields and produce a null value.
DRILL-3214 was marked as Resolved, I don't think that issue is Fixed. Thanks, Khurram On Sat, Oct 29, 2016 at 2:42 AM, Veera Naranammalpuram < [email protected]> wrote: > I would expect it to work. You should just reopen DRILL-3214. You have > already created one for this. > > -Veera > > On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht < > [email protected]> wrote: > > > You want to use MM for month and not mm for minute as imm can produce the > > wornd result. > > > > Probably best to file an enhancement JIRA to have the function handle > > empty fields and produce a null value. Then the wider audience can review > > the merit for implementation. > > > > --Andries > > > > > > > On Oct 28, 2016, at 9:09 AM, Khurram Faraaz <[email protected]> > > wrote: > > > > > > Thanks Andries and Veera. > > > > > > 1. Yes, my CSV file does have empty strings in some rows in columns[4]. > > > 2. it worked for parquet because I had used the case expression to cast > > > empty strings to NULL. > > > 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned > > results > > > with both representations. > > > > > > Question - Shouldn't Drill handle such empty strings that are within > rows > > > in CSV files ? > > > Why should user have to take care such cases ? > > > > > > Regards, > > > Khurram > > > > > > On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram < > > > [email protected]> wrote: > > > > > >> That should work and a lot faster too. Thanks for the pointer. > > >> > > >> -Veera > > >> > > >> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht < > > >> [email protected]> wrote: > > >> > > >>> Good catch on empty string Veera! > > >>> > > >>> Wouldn't it be cheaper to check for an empty string? > > >>> case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd' > ) > > >> end > > >>> > > >>> I don't think the option to read csv empty columns (or empty string > in > > >> any > > >>> text reader) as null is in the reader yet. So we can't check with > > >> columns[] > > >>> is null. > > >>> > > >>> > > >>> --Andries > > >>> > > >>> > > >>>> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram < > > >>> [email protected]> wrote: > > >>>> > > >>>> Do you have zero length strings in your data? I have seen cases > where > > >> the > > >>>> system option to cast empty strings to NULL doesn't work as > > advertised. > > >>> You > > >>>> should re-open DRILL-3214. > > >>>> > > >>>> When I run into this problem, I usually use a regex to workaround. > The > > >>>> PROJECT takes a performance hit when you do this for larger data > sets > > >> but > > >>>> it works. > > >>>> > > >>>> $cat nulls.psv > > >>>> date_col|string_col > > >>>> |test > > >>>> 2016-10-28|test2 > > >>>> $ sqlline > > >>>> apache drill 1.8.0 > > >>>> "a little sql for your nosql" > > >>>> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`; > > >>>> +-------------+-------------+ > > >>>> | date_col | string_col | > > >>>> +-------------+-------------+ > > >>>> | | test | > > >>>> | 2016-10-28 | test2 | > > >>>> +-------------+-------------+ > > >>>> 2 rows selected (0.303 seconds) > > >>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from > > `nulls.psv`; > > >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: "" > > >>>> > > >>>> Fragment 0:0 > > >>>> > > >>>> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on > > >>> se-node10.se.lab:31010] > > >>>> (state=,code=0) > > >>>> 0: jdbc:drill:> select case when date_col similar to '[0-9]+%' then > > >>>> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted > > from > > >>>> `nulls.psv`; > > >>>> +---------------------+ > > >>>> | date_col_converted | > > >>>> +---------------------+ > > >>>> | null | > > >>>> | 2016-10-28 | > > >>>> +---------------------+ > > >>>> 2 rows selected (0.521 seconds) > > >>>> 0: jdbc:drill:> alter system set > > >>>> `drill.exec.functions.cast_empty_string_to_null` = true; > > >>>> +-------+--------------------------------------------------- > -------+ > > >>>> | ok | summary | > > >>>> +-------+--------------------------------------------------- > -------+ > > >>>> | true | drill.exec.functions.cast_empty_string_to_null updated. > | > > >>>> +-------+--------------------------------------------------- > -------+ > > >>>> 1 row selected (0.304 seconds) > > >>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from > > >> `nulls.psv`; > > >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: "" > > >>>> > > >>>> Fragment 0:0 > > >>>> > > >>>> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on > > >>> se-node10.se.lab:31010] > > >>>> (state=,code=0) > > >>>> 0: jdbc:drill:> > > >>>> > > >>>> -Veera > > >>>> > > >>>> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz < > [email protected] > > > > > >>>> wrote: > > >>>> > > >>>>> All, > > >>>>> > > >>>>> Question is - why does it work for a parquet column and fails when > > CSV > > >>>>> column is used ? > > >>>>> > > >>>>> Drill 1.9.0 commit : a29f1e29 > > >>>>> > > >>>>> This is a simple project of column from a csv file, works. > > >>>>> {noformat} > > >>>>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM > `typeall_l.csv` > > >> t1 > > >>>>> limit 5; > > >>>>> +-------------+ > > >>>>> | EXPR$0 | > > >>>>> +-------------+ > > >>>>> | 2011-11-04 | > > >>>>> | 1986-10-22 | > > >>>>> | 1992-09-10 | > > >>>>> | 2016-08-07 | > > >>>>> | 1986-01-25 | > > >>>>> +-------------+ > > >>>>> 5 rows selected (0.26 seconds) > > >>>>> {noformat} > > >>>>> > > >>>>> Using TO_DATE function with columns[x] as first input fails, with > an > > >>>>> IllegalArgumentException > > >>>>> {noformat} > > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm- > > dd') > > >>> FROM > > >>>>> `typeall_l.csv` t1 limit 5; > > >>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: "" > > >>>>> > > >>>>> Fragment 0:0 > > >>>>> > > >>>>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on > > >>> centos-01.qa.lab:31010] > > >>>>> (state=,code=0) > > >>>>> {noformat} > > >>>>> > > >>>>> However, interestingly same query over parquet column returns > correct > > >>>>> results, on same data. > > >>>>> > > >>>>> {noformat} > > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd') > > FROM > > >>>>> typeall_l limit 5; > > >>>>> +-------------+ > > >>>>> | EXPR$0 | > > >>>>> +-------------+ > > >>>>> | 2011-01-04 | > > >>>>> | 1986-01-22 | > > >>>>> | 1992-01-10 | > > >>>>> | 2016-01-07 | > > >>>>> | 1986-01-25 | > > >>>>> +-------------+ > > >>>>> 5 rows selected (0.286 seconds) > > >>>>> {noformat} > > >>>>> > > >>>>> When the date string is passed as first input, to_date function > > >> returns > > >>>>> correct results. > > >>>>> {noformat} > > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm- > > >> dd') > > >>>>> from (values(1)); > > >>>>> +-------------+ > > >>>>> | EXPR$0 | > > >>>>> +-------------+ > > >>>>> | 2011-01-04 | > > >>>>> +-------------+ > > >>>>> 1 row selected (0.235 seconds) > > >>>>> {noformat} > > >>>>> > > >>>>> Thanks, > > >>>>> Khurram > > >>>>> > > >>>> > > >>>> > > >>>> > > >>>> -- > > >>>> Veera Naranammalpuram > > >>>> Product Specialist - SQL on Hadoop > > >>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)* > > >>>> *(Email) [email protected] < > [email protected]> > > * > > >>>> *(Mobile) 917 683 8116 - can text * > > >>>> *Timezone: ET (UTC -5:00 / -4:00)* > > >>> > > >>> > > >> > > >> > > >> -- > > >> Veera Naranammalpuram > > >> Product Specialist - SQL on Hadoop > > >> *MapR Technologies (www.mapr.com <http://www.mapr.com>)* > > >> *(Email) [email protected] <[email protected]> > * > > >> *(Mobile) 917 683 8116 - can text * > > >> *Timezone: ET (UTC -5:00 / -4:00)* > > >> > > > > > > > -- > Veera Naranammalpuram > Product Specialist - SQL on Hadoop > *MapR Technologies (www.mapr.com <http://www.mapr.com>)* > *(Email) [email protected] <[email protected]>* > *(Mobile) 917 683 8116 - can text * > *Timezone: ET (UTC -5:00 / -4:00)* >
