Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Khurram Faraaz Sat, 29 Oct 2016 11:58:20 -0700

All,

Can we please have functions in Drill like (TO_DATE, TO_TIME, TO_TIMESTAMP
etc.) handle empty fields and produce a null value.


DRILL-3214 was marked as Resolved, I don't think that issue is Fixed.

Thanks,
Khurram

On Sat, Oct 29, 2016 at 2:42 AM, Veera Naranammalpuram <
[email protected]> wrote:

> I would expect it to work. You should just reopen DRILL-3214. You have
> already created one for this.
>
> -Veera
>
> On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht <
> [email protected]> wrote:
>
> > You want to use MM for month and not mm for minute as imm can produce the
> > wornd result.
> >
> > Probably best to file an enhancement JIRA to have the function handle
> > empty fields and produce a null value. Then the wider audience can review
> > the merit for implementation.
> >
> > --Andries
> >
> >
> > > On Oct 28, 2016, at 9:09 AM, Khurram Faraaz <[email protected]>
> > wrote:
> > >
> > > Thanks Andries and Veera.
> > >
> > > 1. Yes, my CSV file does have empty strings in some rows in columns[4].
> > > 2. it worked for parquet because I had used the case expression to cast
> > > empty strings to NULL.
> > > 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned
> > results
> > > with both representations.
> > >
> > > Question - Shouldn't Drill handle such empty strings that are within
> rows
> > > in CSV files ?
> > >                 Why should user have to take care such cases ?
> > >
> > > Regards,
> > > Khurram
> > >
> > > On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
> > > [email protected]> wrote:
> > >
> > >> That should work and a lot faster too. Thanks for the pointer.
> > >>
> > >> -Veera
> > >>
> > >> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> > >> [email protected]> wrote:
> > >>
> > >>> Good catch on empty string Veera!
> > >>>
> > >>> Wouldn't it be cheaper to check for an empty string?
> > >>> case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd'
> )
> > >> end
> > >>>
> > >>> I don't think the option to read csv empty columns (or empty string
> in
> > >> any
> > >>> text reader) as null is in the reader yet. So we can't check with
> > >> columns[]
> > >>> is null.
> > >>>
> > >>>
> > >>> --Andries
> > >>>
> > >>>
> > >>>> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> > >>> [email protected]> wrote:
> > >>>>
> > >>>> Do you have zero length strings in your data? I have seen cases
> where
> > >> the
> > >>>> system option to cast empty strings to NULL doesn't work as
> > advertised.
> > >>> You
> > >>>> should re-open DRILL-3214.
> > >>>>
> > >>>> When I run into this problem, I usually use a regex to workaround.
> The
> > >>>> PROJECT takes a performance hit when you do this for larger data
> sets
> > >> but
> > >>>> it works.
> > >>>>
> > >>>> $cat nulls.psv
> > >>>> date_col|string_col
> > >>>> |test
> > >>>> 2016-10-28|test2
> > >>>> $ sqlline
> > >>>> apache drill 1.8.0
> > >>>> "a little sql for your nosql"
> > >>>> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> > >>>> +-------------+-------------+
> > >>>> |  date_col   | string_col  |
> > >>>> +-------------+-------------+
> > >>>> |             | test        |
> > >>>> | 2016-10-28  | test2       |
> > >>>> +-------------+-------------+
> > >>>> 2 rows selected (0.303 seconds)
> > >>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from
> > `nulls.psv`;
> > >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >>>>
> > >>>> Fragment 0:0
> > >>>>
> > >>>> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> > >>> se-node10.se.lab:31010]
> > >>>> (state=,code=0)
> > >>>> 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> > >>>> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted
> > from
> > >>>> `nulls.psv`;
> > >>>> +---------------------+
> > >>>> | date_col_converted  |
> > >>>> +---------------------+
> > >>>> | null                |
> > >>>> | 2016-10-28          |
> > >>>> +---------------------+
> > >>>> 2 rows selected (0.521 seconds)
> > >>>> 0: jdbc:drill:> alter system set
> > >>>> `drill.exec.functions.cast_empty_string_to_null` = true;
> > >>>> +-------+---------------------------------------------------
> -------+
> > >>>> |  ok   |                         summary                          |
> > >>>> +-------+---------------------------------------------------
> -------+
> > >>>> | true  | drill.exec.functions.cast_empty_string_to_null updated.
> |
> > >>>> +-------+---------------------------------------------------
> -------+
> > >>>> 1 row selected (0.304 seconds)
> > >>>> 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> > >> `nulls.psv`;
> > >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >>>>
> > >>>> Fragment 0:0
> > >>>>
> > >>>> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> > >>> se-node10.se.lab:31010]
> > >>>> (state=,code=0)
> > >>>> 0: jdbc:drill:>
> > >>>>
> > >>>> -Veera
> > >>>>
> > >>>> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <
> [email protected]
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> All,
> > >>>>>
> > >>>>> Question is - why does it work for a parquet column and fails when
> > CSV
> > >>>>> column is used ?
> > >>>>>
> > >>>>> Drill 1.9.0 commit : a29f1e29
> > >>>>>
> > >>>>> This is a simple project of column from a csv file, works.
> > >>>>> {noformat}
> > >>>>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM
> `typeall_l.csv`
> > >> t1
> > >>>>> limit 5;
> > >>>>> +-------------+
> > >>>>> |   EXPR$0    |
> > >>>>> +-------------+
> > >>>>> | 2011-11-04  |
> > >>>>> | 1986-10-22  |
> > >>>>> | 1992-09-10  |
> > >>>>> | 2016-08-07  |
> > >>>>> | 1986-01-25  |
> > >>>>> +-------------+
> > >>>>> 5 rows selected (0.26 seconds)
> > >>>>> {noformat}
> > >>>>>
> > >>>>> Using TO_DATE function with columns[x] as first input fails, with
> an
> > >>>>> IllegalArgumentException
> > >>>>> {noformat}
> > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-
> > dd')
> > >>> FROM
> > >>>>> `typeall_l.csv` t1 limit 5;
> > >>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >>>>>
> > >>>>> Fragment 0:0
> > >>>>>
> > >>>>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> > >>> centos-01.qa.lab:31010]
> > >>>>> (state=,code=0)
> > >>>>> {noformat}
> > >>>>>
> > >>>>> However, interestingly same query over parquet column returns
> correct
> > >>>>> results, on same data.
> > >>>>>
> > >>>>> {noformat}
> > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd')
> > FROM
> > >>>>> typeall_l limit 5;
> > >>>>> +-------------+
> > >>>>> |   EXPR$0    |
> > >>>>> +-------------+
> > >>>>> | 2011-01-04  |
> > >>>>> | 1986-01-22  |
> > >>>>> | 1992-01-10  |
> > >>>>> | 2016-01-07  |
> > >>>>> | 1986-01-25  |
> > >>>>> +-------------+
> > >>>>> 5 rows selected (0.286 seconds)
> > >>>>> {noformat}
> > >>>>>
> > >>>>> When the date string is passed as first input, to_date function
> > >> returns
> > >>>>> correct results.
> > >>>>> {noformat}
> > >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-
> > >> dd')
> > >>>>> from (values(1));
> > >>>>> +-------------+
> > >>>>> |   EXPR$0    |
> > >>>>> +-------------+
> > >>>>> | 2011-01-04  |
> > >>>>> +-------------+
> > >>>>> 1 row selected (0.235 seconds)
> > >>>>> {noformat}
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Khurram
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Veera Naranammalpuram
> > >>>> Product Specialist - SQL on Hadoop
> > >>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> > >>>> *(Email) [email protected] <
> [email protected]>
> > *
> > >>>> *(Mobile) 917 683 8116 - can text *
> > >>>> *Timezone: ET (UTC -5:00 / -4:00)*
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Veera Naranammalpuram
> > >> Product Specialist - SQL on Hadoop
> > >> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> > >> *(Email) [email protected] <[email protected]>
> *
> > >> *(Mobile) 917 683 8116 - can text *
> > >> *Timezone: ET (UTC -5:00 / -4:00)*
> > >>
> >
> >
>
>
> --
> Veera Naranammalpuram
> Product Specialist - SQL on Hadoop
> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> *(Email) [email protected] <[email protected]>*
> *(Mobile) 917 683 8116 - can text *
> *Timezone: ET (UTC -5:00 / -4:00)*
>

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Reply via email to