Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Veera Naranammalpuram Fri, 28 Oct 2016 14:12:59 -0700

I would expect it to work. You should just reopen DRILL-3214. You have
already created one for this.


-Veera

On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht <
[email protected]> wrote:

> You want to use MM for month and not mm for minute as imm can produce the
> wornd result.
>
> Probably best to file an enhancement JIRA to have the function handle
> empty fields and produce a null value. Then the wider audience can review
> the merit for implementation.
>
> --Andries
>
>
> > On Oct 28, 2016, at 9:09 AM, Khurram Faraaz <[email protected]>
> wrote:
> >
> > Thanks Andries and Veera.
> >
> > 1. Yes, my CSV file does have empty strings in some rows in columns[4].
> > 2. it worked for parquet because I had used the case expression to cast
> > empty strings to NULL.
> > 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned
> results
> > with both representations.
> >
> > Question - Shouldn't Drill handle such empty strings that are within rows
> > in CSV files ?
> >                 Why should user have to take care such cases ?
> >
> > Regards,
> > Khurram
> >
> > On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
> > [email protected]> wrote:
> >
> >> That should work and a lot faster too. Thanks for the pointer.
> >>
> >> -Veera
> >>
> >> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> >> [email protected]> wrote:
> >>
> >>> Good catch on empty string Veera!
> >>>
> >>> Wouldn't it be cheaper to check for an empty string?
> >>> case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd')
> >> end
> >>>
> >>> I don't think the option to read csv empty columns (or empty string in
> >> any
> >>> text reader) as null is in the reader yet. So we can't check with
> >> columns[]
> >>> is null.
> >>>
> >>>
> >>> --Andries
> >>>
> >>>
> >>>> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> >>> [email protected]> wrote:
> >>>>
> >>>> Do you have zero length strings in your data? I have seen cases where
> >> the
> >>>> system option to cast empty strings to NULL doesn't work as
> advertised.
> >>> You
> >>>> should re-open DRILL-3214.
> >>>>
> >>>> When I run into this problem, I usually use a regex to workaround. The
> >>>> PROJECT takes a performance hit when you do this for larger data sets
> >> but
> >>>> it works.
> >>>>
> >>>> $cat nulls.psv
> >>>> date_col|string_col
> >>>> |test
> >>>> 2016-10-28|test2
> >>>> $ sqlline
> >>>> apache drill 1.8.0
> >>>> "a little sql for your nosql"
> >>>> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> >>>> +-------------+-------------+
> >>>> |  date_col   | string_col  |
> >>>> +-------------+-------------+
> >>>> |             | test        |
> >>>> | 2016-10-28  | test2       |
> >>>> +-------------+-------------+
> >>>> 2 rows selected (0.303 seconds)
> >>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from
> `nulls.psv`;
> >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>
> >>>> Fragment 0:0
> >>>>
> >>>> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> >>> se-node10.se.lab:31010]
> >>>> (state=,code=0)
> >>>> 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> >>>> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted
> from
> >>>> `nulls.psv`;
> >>>> +---------------------+
> >>>> | date_col_converted  |
> >>>> +---------------------+
> >>>> | null                |
> >>>> | 2016-10-28          |
> >>>> +---------------------+
> >>>> 2 rows selected (0.521 seconds)
> >>>> 0: jdbc:drill:> alter system set
> >>>> `drill.exec.functions.cast_empty_string_to_null` = true;
> >>>> +-------+----------------------------------------------------------+
> >>>> |  ok   |                         summary                          |
> >>>> +-------+----------------------------------------------------------+
> >>>> | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> >>>> +-------+----------------------------------------------------------+
> >>>> 1 row selected (0.304 seconds)
> >>>> 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> >> `nulls.psv`;
> >>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>
> >>>> Fragment 0:0
> >>>>
> >>>> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> >>> se-node10.se.lab:31010]
> >>>> (state=,code=0)
> >>>> 0: jdbc:drill:>
> >>>>
> >>>> -Veera
> >>>>
> >>>> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <[email protected]
> >
> >>>> wrote:
> >>>>
> >>>>> All,
> >>>>>
> >>>>> Question is - why does it work for a parquet column and fails when
> CSV
> >>>>> column is used ?
> >>>>>
> >>>>> Drill 1.9.0 commit : a29f1e29
> >>>>>
> >>>>> This is a simple project of column from a csv file, works.
> >>>>> {noformat}
> >>>>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv`
> >> t1
> >>>>> limit 5;
> >>>>> +-------------+
> >>>>> |   EXPR$0    |
> >>>>> +-------------+
> >>>>> | 2011-11-04  |
> >>>>> | 1986-10-22  |
> >>>>> | 1992-09-10  |
> >>>>> | 2016-08-07  |
> >>>>> | 1986-01-25  |
> >>>>> +-------------+
> >>>>> 5 rows selected (0.26 seconds)
> >>>>> {noformat}
> >>>>>
> >>>>> Using TO_DATE function with columns[x] as first input fails, with an
> >>>>> IllegalArgumentException
> >>>>> {noformat}
> >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-
> dd')
> >>> FROM
> >>>>> `typeall_l.csv` t1 limit 5;
> >>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>>
> >>>>> Fragment 0:0
> >>>>>
> >>>>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> >>> centos-01.qa.lab:31010]
> >>>>> (state=,code=0)
> >>>>> {noformat}
> >>>>>
> >>>>> However, interestingly same query over parquet column returns correct
> >>>>> results, on same data.
> >>>>>
> >>>>> {noformat}
> >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd')
> FROM
> >>>>> typeall_l limit 5;
> >>>>> +-------------+
> >>>>> |   EXPR$0    |
> >>>>> +-------------+
> >>>>> | 2011-01-04  |
> >>>>> | 1986-01-22  |
> >>>>> | 1992-01-10  |
> >>>>> | 2016-01-07  |
> >>>>> | 1986-01-25  |
> >>>>> +-------------+
> >>>>> 5 rows selected (0.286 seconds)
> >>>>> {noformat}
> >>>>>
> >>>>> When the date string is passed as first input, to_date function
> >> returns
> >>>>> correct results.
> >>>>> {noformat}
> >>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-
> >> dd')
> >>>>> from (values(1));
> >>>>> +-------------+
> >>>>> |   EXPR$0    |
> >>>>> +-------------+
> >>>>> | 2011-01-04  |
> >>>>> +-------------+
> >>>>> 1 row selected (0.235 seconds)
> >>>>> {noformat}
> >>>>>
> >>>>> Thanks,
> >>>>> Khurram
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Veera Naranammalpuram
> >>>> Product Specialist - SQL on Hadoop
> >>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >>>> *(Email) [email protected] <[email protected]>
> *
> >>>> *(Mobile) 917 683 8116 - can text *
> >>>> *Timezone: ET (UTC -5:00 / -4:00)*
> >>>
> >>>
> >>
> >>
> >> --
> >> Veera Naranammalpuram
> >> Product Specialist - SQL on Hadoop
> >> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >> *(Email) [email protected] <[email protected]>*
> >> *(Mobile) 917 683 8116 - can text *
> >> *Timezone: ET (UTC -5:00 / -4:00)*
> >>
>
>


-- 
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
*MapR Technologies (www.mapr.com <http://www.mapr.com>)*
*(Email) [email protected] <[email protected]>*
*(Mobile) 917 683 8116 - can text *
*Timezone: ET (UTC -5:00 / -4:00)*

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Reply via email to