Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Khurram Faraaz Sat, 29 Oct 2016 13:43:05 -0700

Done DRILL-3214 <https://issues.apache.org/jira/browse/DRILL-3214> is now
re-opened.


On Sun, Oct 30, 2016 at 1:00 AM, Andries Engelbrecht <
[email protected]> wrote:

> The JIRA should be reopened, as the point is for the reader to assign null
> values to empty fields, which should then have the functions interpret it
> as null values as well. I forgot about this option till Veera brought it up.
>
> --Andries
>
> > On Oct 29, 2016, at 11:57 AM, Khurram Faraaz <[email protected]>
> wrote:
> >
> > All,
> >
> > Can we please have functions in Drill like (TO_DATE, TO_TIME,
> TO_TIMESTAMP
> > etc.) handle empty fields and produce a null value.
> >
> > DRILL-3214 was marked as Resolved, I don't think that issue is Fixed.
> >
> > Thanks,
> > Khurram
> >
> > On Sat, Oct 29, 2016 at 2:42 AM, Veera Naranammalpuram <
> > [email protected]> wrote:
> >
> >> I would expect it to work. You should just reopen DRILL-3214. You have
> >> already created one for this.
> >>
> >> -Veera
> >>
> >> On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht <
> >> [email protected]> wrote:
> >>
> >>> You want to use MM for month and not mm for minute as imm can produce
> the
> >>> wornd result.
> >>>
> >>> Probably best to file an enhancement JIRA to have the function handle
> >>> empty fields and produce a null value. Then the wider audience can
> review
> >>> the merit for implementation.
> >>>
> >>> --Andries
> >>>
> >>>
> >>>> On Oct 28, 2016, at 9:09 AM, Khurram Faraaz <[email protected]>
> >>> wrote:
> >>>>
> >>>> Thanks Andries and Veera.
> >>>>
> >>>> 1. Yes, my CSV file does have empty strings in some rows in
> columns[4].
> >>>> 2. it worked for parquet because I had used the case expression to
> cast
> >>>> empty strings to NULL.
> >>>> 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned
> >>> results
> >>>> with both representations.
> >>>>
> >>>> Question - Shouldn't Drill handle such empty strings that are within
> >> rows
> >>>> in CSV files ?
> >>>>                Why should user have to take care such cases ?
> >>>>
> >>>> Regards,
> >>>> Khurram
> >>>>
> >>>> On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> That should work and a lot faster too. Thanks for the pointer.
> >>>>>
> >>>>> -Veera
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Good catch on empty string Veera!
> >>>>>>
> >>>>>> Wouldn't it be cheaper to check for an empty string?
> >>>>>> case when columns[] ='' then null else
> to_date(columns[],'yyyy-MM-dd'
> >> )
> >>>>> end
> >>>>>>
> >>>>>> I don't think the option to read csv empty columns (or empty string
> >> in
> >>>>> any
> >>>>>> text reader) as null is in the reader yet. So we can't check with
> >>>>> columns[]
> >>>>>> is null.
> >>>>>>
> >>>>>>
> >>>>>> --Andries
> >>>>>>
> >>>>>>
> >>>>>>> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> >>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>> Do you have zero length strings in your data? I have seen cases
> >> where
> >>>>> the
> >>>>>>> system option to cast empty strings to NULL doesn't work as
> >>> advertised.
> >>>>>> You
> >>>>>>> should re-open DRILL-3214.
> >>>>>>>
> >>>>>>> When I run into this problem, I usually use a regex to workaround.
> >> The
> >>>>>>> PROJECT takes a performance hit when you do this for larger data
> >> sets
> >>>>> but
> >>>>>>> it works.
> >>>>>>>
> >>>>>>> $cat nulls.psv
> >>>>>>> date_col|string_col
> >>>>>>> |test
> >>>>>>> 2016-10-28|test2
> >>>>>>> $ sqlline
> >>>>>>> apache drill 1.8.0
> >>>>>>> "a little sql for your nosql"
> >>>>>>> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> >>>>>>> +-------------+-------------+
> >>>>>>> |  date_col   | string_col  |
> >>>>>>> +-------------+-------------+
> >>>>>>> |             | test        |
> >>>>>>> | 2016-10-28  | test2       |
> >>>>>>> +-------------+-------------+
> >>>>>>> 2 rows selected (0.303 seconds)
> >>>>>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from
> >>> `nulls.psv`;
> >>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>>>>
> >>>>>>> Fragment 0:0
> >>>>>>>
> >>>>>>> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> >>>>>> se-node10.se.lab:31010]
> >>>>>>> (state=,code=0)
> >>>>>>> 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%'
> then
> >>>>>>> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted
> >>> from
> >>>>>>> `nulls.psv`;
> >>>>>>> +---------------------+
> >>>>>>> | date_col_converted  |
> >>>>>>> +---------------------+
> >>>>>>> | null                |
> >>>>>>> | 2016-10-28          |
> >>>>>>> +---------------------+
> >>>>>>> 2 rows selected (0.521 seconds)
> >>>>>>> 0: jdbc:drill:> alter system set
> >>>>>>> `drill.exec.functions.cast_empty_string_to_null` = true;
> >>>>>>> +-------+---------------------------------------------------
> >> -------+
> >>>>>>> |  ok   |                         summary
> |
> >>>>>>> +-------+---------------------------------------------------
> >> -------+
> >>>>>>> | true  | drill.exec.functions.cast_empty_string_to_null updated.
> >> |
> >>>>>>> +-------+---------------------------------------------------
> >> -------+
> >>>>>>> 1 row selected (0.304 seconds)
> >>>>>>> 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> >>>>> `nulls.psv`;
> >>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>>>>
> >>>>>>> Fragment 0:0
> >>>>>>>
> >>>>>>> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> >>>>>> se-node10.se.lab:31010]
> >>>>>>> (state=,code=0)
> >>>>>>> 0: jdbc:drill:>
> >>>>>>>
> >>>>>>> -Veera
> >>>>>>>
> >>>>>>> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <
> >> [email protected]
> >>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> All,
> >>>>>>>>
> >>>>>>>> Question is - why does it work for a parquet column and fails when
> >>> CSV
> >>>>>>>> column is used ?
> >>>>>>>>
> >>>>>>>> Drill 1.9.0 commit : a29f1e29
> >>>>>>>>
> >>>>>>>> This is a simple project of column from a csv file, works.
> >>>>>>>> {noformat}
> >>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM
> >> `typeall_l.csv`
> >>>>> t1
> >>>>>>>> limit 5;
> >>>>>>>> +-------------+
> >>>>>>>> |   EXPR$0    |
> >>>>>>>> +-------------+
> >>>>>>>> | 2011-11-04  |
> >>>>>>>> | 1986-10-22  |
> >>>>>>>> | 1992-09-10  |
> >>>>>>>> | 2016-08-07  |
> >>>>>>>> | 1986-01-25  |
> >>>>>>>> +-------------+
> >>>>>>>> 5 rows selected (0.26 seconds)
> >>>>>>>> {noformat}
> >>>>>>>>
> >>>>>>>> Using TO_DATE function with columns[x] as first input fails, with
> >> an
> >>>>>>>> IllegalArgumentException
> >>>>>>>> {noformat}
> >>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-
> >>> dd')
> >>>>>> FROM
> >>>>>>>> `typeall_l.csv` t1 limit 5;
> >>>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>>>>>>>
> >>>>>>>> Fragment 0:0
> >>>>>>>>
> >>>>>>>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> >>>>>> centos-01.qa.lab:31010]
> >>>>>>>> (state=,code=0)
> >>>>>>>> {noformat}
> >>>>>>>>
> >>>>>>>> However, interestingly same query over parquet column returns
> >> correct
> >>>>>>>> results, on same data.
> >>>>>>>>
> >>>>>>>> {noformat}
> >>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd')
> >>> FROM
> >>>>>>>> typeall_l limit 5;
> >>>>>>>> +-------------+
> >>>>>>>> |   EXPR$0    |
> >>>>>>>> +-------------+
> >>>>>>>> | 2011-01-04  |
> >>>>>>>> | 1986-01-22  |
> >>>>>>>> | 1992-01-10  |
> >>>>>>>> | 2016-01-07  |
> >>>>>>>> | 1986-01-25  |
> >>>>>>>> +-------------+
> >>>>>>>> 5 rows selected (0.286 seconds)
> >>>>>>>> {noformat}
> >>>>>>>>
> >>>>>>>> When the date string is passed as first input, to_date function
> >>>>> returns
> >>>>>>>> correct results.
> >>>>>>>> {noformat}
> >>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select
> to_date('2011-01-04','yyyy-mm-
> >>>>> dd')
> >>>>>>>> from (values(1));
> >>>>>>>> +-------------+
> >>>>>>>> |   EXPR$0    |
> >>>>>>>> +-------------+
> >>>>>>>> | 2011-01-04  |
> >>>>>>>> +-------------+
> >>>>>>>> 1 row selected (0.235 seconds)
> >>>>>>>> {noformat}
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Khurram
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Veera Naranammalpuram
> >>>>>>> Product Specialist - SQL on Hadoop
> >>>>>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >>>>>>> *(Email) [email protected] <
> >> [email protected]>
> >>> *
> >>>>>>> *(Mobile) 917 683 8116 - can text *
> >>>>>>> *Timezone: ET (UTC -5:00 / -4:00)*
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Veera Naranammalpuram
> >>>>> Product Specialist - SQL on Hadoop
> >>>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >>>>> *(Email) [email protected] <[email protected]
> >
> >> *
> >>>>> *(Mobile) 917 683 8116 - can text *
> >>>>> *Timezone: ET (UTC -5:00 / -4:00)*
> >>>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Veera Naranammalpuram
> >> Product Specialist - SQL on Hadoop
> >> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >> *(Email) [email protected] <[email protected]>*
> >> *(Mobile) 917 683 8116 - can text *
> >> *Timezone: ET (UTC -5:00 / -4:00)*
> >>
>
>

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Reply via email to