Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Andries Engelbrecht Sat, 29 Oct 2016 12:32:01 -0700

The JIRA should be reopened, as the point is for the reader to assign null 
values to empty fields, which should then have the functions interpret it as 
null values as well. I forgot about this option till Veera brought it up.


--Andries

> On Oct 29, 2016, at 11:57 AM, Khurram Faraaz <[email protected]> wrote:
> 
> All,
> 
> Can we please have functions in Drill like (TO_DATE, TO_TIME, TO_TIMESTAMP
> etc.) handle empty fields and produce a null value.
> 
> DRILL-3214 was marked as Resolved, I don't think that issue is Fixed.
> 
> Thanks,
> Khurram
> 
> On Sat, Oct 29, 2016 at 2:42 AM, Veera Naranammalpuram <
> [email protected]> wrote:
> 
>> I would expect it to work. You should just reopen DRILL-3214. You have
>> already created one for this.
>> 
>> -Veera
>> 
>> On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht <
>> [email protected]> wrote:
>> 
>>> You want to use MM for month and not mm for minute as imm can produce the
>>> wornd result.
>>> 
>>> Probably best to file an enhancement JIRA to have the function handle
>>> empty fields and produce a null value. Then the wider audience can review
>>> the merit for implementation.
>>> 
>>> --Andries
>>> 
>>> 
>>>> On Oct 28, 2016, at 9:09 AM, Khurram Faraaz <[email protected]>
>>> wrote:
>>>> 
>>>> Thanks Andries and Veera.
>>>> 
>>>> 1. Yes, my CSV file does have empty strings in some rows in columns[4].
>>>> 2. it worked for parquet because I had used the case expression to cast
>>>> empty strings to NULL.
>>>> 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned
>>> results
>>>> with both representations.
>>>> 
>>>> Question - Shouldn't Drill handle such empty strings that are within
>> rows
>>>> in CSV files ?
>>>>                Why should user have to take care such cases ?
>>>> 
>>>> Regards,
>>>> Khurram
>>>> 
>>>> On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
>>>> [email protected]> wrote:
>>>> 
>>>>> That should work and a lot faster too. Thanks for the pointer.
>>>>> 
>>>>> -Veera
>>>>> 
>>>>> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Good catch on empty string Veera!
>>>>>> 
>>>>>> Wouldn't it be cheaper to check for an empty string?
>>>>>> case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd'
>> )
>>>>> end
>>>>>> 
>>>>>> I don't think the option to read csv empty columns (or empty string
>> in
>>>>> any
>>>>>> text reader) as null is in the reader yet. So we can't check with
>>>>> columns[]
>>>>>> is null.
>>>>>> 
>>>>>> 
>>>>>> --Andries
>>>>>> 
>>>>>> 
>>>>>>> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>> Do you have zero length strings in your data? I have seen cases
>> where
>>>>> the
>>>>>>> system option to cast empty strings to NULL doesn't work as
>>> advertised.
>>>>>> You
>>>>>>> should re-open DRILL-3214.
>>>>>>> 
>>>>>>> When I run into this problem, I usually use a regex to workaround.
>> The
>>>>>>> PROJECT takes a performance hit when you do this for larger data
>> sets
>>>>> but
>>>>>>> it works.
>>>>>>> 
>>>>>>> $cat nulls.psv
>>>>>>> date_col|string_col
>>>>>>> |test
>>>>>>> 2016-10-28|test2
>>>>>>> $ sqlline
>>>>>>> apache drill 1.8.0
>>>>>>> "a little sql for your nosql"
>>>>>>> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
>>>>>>> +-------------+-------------+
>>>>>>> |  date_col   | string_col  |
>>>>>>> +-------------+-------------+
>>>>>>> |             | test        |
>>>>>>> | 2016-10-28  | test2       |
>>>>>>> +-------------+-------------+
>>>>>>> 2 rows selected (0.303 seconds)
>>>>>>> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from
>>> `nulls.psv`;
>>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
>>>>>>> 
>>>>>>> Fragment 0:0
>>>>>>> 
>>>>>>> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
>>>>>> se-node10.se.lab:31010]
>>>>>>> (state=,code=0)
>>>>>>> 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
>>>>>>> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted
>>> from
>>>>>>> `nulls.psv`;
>>>>>>> +---------------------+
>>>>>>> | date_col_converted  |
>>>>>>> +---------------------+
>>>>>>> | null                |
>>>>>>> | 2016-10-28          |
>>>>>>> +---------------------+
>>>>>>> 2 rows selected (0.521 seconds)
>>>>>>> 0: jdbc:drill:> alter system set
>>>>>>> `drill.exec.functions.cast_empty_string_to_null` = true;
>>>>>>> +-------+---------------------------------------------------
>> -------+
>>>>>>> |  ok   |                         summary                          |
>>>>>>> +-------+---------------------------------------------------
>> -------+
>>>>>>> | true  | drill.exec.functions.cast_empty_string_to_null updated.
>> |
>>>>>>> +-------+---------------------------------------------------
>> -------+
>>>>>>> 1 row selected (0.304 seconds)
>>>>>>> 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
>>>>> `nulls.psv`;
>>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
>>>>>>> 
>>>>>>> Fragment 0:0
>>>>>>> 
>>>>>>> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
>>>>>> se-node10.se.lab:31010]
>>>>>>> (state=,code=0)
>>>>>>> 0: jdbc:drill:>
>>>>>>> 
>>>>>>> -Veera
>>>>>>> 
>>>>>>> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <
>> [email protected]
>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> All,
>>>>>>>> 
>>>>>>>> Question is - why does it work for a parquet column and fails when
>>> CSV
>>>>>>>> column is used ?
>>>>>>>> 
>>>>>>>> Drill 1.9.0 commit : a29f1e29
>>>>>>>> 
>>>>>>>> This is a simple project of column from a csv file, works.
>>>>>>>> {noformat}
>>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM
>> `typeall_l.csv`
>>>>> t1
>>>>>>>> limit 5;
>>>>>>>> +-------------+
>>>>>>>> |   EXPR$0    |
>>>>>>>> +-------------+
>>>>>>>> | 2011-11-04  |
>>>>>>>> | 1986-10-22  |
>>>>>>>> | 1992-09-10  |
>>>>>>>> | 2016-08-07  |
>>>>>>>> | 1986-01-25  |
>>>>>>>> +-------------+
>>>>>>>> 5 rows selected (0.26 seconds)
>>>>>>>> {noformat}
>>>>>>>> 
>>>>>>>> Using TO_DATE function with columns[x] as first input fails, with
>> an
>>>>>>>> IllegalArgumentException
>>>>>>>> {noformat}
>>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-
>>> dd')
>>>>>> FROM
>>>>>>>> `typeall_l.csv` t1 limit 5;
>>>>>>>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
>>>>>>>> 
>>>>>>>> Fragment 0:0
>>>>>>>> 
>>>>>>>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
>>>>>> centos-01.qa.lab:31010]
>>>>>>>> (state=,code=0)
>>>>>>>> {noformat}
>>>>>>>> 
>>>>>>>> However, interestingly same query over parquet column returns
>> correct
>>>>>>>> results, on same data.
>>>>>>>> 
>>>>>>>> {noformat}
>>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd')
>>> FROM
>>>>>>>> typeall_l limit 5;
>>>>>>>> +-------------+
>>>>>>>> |   EXPR$0    |
>>>>>>>> +-------------+
>>>>>>>> | 2011-01-04  |
>>>>>>>> | 1986-01-22  |
>>>>>>>> | 1992-01-10  |
>>>>>>>> | 2016-01-07  |
>>>>>>>> | 1986-01-25  |
>>>>>>>> +-------------+
>>>>>>>> 5 rows selected (0.286 seconds)
>>>>>>>> {noformat}
>>>>>>>> 
>>>>>>>> When the date string is passed as first input, to_date function
>>>>> returns
>>>>>>>> correct results.
>>>>>>>> {noformat}
>>>>>>>> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-
>>>>> dd')
>>>>>>>> from (values(1));
>>>>>>>> +-------------+
>>>>>>>> |   EXPR$0    |
>>>>>>>> +-------------+
>>>>>>>> | 2011-01-04  |
>>>>>>>> +-------------+
>>>>>>>> 1 row selected (0.235 seconds)
>>>>>>>> {noformat}
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Khurram
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Veera Naranammalpuram
>>>>>>> Product Specialist - SQL on Hadoop
>>>>>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>>>>>>> *(Email) [email protected] <
>> [email protected]>
>>> *
>>>>>>> *(Mobile) 917 683 8116 - can text *
>>>>>>> *Timezone: ET (UTC -5:00 / -4:00)*
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Veera Naranammalpuram
>>>>> Product Specialist - SQL on Hadoop
>>>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>>>>> *(Email) [email protected] <[email protected]>
>> *
>>>>> *(Mobile) 917 683 8116 - can text *
>>>>> *Timezone: ET (UTC -5:00 / -4:00)*
>>>>> 
>>> 
>>> 
>> 
>> 
>> --
>> Veera Naranammalpuram
>> Product Specialist - SQL on Hadoop
>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>> *(Email) [email protected] <[email protected]>*
>> *(Mobile) 917 683 8116 - can text *
>> *Timezone: ET (UTC -5:00 / -4:00)*
>>

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

Reply via email to