Re: [FastBit-users] csv with NULLs

Michael Beauregard Wed, 07 Nov 2012 16:20:38 -0800

Hey John,

I had a quick look at this problem and found a trivial fix for it. Please
have a look at the attached patch.


Thanks!

Michael


On Tue, Nov 6, 2012 at 4:27 PM, Michael Beauregard <
[email protected]> wrote:

> Thankfully I'm able to avoid that problematic situation altogether.
> However, I have found what appears to be an issue with loading data from
> csv that has NULL values. Here is an example:
>
> in "nulls.csv":
>
> 1,1,1,1
> 2,2,,2
> 3,,3,3
> ,4,4,4
>
> then load the data with ardea:
>
> ardea -d tmp/nulls_test -m a:int,b:int,c:int,d:int -t nulls.csv -v 1
>
> ...yields the following surprising output:
>
> -- begin printing table in tmp/nulls_test --
> Table (on disk) T-nulls_test (tmp/nulls_test) consists of 1 partition with
> 4 columns and 4 rows
> a INT
> b INT
> c INT
> d INT
> 1, 1, 1, 1
> 2, 2, 2147483647, 2147483647
> 3, 2147483647, 2147483647, 2147483647
> 2147483647, 2147483647, 2147483647, 2147483647
> --  end  printing table in tmp/nulls_test --
>
> Notice that once ardea hits a NULL value, all subsequent values in that
> row become NULL. I'm currently using svn rev 582.
>
> Thanks,
>
> Michael
>
>
> On Fri, Nov 2, 2012 at 8:56 PM, K. John Wu <[email protected]> wrote:
>
>> Hi, Michael,
>>
>> I understand your request, however, the current implementation in
>> FastBit deals with group by operations by assuming all values given to
>> them are valid (not NULL).  Clearly, this is a very limited option,
>> however, it covers a good number of use cases.
>>
>> It seems to me that I will need quite a bit of time to add the support
>> for NULL values in group-by operations.  If you have a quick way to
>> get this done, please go ahead with modification or let me know how to
>> do it..
>>
>> John
>>
>>
>> On 11/1/12 4:25 PM, Michael Beauregard wrote:
>> > Thanks for the quick response.
>> >
>> > I was afraid I you were going to suggest that. I think I over
>> > simplified the query making it a poor example of what I intend to do.
>> > I need something more like:
>> >
>> > SELECT a, b, AVG(c), AVG(d), AVG(e), AVG(f), AVG(g), AVG(h), AVG(i),
>> > AVG(j) WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some other
>> > constraints>
>> >
>> > and columns 'c' through 'j' can have NULL values at any point.
>> >
>> > Splitting that into 8 queries:
>> >
>> > SELECT a, b, AVG(c) WHERE c NOT NULL AND ...
>> > SELECT a, b, AVG(d) WHERE d NOT NULL AND ...
>> > ...
>> >
>> > means that I'd have to somehow combine those results after the fact
>> > and I'd like to avoid that if possible.
>> >
>> > Could I perform the main query without specifying 'c' through 'j'
>> > aggregations, then using the api to compute each average on the
>> > resulting table so that the non-aggregation part of the query doesn't
>> > have to be repeated? Like:
>> >
>> > SELECT a, b WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some
>> > other constraints>
>> >
>> > Then use the resulting table instance to compute aggregations AVG('c')
>> > through AVG('j').
>> >
>> > Or any other suggestions?
>> >
>> > Michael
>> >
>> > On Thu, Nov 1, 2012 at 4:08 PM, K. John Wu <[email protected]> wrote:
>> >> You will have to do AVG(a) and AVG(b) separately as follows
>> >>
>> >> SELECT AVG(a) WHERE a not null
>> >> SELECT AVG(b) where b not null
>> >>
>> >> In
>> >>
>> >> SELECT AVG(a), AVG(b) WHERE a not null
>> >>
>> >> there is an implicit 'b not null' also applied, which will give you
>> >> two rows instead of three as in the original example you give.
>> >>
>> >> John
>> >>
>> >>
>> >> On 11/1/12 3:57 PM, Michael Beauregard wrote:
>> >>> I see. I've noticed in past conversations that people generally deal
>> >>> with this by using a magic value instead of NULLs. Which works fine in
>> >>> simple cases like:
>> >>>
>> >>> SELECT AVG(a) WHERE a != <magic>
>> >>>
>> >>> However, I am in a situation where I need the query to return multiple
>> >>> aggregations such as:
>> >>>
>> >>> SELECT AVG(a), AVG(b) WHERE 1 = 1
>> >>>
>> >>> and I would like AVG(a) to average all non-magic values of 'a' and
>> >>> AVG(b) to average all non-magic values of 'b'. Any suggestions on now
>> >>> to go about doing this?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Michael
>> >>>
>> >>> On Thu, Nov 1, 2012 at 3:48 PM, K. John Wu <[email protected]> wrote:
>> >>>> Hi, Michael,
>> >>>>
>> >>>> The NULL values in CSV fiels are not ignored by ardea, but by the
>> >>>> query processing code.  Basically, FastBit query results can not
>> >>>> contain NULLs, so  'SELECT a WHERE 1 = 1' will return three rows
>> while
>> >>>> 'SELECT b WHERE 1 = 1' will only return two.
>> >>>>
>> >>>> John
>> >>>>
>> >>>>
>> >>>> On 11/1/12 1:55 PM, Michael Beauregard wrote:
>> >>>>> Hey John,
>> >>>>>
>> >>>>> I'm a bit confused as to how NULLs in csv files are treated by
>> ardea.
>> >>>>> Here is an experiment I just did:
>> >>>>>
>> >>>>> in "nulls.csv":
>> >>>>>
>> >>>>> 10,1
>> >>>>> 20,
>> >>>>> 30,3
>> >>>>>
>> >>>>> then load the data with ardea:
>> >>>>>
>> >>>>> ardea -d tmp/nulls_test -m a:int,b:int -t nulls.csv
>> >>>>>
>> >>>>> then query with ibis:
>> >>>>>
>> >>>>> ibis.sh -d tmp/nulls_test -q 'SELECT a, b WHERE 1 = 1' -o out.txt
>> >>>>>
>> >>>>> cat out.txt
>> >>>>> 10, 1
>> >>>>> 30, 3
>> >>>>>
>> >>>>> For some reason the csv row with a NULL in it is ignored by ardea.
>> >>>>> This is unexpected for me, but I'm wondering if this is working as
>> >>>>> designed or not. If this is expected behaviour, then I'm wondering
>> how
>> >>>>> I can specify NULL values when loading data.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> Michael
>> >>>>> _______________________________________________
>> >>>>> FastBit-users mailing list
>> >>>>> [email protected]
>> >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> >>>>>
>>
>
>

csv-nulls.patch
Description: Binary data

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] csv with NULLs

Reply via email to