Re: [FastBit-users] csv with NULLs

Michael Beauregard Tue, 06 Nov 2012 16:28:18 -0800

Thankfully I'm able to avoid that problematic situation altogether.
However, I have found what appears to be an issue with loading data from
csv that has NULL values. Here is an example:


in "nulls.csv":

1,1,1,1
2,2,,2
3,,3,3
,4,4,4

then load the data with ardea:

ardea -d tmp/nulls_test -m a:int,b:int,c:int,d:int -t nulls.csv -v 1

...yields the following surprising output:

-- begin printing table in tmp/nulls_test --
Table (on disk) T-nulls_test (tmp/nulls_test) consists of 1 partition with 4 columns and 4 rows
a INT
b INT
c INT
d INT
1, 1, 1, 1
2, 2, 2147483647, 2147483647
3, 2147483647, 2147483647, 2147483647
2147483647, 2147483647, 2147483647, 2147483647
--  end  printing table in tmp/nulls_test --

Notice that once ardea hits a NULL value, all subsequent values in that row
become NULL (printed as 2147483647), even where the csv contains a valid
value. I'm currently using svn rev 582.
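
To make the expected result concrete, here is a minimal Python sketch (not
ardea's actual parsing code) that treats each empty field as NULL on its own,
without affecting the fields after it. The names parse_int_rows and INT_NULL
are hypothetical; the sentinel value is simply the one shown in the output
above.

```python
import csv
import io

# Sentinel taken from the ardea output above; assumed here to stand for a NULL INT.
INT_NULL = 2147483647


def parse_int_rows(text):
    """Parse CSV text into rows of ints, mapping each empty field to INT_NULL.

    Every field is handled independently, so a NULL in one column
    does not change how later columns in the same row are read.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([int(field) if field.strip() else INT_NULL for field in record])
    return rows


sample = "1,1,1,1\n2,2,,2\n3,,3,3\n,4,4,4\n"
for row in parse_int_rows(sample):
    print(", ".join(str(value) for value in row))
```

With the nulls.csv contents above, this prints "2, 2, 2147483647, 2" for the
second row, "3, 2147483647, 3, 3" for the third, and "2147483647, 4, 4, 4"
for the fourth.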

Thanks,

Michael


On Fri, Nov 2, 2012 at 8:56 PM, K. John Wu <[email protected]> wrote:

> Hi, Michael,
>
> I understand your request. However, the current implementation in
> FastBit handles group-by operations by assuming all values given to
> them are valid (not NULL).  Clearly, this is a very limited option,
> but it covers a good number of use cases.
>
> It seems to me that I will need quite a bit of time to add support
> for NULL values in group-by operations.  If you have a quick way to
> get this done, please go ahead with the modification or let me know
> how to do it.
>
> John
>
>
> On 11/1/12 4:25 PM, Michael Beauregard wrote:
> > Thanks for the quick response.
> >
> > I was afraid you were going to suggest that. I think I
> > oversimplified the query, making it a poor example of what I intend to do.
> > I need something more like:
> >
> > SELECT a, b, AVG(c), AVG(d), AVG(e), AVG(f), AVG(g), AVG(h), AVG(i),
> > AVG(j) WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some other
> > constraints>
> >
> > and columns 'c' through 'j' can have NULL values at any point.
> >
> > Splitting that into 8 queries:
> >
> > SELECT a, b, AVG(c) WHERE c NOT NULL AND ...
> > SELECT a, b, AVG(d) WHERE d NOT NULL AND ...
> > ...
> >
> > means that I'd have to somehow combine those results after the fact
> > and I'd like to avoid that if possible.
> >
> > Could I perform the main query without specifying the 'c' through 'j'
> > aggregations, then use the API to compute each average on the
> > resulting table so that the non-aggregation part of the query doesn't
> > have to be repeated? Like:
> >
> > SELECT a, b WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some
> > other constraints>
> >
> > Then use the resulting table instance to compute aggregations AVG('c')
> > through AVG('j').
> >
> > Or any other suggestions?
> >
> > Michael
> >
> > On Thu, Nov 1, 2012 at 4:08 PM, K. John Wu <[email protected]> wrote:
> >> You will have to do AVG(a) and AVG(b) separately, as follows:
> >>
> >> SELECT AVG(a) WHERE a not null
> >> SELECT AVG(b) where b not null
> >>
> >> In
> >>
> >> SELECT AVG(a), AVG(b) WHERE a not null
> >>
> >> there is an implicit 'b not null' also applied, which will give you
> >> two rows instead of three as in the original example you gave.
> >>
> >> John
> >>
> >>
> >> On 11/1/12 3:57 PM, Michael Beauregard wrote:
> >>> I see. I've noticed in past conversations that people generally deal
> >>> with this by using a magic value instead of NULLs, which works fine in
> >>> simple cases like:
> >>>
> >>> SELECT AVG(a) WHERE a != <magic>
> >>>
> >>> However, I am in a situation where I need the query to return multiple
> >>> aggregations such as:
> >>>
> >>> SELECT AVG(a), AVG(b) WHERE 1 = 1
> >>>
> >>> and I would like AVG(a) to average all non-magic values of 'a' and
> >>> AVG(b) to average all non-magic values of 'b'. Any suggestions on how
> >>> to go about doing this?
> >>>
> >>> Thanks,
> >>>
> >>> Michael
> >>>
> >>> On Thu, Nov 1, 2012 at 3:48 PM, K. John Wu <[email protected]> wrote:
> >>>> Hi, Michael,
> >>>>
> >>>> The NULL values in CSV files are not ignored by ardea, but by the
> >>>> query processing code.  Basically, FastBit query results cannot
> >>>> contain NULLs, so 'SELECT a WHERE 1 = 1' will return three rows while
> >>>> 'SELECT b WHERE 1 = 1' will only return two.
> >>>>
> >>>> John
> >>>>
> >>>>
> >>>> On 11/1/12 1:55 PM, Michael Beauregard wrote:
> >>>>> Hey John,
> >>>>>
> >>>>> I'm a bit confused as to how NULLs in csv files are treated by ardea.
> >>>>> Here is an experiment I just did:
> >>>>>
> >>>>> in "nulls.csv":
> >>>>>
> >>>>> 10,1
> >>>>> 20,
> >>>>> 30,3
> >>>>>
> >>>>> then load the data with ardea:
> >>>>>
> >>>>> ardea -d tmp/nulls_test -m a:int,b:int -t nulls.csv
> >>>>>
> >>>>> then query with ibis:
> >>>>>
> >>>>> ibis.sh -d tmp/nulls_test -q 'SELECT a, b WHERE 1 = 1' -o out.txt
> >>>>>
> >>>>> cat out.txt
> >>>>> 10, 1
> >>>>> 30, 3
> >>>>>
> >>>>> For some reason the csv row with a NULL in it is ignored by ardea.
> >>>>> This is unexpected for me, but I'm wondering if this is working as
> >>>>> designed or not. If this is expected behaviour, then I'm wondering
> >>>>> how I can specify NULL values when loading data.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Michael
> >>>>> _______________________________________________
> >>>>> FastBit-users mailing list
> >>>>> [email protected]
> >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> >>>>>
>
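
The per-column averaging discussed in the quoted thread (AVG over each column
skipping only that column's NULLs, grouped by 'a' and 'b') can be sketched in
plain Python as a client-side post-processing step. This is only an
illustration under stated assumptions: it does not use the FastBit API, the
function name grouped_averages is hypothetical, the sample rows are made up,
and NULLs are represented as None.

```python
from collections import defaultdict


def grouped_averages(rows, key_cols, value_cols):
    """Average each value column per group, skipping None (NULL) per column.

    rows is a list of dicts; a NULL in one value column does not
    exclude the row from the averages of the other value columns.
    """
    # For every group key, keep a [running_sum, count] pair per value column.
    acc = defaultdict(lambda: {col: [0.0, 0] for col in value_cols})
    for row in rows:
        key = tuple(row[k] for k in key_cols)
        for col in value_cols:
            value = row[col]
            if value is not None:
                acc[key][col][0] += value
                acc[key][col][1] += 1
    # A column with no valid values in a group yields None instead of a number.
    return {
        key: {col: (total / count if count else None)
              for col, (total, count) in cols.items()}
        for key, cols in acc.items()
    }


# Hypothetical rows, as they might look after the WHERE filter has been applied.
rows = [
    {"a": 1, "b": 1, "c": 10, "d": None},
    {"a": 1, "b": 1, "c": 20, "d": 4},
    {"a": 2, "b": 1, "c": None, "d": 6},
]
result = grouped_averages(rows, key_cols=("a", "b"), value_cols=("c", "d"))
for key, averages in sorted(result.items()):
    print(key, averages)
```

Keeping a separate count per column is the point of the sketch: it avoids the
implicit "all aggregated columns not null" filter described above, at the
cost of doing the aggregation outside the query engine.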
