Re: [FastBit-users] csv with NULLs

K. John Wu Tue, 13 Nov 2012 13:38:37 -0800

Hi, Michael,

Sorry for the delayed response to your questions and suggestions.  It
appears that I have to split the tests in ibis::util::readInt and
ibis::util::readUInt the same way as it is in ibis::util::readDouble.
 This change should address your test case correctly.  The updated
source code is checked in as SVN revision 601.  Please give it a try
when you get the chance.


Thanks.

John


On 11/7/12 4:20 PM, Michael Beauregard wrote:
> Hey John,
> 
> I had a quick look at this problem and found a trivial fix for it.
> Please have a look at the attached patch.
> 
> Thanks!
> 
> Michael
> 
> 
> On Tue, Nov 6, 2012 at 4:27 PM, Michael Beauregard
> <[email protected] <mailto:[email protected]>> wrote:
> 
>     Thankfully I'm able to avoid that problematic situation
>     altogether. However, I have found what appears to be an issue with
>     loading data from csv that has NULL values. Here is an example:
> 
>     in "nulls.csv":
> 
>     1,1,1,1
>     2,2,,2
>     3,,3,3
>     ,4,4,4
> 
>     then load the data with ardea:
> 
>     ardea -d tmp/nulls_test -m a:int,b:int,c:int,d:int -t nulls.csv -v 1
> 
>     ...yields the following surprising output:
> 
>     -- begin printing table in tmp/nulls_test --
>     Table (on disk) T-nulls_test (tmp/nulls_test) consists of 1
>     partition with 4 columns and 4 rows
>     aINT
>     bINT
>     cINT
>     dINT
>     1, 1, 1, 1
>     2, 2, 2147483647 <tel:2147483647>, 2147483647 <tel:2147483647>
>     3, 2147483647 <tel:2147483647>, 2147483647 <tel:2147483647>,
>     2147483647 <tel:2147483647>
>     2147483647 <tel:2147483647>, 2147483647 <tel:2147483647>,
>     2147483647 <tel:2147483647>, 2147483647 <tel:2147483647>
>     --  end  printing table in tmp/nulls_test --
> 
>     Notice that once ardea hits a NULL value, all subsequent values in
>     that row become NULL. I'm currently using svn rev 582.
> 
>     Thanks,
> 
>     Michael
> 
> 
>     On Fri, Nov 2, 2012 at 8:56 PM, K. John Wu <[email protected]
>     <mailto:[email protected]>> wrote:
> 
>         Hi, Michael,
> 
>         I understand your request, however, the current implementation in
>         FastBit deals with group by operations by assuming all values
>         given to
>         them are valid (not NULL).  Clearly, this is a very limited
>         option,
>         however, it covers a good number of use cases.
> 
>         It seems to me that I will need quite a bit of time to add the
>         support
>         for NULL values in group-by operations.  If you have a quick
>         way to
>         get this done, please go ahead with modification or let me
>         know how to
>         do it..
> 
>         John
> 
> 
>         On 11/1/12 4:25 PM, Michael Beauregard wrote:
>         > Thanks for the quick response.
>         >
>         > I was afraid I you were going to suggest that. I think I over
>         > simplified the query making it a poor example of what I
>         intend to do.
>         > I need something more like:
>         >
>         > SELECT a, b, AVG(c), AVG(d), AVG(e), AVG(f), AVG(g), AVG(h),
>         AVG(i),
>         > AVG(j) WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND
>         <some other
>         > constraints>
>         >
>         > and columns 'c' through 'j' can have NULL values at any point.
>         >
>         > Splitting that into 8 queries:
>         >
>         > SELECT a, b, AVG(c) WHERE c NOT NULL AND ...
>         > SELECT a, b, AVG(d) WHERE d NOT NULL AND ...
>         > ...
>         >
>         > means that I'd have to somehow combine those results after
>         the fact
>         > and I'd like to avoid that if possible.
>         >
>         > Could I perform the main query without specifying 'c'
>         through 'j'
>         > aggregations, then using the api to compute each average on the
>         > resulting table so that the non-aggregation part of the
>         query doesn't
>         > have to be repeated? Like:
>         >
>         > SELECT a, b WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5)
>         AND <some
>         > other constraints>
>         >
>         > Then use the resulting table instance to compute
>         aggregations AVG('c')
>         > through AVG('j').
>         >
>         > Or any other suggestions?
>         >
>         > Michael
>         >
>         > On Thu, Nov 1, 2012 at 4:08 PM, K. John Wu <[email protected]
>         <mailto:[email protected]>> wrote:
>         >> You will have to do AVG(a) and AVG(b) separately as follows
>         >>
>         >> SELECT AVG(a) WHERE a not null
>         >> SELECT AVG(b) where b not null
>         >>
>         >> In
>         >>
>         >> SELECT AVG(a), AVG(b) WHERE a not null
>         >>
>         >> there is an implicit 'b not null' also applied, which will
>         give you
>         >> two rows instead of three as in the original example you give.
>         >>
>         >> John
>         >>
>         >>
>         >> On 11/1/12 3:57 PM, Michael Beauregard wrote:
>         >>> I see. I've noticed in past conversations that people
>         generally deal
>         >>> with this by using a magic value instead of NULLs. Which
>         works fine in
>         >>> simple cases like:
>         >>>
>         >>> SELECT AVG(a) WHERE a != <magic>
>         >>>
>         >>> However, I am in a situation where I need the query to
>         return multiple
>         >>> aggregations such as:
>         >>>
>         >>> SELECT AVG(a), AVG(b) WHERE 1 = 1
>         >>>
>         >>> and I would like AVG(a) to average all non-magic values of
>         'a' and
>         >>> AVG(b) to average all non-magic values of 'b'. Any
>         suggestions on now
>         >>> to go about doing this?
>         >>>
>         >>> Thanks,
>         >>>
>         >>> Michael
>         >>>
>         >>> On Thu, Nov 1, 2012 at 3:48 PM, K. John Wu <[email protected]
>         <mailto:[email protected]>> wrote:
>         >>>> Hi, Michael,
>         >>>>
>         >>>> The NULL values in CSV fiels are not ignored by ardea,
>         but by the
>         >>>> query processing code.  Basically, FastBit query results
>         can not
>         >>>> contain NULLs, so  'SELECT a WHERE 1 = 1' will return
>         three rows while
>         >>>> 'SELECT b WHERE 1 = 1' will only return two.
>         >>>>
>         >>>> John
>         >>>>
>         >>>>
>         >>>> On 11/1/12 1:55 PM, Michael Beauregard wrote:
>         >>>>> Hey John,
>         >>>>>
>         >>>>> I'm a bit confused as to how NULLs in csv files are
>         treated by ardea.
>         >>>>> Here is an experiment I just did:
>         >>>>>
>         >>>>> in "nulls.csv":
>         >>>>>
>         >>>>> 10,1
>         >>>>> 20,
>         >>>>> 30,3
>         >>>>>
>         >>>>> then load the data with ardea:
>         >>>>>
>         >>>>> ardea -d tmp/nulls_test -m a:int,b:int -t nulls.csv
>         >>>>>
>         >>>>> then query with ibis:
>         >>>>>
>         >>>>> ibis.sh -d tmp/nulls_test -q 'SELECT a, b WHERE 1 = 1'
>         -o out.txt
>         >>>>>
>         >>>>> cat out.txt
>         >>>>> 10, 1
>         >>>>> 30, 3
>         >>>>>
>         >>>>> For some reason the csv row with a NULL in it is ignored
>         by ardea.
>         >>>>> This is unexpected for me, but I'm wondering if this is
>         working as
>         >>>>> designed or not. If this is expected behaviour, then I'm
>         wondering how
>         >>>>> I can specify NULL values when loading data.
>         >>>>>
>         >>>>> Thanks,
>         >>>>>
>         >>>>> Michael
>         >>>>> _______________________________________________
>         >>>>> FastBit-users mailing list
>         >>>>> [email protected]
>         <mailto:[email protected]>
>         >>>>>
>         https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>         >>>>>
> 
> 
> 
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] csv with NULLs

Reply via email to