Thankfully I'm able to avoid that problematic situation altogether. However, I have found what appears to be an issue with loading data from csv that has NULL values. Here is an example:
in "nulls.csv": 1,1,1,1 2,2,,2 3,,3,3 ,4,4,4 then load the data with ardea: ardea -d tmp/nulls_test -m a:int,b:int,c:int,d:int -t nulls.csv -v 1 ...yields the following surprising output: -- begin printing table in tmp/nulls_test -- Table (on disk) T-nulls_test (tmp/nulls_test) consists of 1 partition with 4 columns and 4 rows a INT b INT c INT d INT 1, 1, 1, 1 2, 2, 2147483647, 2147483647 3, 2147483647, 2147483647, 2147483647 2147483647, 2147483647, 2147483647, 2147483647 -- end printing table in tmp/nulls_test -- Notice that once ardea hits a NULL value, all subsequent values in that row become NULL. I'm currently using svn rev 582. Thanks, Michael On Fri, Nov 2, 2012 at 8:56 PM, K. John Wu <[email protected]> wrote: > Hi, Michael, > > I understand your request, however, the current implementation in > FastBit deals with group by operations by assuming all values given to > them are valid (not NULL). Clearly, this is a very limited option, > however, it covers a good number of use cases. > > It seems to me that I will need quite a bit of time to add the support > for NULL values in group-by operations. If you have a quick way to > get this done, please go ahead with modification or let me know how to > do it.. > > John > > > On 11/1/12 4:25 PM, Michael Beauregard wrote: > > Thanks for the quick response. > > > > I was afraid I you were going to suggest that. I think I over > > simplified the query making it a poor example of what I intend to do. > > I need something more like: > > > > SELECT a, b, AVG(c), AVG(d), AVG(e), AVG(f), AVG(g), AVG(h), AVG(i), > > AVG(j) WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some other > > constraints> > > > > and columns 'c' through 'j' can have NULL values at any point. > > > > Splitting that into 8 queries: > > > > SELECT a, b, AVG(c) WHERE c NOT NULL AND ... > > SELECT a, b, AVG(d) WHERE d NOT NULL AND ... > > ... > > > > means that I'd have to somehow combine those results after the fact > > and I'd like to avoid that if possible. > > > > Could I perform the main query without specifying 'c' through 'j' > > aggregations, then using the api to compute each average on the > > resulting table so that the non-aggregation part of the query doesn't > > have to be repeated? Like: > > > > SELECT a, b WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some > > other constraints> > > > > Then use the resulting table instance to compute aggregations AVG('c') > > through AVG('j'). > > > > Or any other suggestions? > > > > Michael > > > > On Thu, Nov 1, 2012 at 4:08 PM, K. John Wu <[email protected]> wrote: > >> You will have to do AVG(a) and AVG(b) separately as follows > >> > >> SELECT AVG(a) WHERE a not null > >> SELECT AVG(b) where b not null > >> > >> In > >> > >> SELECT AVG(a), AVG(b) WHERE a not null > >> > >> there is an implicit 'b not null' also applied, which will give you > >> two rows instead of three as in the original example you give. > >> > >> John > >> > >> > >> On 11/1/12 3:57 PM, Michael Beauregard wrote: > >>> I see. I've noticed in past conversations that people generally deal > >>> with this by using a magic value instead of NULLs. Which works fine in > >>> simple cases like: > >>> > >>> SELECT AVG(a) WHERE a != <magic> > >>> > >>> However, I am in a situation where I need the query to return multiple > >>> aggregations such as: > >>> > >>> SELECT AVG(a), AVG(b) WHERE 1 = 1 > >>> > >>> and I would like AVG(a) to average all non-magic values of 'a' and > >>> AVG(b) to average all non-magic values of 'b'. Any suggestions on now > >>> to go about doing this? > >>> > >>> Thanks, > >>> > >>> Michael > >>> > >>> On Thu, Nov 1, 2012 at 3:48 PM, K. John Wu <[email protected]> wrote: > >>>> Hi, Michael, > >>>> > >>>> The NULL values in CSV fiels are not ignored by ardea, but by the > >>>> query processing code. Basically, FastBit query results can not > >>>> contain NULLs, so 'SELECT a WHERE 1 = 1' will return three rows while > >>>> 'SELECT b WHERE 1 = 1' will only return two. > >>>> > >>>> John > >>>> > >>>> > >>>> On 11/1/12 1:55 PM, Michael Beauregard wrote: > >>>>> Hey John, > >>>>> > >>>>> I'm a bit confused as to how NULLs in csv files are treated by ardea. > >>>>> Here is an experiment I just did: > >>>>> > >>>>> in "nulls.csv": > >>>>> > >>>>> 10,1 > >>>>> 20, > >>>>> 30,3 > >>>>> > >>>>> then load the data with ardea: > >>>>> > >>>>> ardea -d tmp/nulls_test -m a:int,b:int -t nulls.csv > >>>>> > >>>>> then query with ibis: > >>>>> > >>>>> ibis.sh -d tmp/nulls_test -q 'SELECT a, b WHERE 1 = 1' -o out.txt > >>>>> > >>>>> cat out.txt > >>>>> 10, 1 > >>>>> 30, 3 > >>>>> > >>>>> For some reason the csv row with a NULL in it is ignored by ardea. > >>>>> This is unexpected for me, but I'm wondering if this is working as > >>>>> designed or not. If this is expected behaviour, then I'm wondering > how > >>>>> I can specify NULL values when loading data. > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Michael > >>>>> _______________________________________________ > >>>>> FastBit-users mailing list > >>>>> [email protected] > >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > >>>>> >
_______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
