Hey John, I had a quick look at this problem and found a trivial fix for it. Please have a look at the attached patch.
Thanks! Michael On Tue, Nov 6, 2012 at 4:27 PM, Michael Beauregard < [email protected]> wrote: > Thankfully I'm able to avoid that problematic situation altogether. > However, I have found what appears to be an issue with loading data from > csv that has NULL values. Here is an example: > > in "nulls.csv": > > 1,1,1,1 > 2,2,,2 > 3,,3,3 > ,4,4,4 > > then load the data with ardea: > > ardea -d tmp/nulls_test -m a:int,b:int,c:int,d:int -t nulls.csv -v 1 > > ...yields the following surprising output: > > -- begin printing table in tmp/nulls_test -- > Table (on disk) T-nulls_test (tmp/nulls_test) consists of 1 partition with > 4 columns and 4 rows > a INT > b INT > c INT > d INT > 1, 1, 1, 1 > 2, 2, 2147483647, 2147483647 > 3, 2147483647, 2147483647, 2147483647 > 2147483647, 2147483647, 2147483647, 2147483647 > -- end printing table in tmp/nulls_test -- > > Notice that once ardea hits a NULL value, all subsequent values in that > row become NULL. I'm currently using svn rev 582. > > Thanks, > > Michael > > > On Fri, Nov 2, 2012 at 8:56 PM, K. John Wu <[email protected]> wrote: > >> Hi, Michael, >> >> I understand your request, however, the current implementation in >> FastBit deals with group by operations by assuming all values given to >> them are valid (not NULL). Clearly, this is a very limited option, >> however, it covers a good number of use cases. >> >> It seems to me that I will need quite a bit of time to add the support >> for NULL values in group-by operations. If you have a quick way to >> get this done, please go ahead with modification or let me know how to >> do it.. >> >> John >> >> >> On 11/1/12 4:25 PM, Michael Beauregard wrote: >> > Thanks for the quick response. >> > >> > I was afraid I you were going to suggest that. I think I over >> > simplified the query making it a poor example of what I intend to do. >> > I need something more like: >> > >> > SELECT a, b, AVG(c), AVG(d), AVG(e), AVG(f), AVG(g), AVG(h), AVG(i), >> > AVG(j) WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some other >> > constraints> >> > >> > and columns 'c' through 'j' can have NULL values at any point. >> > >> > Splitting that into 8 queries: >> > >> > SELECT a, b, AVG(c) WHERE c NOT NULL AND ... >> > SELECT a, b, AVG(d) WHERE d NOT NULL AND ... >> > ... >> > >> > means that I'd have to somehow combine those results after the fact >> > and I'd like to avoid that if possible. >> > >> > Could I perform the main query without specifying 'c' through 'j' >> > aggregations, then using the api to compute each average on the >> > resulting table so that the non-aggregation part of the query doesn't >> > have to be repeated? Like: >> > >> > SELECT a, b WHERE a NOT IN (1, 2, 3) AND b NOT IN (3, 4, 5) AND <some >> > other constraints> >> > >> > Then use the resulting table instance to compute aggregations AVG('c') >> > through AVG('j'). >> > >> > Or any other suggestions? >> > >> > Michael >> > >> > On Thu, Nov 1, 2012 at 4:08 PM, K. John Wu <[email protected]> wrote: >> >> You will have to do AVG(a) and AVG(b) separately as follows >> >> >> >> SELECT AVG(a) WHERE a not null >> >> SELECT AVG(b) where b not null >> >> >> >> In >> >> >> >> SELECT AVG(a), AVG(b) WHERE a not null >> >> >> >> there is an implicit 'b not null' also applied, which will give you >> >> two rows instead of three as in the original example you give. >> >> >> >> John >> >> >> >> >> >> On 11/1/12 3:57 PM, Michael Beauregard wrote: >> >>> I see. I've noticed in past conversations that people generally deal >> >>> with this by using a magic value instead of NULLs. Which works fine in >> >>> simple cases like: >> >>> >> >>> SELECT AVG(a) WHERE a != <magic> >> >>> >> >>> However, I am in a situation where I need the query to return multiple >> >>> aggregations such as: >> >>> >> >>> SELECT AVG(a), AVG(b) WHERE 1 = 1 >> >>> >> >>> and I would like AVG(a) to average all non-magic values of 'a' and >> >>> AVG(b) to average all non-magic values of 'b'. Any suggestions on now >> >>> to go about doing this? >> >>> >> >>> Thanks, >> >>> >> >>> Michael >> >>> >> >>> On Thu, Nov 1, 2012 at 3:48 PM, K. John Wu <[email protected]> wrote: >> >>>> Hi, Michael, >> >>>> >> >>>> The NULL values in CSV fiels are not ignored by ardea, but by the >> >>>> query processing code. Basically, FastBit query results can not >> >>>> contain NULLs, so 'SELECT a WHERE 1 = 1' will return three rows >> while >> >>>> 'SELECT b WHERE 1 = 1' will only return two. >> >>>> >> >>>> John >> >>>> >> >>>> >> >>>> On 11/1/12 1:55 PM, Michael Beauregard wrote: >> >>>>> Hey John, >> >>>>> >> >>>>> I'm a bit confused as to how NULLs in csv files are treated by >> ardea. >> >>>>> Here is an experiment I just did: >> >>>>> >> >>>>> in "nulls.csv": >> >>>>> >> >>>>> 10,1 >> >>>>> 20, >> >>>>> 30,3 >> >>>>> >> >>>>> then load the data with ardea: >> >>>>> >> >>>>> ardea -d tmp/nulls_test -m a:int,b:int -t nulls.csv >> >>>>> >> >>>>> then query with ibis: >> >>>>> >> >>>>> ibis.sh -d tmp/nulls_test -q 'SELECT a, b WHERE 1 = 1' -o out.txt >> >>>>> >> >>>>> cat out.txt >> >>>>> 10, 1 >> >>>>> 30, 3 >> >>>>> >> >>>>> For some reason the csv row with a NULL in it is ignored by ardea. >> >>>>> This is unexpected for me, but I'm wondering if this is working as >> >>>>> designed or not. If this is expected behaviour, then I'm wondering >> how >> >>>>> I can specify NULL values when loading data. >> >>>>> >> >>>>> Thanks, >> >>>>> >> >>>>> Michael >> >>>>> _______________________________________________ >> >>>>> FastBit-users mailing list >> >>>>> [email protected] >> >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> >>>>> >> > >
csv-nulls.patch
Description: Binary data
_______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
