Re: [FastBit-users] Segmentation fault with alias in where clause

K. John Wu Sun, 07 Oct 2012 16:19:29 -0700

Hi, Petr,

The having clause is essentially telling DBMS to build an in-memory
data table based on the traditional part of the query and then perform
further evaluation on the in-memory part.  The DBMS might have fancier
implementations of this that requires less memory, but the logic is
the same.

Regarding the memory usage concern, there are two aspects to it: the
memory require to generate the intermediate data table and the
intermediate data table itself.

The intermediate data table is generated through a set of aggregation
operations.  These aggregation operations can be divided into
separable and non-separable types.  The operator median and
count-distinct are two prime examples of non-separable ones.  To
evaluate these operations, one has to accumulate all the data records
from different data partitions together into one place - this is very
expensive in terms of memory requirement.  Most of the other
operations are either separable or can be transformed into separable.
 For example, SUM can be computed in parts, which is a good example of
a separable aggregation operation.  The operation AVG is not a
separable operation on its own, however, it can be transformed to be
SUM(....) / count(*), which can be evaluated in parts, i.e., they are
separable.  The existing FastBit software already does this type of
conversion.  The separable functions can be performed in parts and
therefore require less memory.  Therefore, the memory constraints
during the evaluation of the aggregations should be manageable as long
as you can avoid non-separable operations.

The second factor in the memory concern is that the intermediate data
table might be large.  In your specific example, the intermediate
table has only a single row.  Therefore, the memory required for this
table could not be a problem.  In general, you can control the
intermediate table size by controlling the granularity of the group-by
operations.  For example

Select e0id2, sum(e0id1), count(*) where e0id3 > 1000000

will have as many rows as the number of distinct values of variable
e0id2.  It is straightforward to reduce this number by a factor of 10
by doing the following

Select floor(e0id2/10), sum(e0i21), count(*) where e0id3 > 1000000

You can reduce this further if it makes sense to your application

Select floor(log(e0id2)/log(2)), sum(e0i21), count(*) where e0id3 >
1000000

Hope this helps.

John

On 10/7/12 5:38 AM, Petr Velan wrote:
> Hi John,
> 
> so what I need to perform this kind of queries is something like
> HAVING clause? Or I can just take the resulting table and run the
> filter, right? But in such case I'm worried about memory requirements.
> To be a little more specific, I can do
> 
> result1 = Select sum(e0id1) as bytes, count(*) as count from testdata
> where bytes = 40
> result2 = Select bytes, count from result1 where bytes = 40
> 
> But the result1 table needs to fit into the memory, right? So my
> question is, is there any other way to perform this kind of query
> without higher memory requirements?
> 
> Thanks,
> Petr
> 
> On 6 October 2012 06:43, K. John Wu <[email protected]> wrote:
>> Hi, Petr,
>>
>> The behavior of FastBit in this case is not graceful, however, the
>> query is not well formed.  Let me explain.
>>
>> Your query can be rewritten as
>>
>> Select sum(e0id1) as bytes, count(*) from testdata where bytes = 40
>>
>> The basic problem is that bytes is the result of the query, but it is
>> also used as part of the where clause.  There is no way to evaluate
>> the where clause because it depends on the result of the query.  Oh,
>> well, hope it makes sense.
>>
>> Will see what I can make the error message more obvious..
>>
>> John
>>
>>
>> On 10/5/12 1:10 AM, Petr Velan wrote:
>>> Hi John,
>>>
>>> I decided to take advantage of named columns in select clause. I tried to 
>>> use
>>>
>>> table->select("sum(e0id1) as bytes, count(*) as fl", "bytes = 40");
>>>
>>> but the result is segmentation fault. Is it possible to use named
>>> columns in the where clause?
>>> Please find the test data and short program attached.
>>>
>>> Best regards,
>>> Petr
>>>
>>>
>>>
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Segmentation fault with alias in where clause

Reply via email to