Re: [FastBit-users] Segmentation fault with alias in where clause

K. John Wu Fri, 12 Oct 2012 08:38:21 -0700

Hi, Petr,

Yes, you are on the right track.


I see that you are concerned about the memory usage during the first
phase for computing the aggregations.  In this case, you can choose to
use separable operators for aggregation, basically avoiding median and
distinct.  When separable operations for aggregation, the evaluation
can proceed by parts.  This way the aggregations can be completed with
one pass through the data and the memory requirement is limited to
storing necessary columns of a single partition and the result.

You can control the memory usage by keep the partitions reasonably
small and the number of results relatively small.

Hope this helps.

John



On 10/12/12 5:16 AM, Petr Velan wrote:
> Hi John,
> 
> Thank you for the detailed explanation. I understand your point
> regarding aggregation.
> 
> Imagine a case where I want to filter the results by some computed
> column. Then I also need to do two-step query, one to create the
> columns, second to filter them. And this could be a problem for large
> tables, since adding new columns in a first phase will enlarge the
> intermediate table and everything must be read to memory before phase
> to - the filtering. So the least memory demanding solution would be to
> create the table from partition, so that it is not yet read to memory.
> Then use cursor to read line by line, calculate the required columns
> myself and use them to decide whether to use the line or not. This way
> I need almost no memory at all.
> 
> Could you confirm that my reasoning is correct? I might still use
> FastBit to do this kind of filtering just to simplify things unless I
> actually run into memory problems. Then I would have to switch to the
> hand-crafted filtering.
> 
> Thanks,
> Petr
> 
> On 8 October 2012 01:19, K. John Wu <[email protected]> wrote:
>> Hi, Petr,
>>
>> The having clause is essentially telling DBMS to build an in-memory
>> data table based on the traditional part of the query and then perform
>> further evaluation on the in-memory part.  The DBMS might have fancier
>> implementations of this that requires less memory, but the logic is
>> the same.
>>
>> Regarding the memory usage concern, there are two aspects to it: the
>> memory require to generate the intermediate data table and the
>> intermediate data table itself.
>>
>> The intermediate data table is generated through a set of aggregation
>> operations.  These aggregation operations can be divided into
>> separable and non-separable types.  The operator median and
>> count-distinct are two prime examples of non-separable ones.  To
>> evaluate these operations, one has to accumulate all the data records
>> from different data partitions together into one place - this is very
>> expensive in terms of memory requirement.  Most of the other
>> operations are either separable or can be transformed into separable.
>>  For example, SUM can be computed in parts, which is a good example of
>> a separable aggregation operation.  The operation AVG is not a
>> separable operation on its own, however, it can be transformed to be
>> SUM(....) / count(*), which can be evaluated in parts, i.e., they are
>> separable.  The existing FastBit software already does this type of
>> conversion.  The separable functions can be performed in parts and
>> therefore require less memory.  Therefore, the memory constraints
>> during the evaluation of the aggregations should be manageable as long
>> as you can avoid non-separable operations.
>>
>> The second factor in the memory concern is that the intermediate data
>> table might be large.  In your specific example, the intermediate
>> table has only a single row.  Therefore, the memory required for this
>> table could not be a problem.  In general, you can control the
>> intermediate table size by controlling the granularity of the group-by
>> operations.  For example
>>
>> Select e0id2, sum(e0id1), count(*) where e0id3 > 1000000
>>
>> will have as many rows as the number of distinct values of variable
>> e0id2.  It is straightforward to reduce this number by a factor of 10
>> by doing the following
>>
>> Select floor(e0id2/10), sum(e0i21), count(*) where e0id3 > 1000000
>>
>> You can reduce this further if it makes sense to your application
>>
>> Select floor(log(e0id2)/log(2)), sum(e0i21), count(*) where e0id3 >
>> 1000000
>>
>>
>> Hope this helps.
>>
>> John
>>
>>
>>
>> On 10/7/12 5:38 AM, Petr Velan wrote:
>>> Hi John,
>>>
>>> so what I need to perform this kind of queries is something like
>>> HAVING clause? Or I can just take the resulting table and run the
>>> filter, right? But in such case I'm worried about memory requirements.
>>> To be a little more specific, I can do
>>>
>>> result1 = Select sum(e0id1) as bytes, count(*) as count from testdata
>>> where bytes = 40
>>> result2 = Select bytes, count from result1 where bytes = 40
>>>
>>> But the result1 table needs to fit into the memory, right? So my
>>> question is, is there any other way to perform this kind of query
>>> without higher memory requirements?
>>>
>>> Thanks,
>>> Petr
>>>
>>> On 6 October 2012 06:43, K. John Wu <[email protected]> wrote:
>>>> Hi, Petr,
>>>>
>>>> The behavior of FastBit in this case is not graceful, however, the
>>>> query is not well formed.  Let me explain.
>>>>
>>>> Your query can be rewritten as
>>>>
>>>> Select sum(e0id1) as bytes, count(*) from testdata where bytes = 40
>>>>
>>>> The basic problem is that bytes is the result of the query, but it is
>>>> also used as part of the where clause.  There is no way to evaluate
>>>> the where clause because it depends on the result of the query.  Oh,
>>>> well, hope it makes sense.
>>>>
>>>> Will see what I can make the error message more obvious..
>>>>
>>>> John
>>>>
>>>>
>>>> On 10/5/12 1:10 AM, Petr Velan wrote:
>>>>> Hi John,
>>>>>
>>>>> I decided to take advantage of named columns in select clause. I tried to 
>>>>> use
>>>>>
>>>>> table->select("sum(e0id1) as bytes, count(*) as fl", "bytes = 40");
>>>>>
>>>>> but the result is segmentation fault. Is it possible to use named
>>>>> columns in the where clause?
>>>>> Please find the test data and short program attached.
>>>>>
>>>>> Best regards,
>>>>> Petr
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> FastBit-users mailing list
>>>>> [email protected]
>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Segmentation fault with alias in where clause

Reply via email to