Oleg,

Regarding your original question about FastBit performance, I believe
that Steve is on the right track when he asks how much time is
spent on reading the output values.  The gist is that since you are
asking for output of all columns, it will take time to read those values.

When you output those values without indexes, one simply reads the
original data files, examines the values, and outputs those that
satisfy the required conditions.  In this process, the original data
files should be read only once.

However, if one is processing the same query with indexes, one would
have to read the indexes first and then the raw data files to get
the values.  Potentially, more files will be touched, but typically
fewer bytes will be read, which explains your observation that FastBit
takes less time.  However, the time needed to read the output
values is unavoidable and in your case could dominate the query
processing time used by FastBit, which is what Steve was asking about.
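
To put rough numbers on "fewer bytes", here is a back-of-the-envelope
sketch in Python using the row counts from your message.  It rests on
several assumptions of mine: each column is stored as a raw fixed-width
file (4, 2, or 1 bytes per value, following your ardea type list), the
page size is 4096 bytes, and in the worst case every one of the 6309
hits lands on a different page of every column file.  The function
names are made up for illustration, and the bytes spent reading the
indexes themselves are not counted.

```python
# Column widths in bytes, assuming raw fixed-width column files:
# unsigned int = 4, unsigned short = 2, unsigned byte = 1.
WIDTHS = {
    "DPKTS": 4, "DOCTETS": 4, "FIRST": 4, "LAST": 4,
    "SRCADDR": 4, "DSTADDR": 4, "NETADDR": 4,
    "SRCPORT": 2, "DSTPORT": 2, "PROTO": 1,
    "NATPORT": 2, "DSTAS": 2,
}
ROWS = 131472840   # rows reported by ardea
HITS = 6309        # rows reported by the query
PAGE = 4096        # assumed page size in bytes


def full_scan_bytes(rows, widths):
    """Bytes read when every column file is read once from start to end."""
    return rows * sum(widths.values())


def scattered_upper_bound(rows, hits, widths, page=PAGE):
    """Worst-case bytes read when each hit pulls in one whole page per column.

    A column file can never contribute more than its own size.
    """
    return sum(min(hits * page, rows * w) for w in widths.values())


if __name__ == "__main__":
    print("full scan      :", full_scan_bytes(ROWS, WIDTHS))
    print("scattered (max):", scattered_upper_bound(ROWS, HITS, WIDTHS))
```

Under those assumptions the full scan of the column files is about
4.9 GB, while the scattered reads are at most about 310 MB.  Note that
this compares two ways of reading the FastBit column files; flow-tools
reads its own binary format, so its byte count will differ.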

If the output values are somewhat scattered in the data files, which
is very likely to be the case, then the effective I/O speed would be
relatively low for FastBit.  In contrast, the simpler method reads
all the values sequentially, which typically achieves a higher I/O
speed.
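
The two access patterns described above can be sketched as follows.
This is only an illustration in Python, not how FastBit or flow-tools
is implemented: it assumes a single column stored as little-endian
4-byte unsigned integers, and both function names are hypothetical.

```python
import struct

VALUE_SIZE = 4  # assumed: one little-endian unsigned 32-bit value per row


def scan_sequential(path, predicate):
    """Read the whole file in order and keep the values passing predicate."""
    matches = []
    with open(path, "rb") as f:
        # 4096 is a multiple of VALUE_SIZE, so chunks never split a value.
        while chunk := f.read(4096):
            for (value,) in struct.iter_unpack("<I", chunk):
                if predicate(value):
                    matches.append(value)
    return matches


def fetch_scattered(path, row_ids):
    """Seek to each row number (e.g. from an index) and read only that value."""
    values = []
    with open(path, "rb") as f:
        for row in row_ids:
            f.seek(row * VALUE_SIZE)
            values.append(struct.unpack("<I", f.read(VALUE_SIZE))[0])
    return values
```

The first function touches every byte but in one sequential pass; the
second touches few bytes but issues one seek per hit, which is where
the lower effective I/O speed comes from when the hits are far apart.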

Your observation regarding the drop_caches test can probably be
explained by the fact that FastBit is touching more files and reading
them in more complex ways.  If you force the system to forget what it
has done before, it will need to establish the more complex access
pattern again, which will take more time.  I hope that you can let
the virtual memory system do its thing with the file cache, which
would avoid the extra time with FastBit.

Hope this helps.

John



On 10/21/15 12:20 PM, Enns, Steven wrote:
> I'm curious to see roughly how much query time is spent reading selected
> column data vs evaluating the condition.  Maybe try to select just a
> single column with the same where clause.
> 
> 
> On 10/21/15, 10:04 AM, "[email protected] on behalf of
> Oleg Gawriloff" <[email protected] on behalf of
> [email protected]> wrote:
> 
>> We are looking for some DB for our ITAS system (Internet Traffic Archive
>> System). Requirements are pretty straightforward: all data are numeric,
>> very large quantity of strings (8G per day), only thing we need is fast
>> search on that by limited number of fields. So, after some research I
>> found out that fastbit used in similar systems at ntop/solarwinds NTA
>> projects and performed some tests to clarify whether is it good or not.
>> Results seems very strange to me:
>>
>> We have our test data, 131M strings. All data are numeric. I converted
>> them to csv from binary format used by our netflow-collector
>> (flow-tools) and put them in fastbit using:
>>
>> ardea -d /var/tmp/backup/fastbit/tmp2.1 -m "DPKTS:unsigned
>> int,DOCTETS:unsigned int,FIRST:unsigned int,LAST:unsigned
>> int,SRCADDR:unsigned int,DSTADDR:unsigned int,NETADDR:unsigned
>> int,SRCPORT:unsigned short,DSTPORT:unsigned short,PROTO:unsigned
>> byte,NATPORT:unsigned short,DSTAS:unsigned short" -t one2.1.csv
>> ardea read 131472840 rows from one3.1.csv
>>
>> ardea -- duration: 171.51 sec(CPU), 210.203 sec(elapsed)
>>
>> after that I performed simple search 10 times like that:
>>
>> time ibis -d tmp3.1 -q "SELECT
>> FIRST,LAST,PROTO,SRCADDR,SRCPORT,DSTADDR,DSTPORT,NETADDR,NATPORT,DSTAS,
>> DOCTETS,DPKTS FROM tmp3.1 WHERE FIRST>1441054677 AND LAST<1441065957 AND
>> SRCADDR=1481989497 or DSTADDR=1481989497" -output res.txt
>>
>> doQuaere -- "SELECT
>> FIRST,LAST,PROTO,SRCADDR,SRCPORT,DSTADDR,DSTPORT,NETADDR,NATPORT,DSTAS,
>> DOCTETS,DPKTS FROM tmp3.1 WHERE FIRST>1441054677 AND LAST<1441065957 AND
>> SRCADDR=1481989497 or DSTADDR=1481989497" produced a table with 6309 rows
>> and 12 columns
>>
>> real    0m2.258s
>>
>> First time was a long one, because of index creation, but others show
>> similar executions time (2.5sec).
>>
>> The problem that linear search by flow-tools on the same data shows that
>> fastbit only 6 times faster.
>>
>>  time flow-cat ft_uncompressed3* | flow-nfilter -f nfilter.cfg -F
>> F_TIME_IP |  flow-export -f 2
>> -mDPKTS,DOCTETS,FIRST,LAST,SRCADDR,DSTADDR,NEXTHOP,SRCPORT,DSTPORT,PROT,SRC_AS,DST_AS > res.txt
>> flow-export: Exported 6309 records
>>
>> real    0m12.083s
>>
>>
>> Which is worse - when I drop file cache flow-tools search time does not
>> change, but fastbit jumps to 9 sec from 2.
>>
>> albatros2 fastbit # echo 3 > /proc/sys/vm/drop_caches && time ibis -d
>> tmp3.1 -q "SELECT
>> FIRST,LAST,PROTO,SRCADDR,SRCPORT,DSTADDR,DSTPORT,NETADDR,NATPORT,DSTAS,
>> DOCTETS,DPKTS FROM tmp3.1 WHERE FIRST>1441054677 AND LAST<1441065957 AND
>> SRCADDR=1481989497 or DSTADDR=1481989497" -output res.txt
>>
>> doQuaere -- "SELECT
>> FIRST,LAST,PROTO,SRCADDR,SRCPORT,DSTADDR,DSTPORT,NETADDR,NATPORT,DSTAS,
>> DOCTETS,DPKTS FROM tmp3.1 WHERE FIRST>1441054677 AND LAST<1441065957 AND
>> SRCADDR=1481989497 or DSTADDR=1481989497" produced a table with 6309 rows
>> and 12 columns
>>
>> real    0m9.744s
>>
>>
>> I thought in either case results will be much better. May be I miss smth?
>>
>>
>> -- 
>> Signed, Oleg Gawriloff.
>>
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 