Hi,
I would be extremely grateful for any insight into how best to do
sub-tree searches using FastBit.
I will describe the problem in detail below and outline a possible
solution using keyword indexes. Unfortunately, I'm getting
"double free or corruption (out)" errors when building keyword indexes.
Description of the problem
-----------------------------------------------------------------------
Each document is identified by a unique id stored in the acc column.
Begin Column
name = "acc"
description = acc
data_type = "TEXT"
index=noindex
End Column
Each document is directly associated with one organism (e.g. mouse [Mus
musculus]).
There are ~500,000 distinct organisms.
There are ~2,000,000,000 documents.
The organisms are organised into a k-ary tree, where k is ~1000, i.e.
each node of the tree, starting from the common root, can have up to
~1000 children.
There are < 1,000,000 nodes in the tree.
The path from the root to a node is typically ~30 nodes long.
For example, the path from the common root to mouse is
(http://www.ebi.ac.uk/ena/data/view/display=html&Taxon:10090):
root; cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa;
Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata;
Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii;
Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires;
Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Mus; Mus
musculus.
I have two types of queries I need to support:
(1) Give me all documents which are directly associated with Mouse
(2) Give me all documents which are directly or indirectly associated
with Mammalia
Query (2) is the problematic one. Not only should I get all documents
directly associated with Mammalia, but also all documents that are
associated with any descendant node of Mammalia, e.g. mouse, human,
cow, dog, cat.
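To make the difference between the two query types concrete, here is a minimal Python sketch on a toy tree. The parent links, document ids and function names are all hypothetical and heavily shortened (intermediate taxa are skipped); it only illustrates the intended semantics and does not use FastBit.

```python
# Hypothetical toy data: child -> parent links for a tiny slice of the tree.
# Intermediate nodes are deliberately left out to keep the example short.
PARENT = {
    "Mammalia": "Amniota",
    "Mus musculus": "Mammalia",
    "Homo sapiens": "Mammalia",
    "Aves": "Amniota",
}

# Hypothetical documents, each directly associated with exactly one organism.
DOC_ORGANISM = {
    "DOC1": "Mus musculus",
    "DOC2": "Homo sapiens",
    "DOC3": "Mammalia",
    "DOC4": "Aves",
}


def lineage(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def direct(organism):
    """Query (1): documents directly associated with the organism."""
    return sorted(d for d, o in DOC_ORGANISM.items() if o == organism)


def subtree(organism):
    """Query (2): documents associated with the organism or any descendant."""
    return sorted(d for d, o in DOC_ORGANISM.items() if organism in lineage(o))
```

With this toy data, direct("Mammalia") yields only DOC3, whereas subtree("Mammalia") also picks up DOC1 and DOC2 because Mammalia lies on their lineage.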
Trying (but failing) to use a keyword index to solve the problem
-----------------------------------------------------------------------
One way to support this type of query is to build a keyword index. We
have used this strategy previously with Apache Lucene. Using
term-document matrix notation
(http://crd.lbl.gov/~kewu/fastbit/doc/indexSpec.html) this would look
like the following for a single document DOC1 which is directly
associated with mouse:
root: DOC1
cellular organisms: DOC1
Eukaryota: DOC1
Fungi/Metazoa group: DOC1
Metazoa: DOC1
Eumetazoa: DOC1
Bilateria: DOC1
Coelomata: DOC1
Deuterostomia: DOC1
Chordata: DOC1
Craniata: DOC1
Vertebrata: DOC1
Gnathostomata: DOC1
Teleostomi: DOC1
Euteleostomi: DOC1
Sarcopterygii: DOC1
Tetrapoda: DOC1
Amniota: DOC1
Mammalia: DOC1
Theria: DOC1
Eutheria: DOC1
Euarchontoglires: DOC1
Glires: DOC1
Rodentia: DOC1
Sciurognathi: DOC1
Muroidea: DOC1
Muridae: DOC1
Murinae: DOC1
Mus: DOC1
Mus musculus: DOC1
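The expansion above can be sketched in a few lines of Python: every term on a document's root-to-node path gets that document added to its list. The lineages, ids and function names below are hypothetical, and the output only mirrors the "term: id" notation used in this message; I have not verified that it matches the exact on-disk format FastBit expects in a .tdlist file.

```python
from collections import defaultdict


def build_postings(doc_lineages):
    """Map every term on a document's root-to-node path to that document."""
    postings = defaultdict(list)
    for doc_id, path in doc_lineages.items():
        for term in path:
            postings[term].append(doc_id)
    return postings


def format_lines(postings):
    """Render postings as 'term: id, id' lines (notation from this message)."""
    return [f"{term}: {', '.join(ids)}" for term, ids in postings.items()]


# Hypothetical input: shortened lineages for two documents.
doc_lineages = {
    "DOC1": ["root", "Mammalia", "Mus", "Mus musculus"],
    "DOC2": ["root", "Mammalia", "Homo", "Homo sapiens"],
}

lines = format_lines(build_postings(doc_lineages))
```

A term shared by both lineages, such as Mammalia, ends up listing both documents, which is what makes query (2) a single keyword lookup.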
I created a keyword.tdlist file for 1,000,000 documents and copied it
into the same directory as -part.txt. The document ids were in the acc
column. There were a total of 50,000,000 term<->id associations.
Files in the directory:
acc dataclass env moltype keyword.tdlist -part.txt
I added the following block to -part.txt:
Begin Column
name = "keyword"
description = keyword
data_type = "TEXT"
index=keywords, docidname=acc
End Column
I can build all other indexes, but whenever I try to build a keyword
index using 'ibis -b 1 -d <directory>' I get the following error:
*** glibc detected *** /homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples/.libs/lt-ibis: double free or corruption (out): 0x0000000038fa4af0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x31e9e722ef]
/lib64/libc.so.6(cfree+0x4b)[0x31e9e7273b]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis11fileManager7storage5clearEv+0x16c)[0x2b65860c767c]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis11fileManager7storageD0Ev+0x17)[0x2b65860e7ff7]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis7array_tIPKcE10freeMemoryEv+0x1a8)[0x2b6585dbf418]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis8keywordsC1EPKNS_6columnES3_PKc+0x5e0)[0x2b65862feaf0]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis5index6createEPKNS_6columnEPKcS5_i+0x24b5)[0x2b6585ee88c5]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZNK4ibis6column9loadIndexEPKci+0x1dd)[0x2b6585f2a81d]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0(_ZN4ibis4part12buildIndexesEPKci+0x715)[0x2b6585bfd165]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples/.libs/lt-ibis[0x428edf]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x31e9e1d994]
/homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples/.libs/lt-ibis(_ZNK4ibis7qIntHod5printERSo+0x39)[0x40a329]
======= Memory map: ========
00400000-0043f000 r-xp 00000000 00:2c 6085843149 /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install/examples/.libs/lt-ibis
0063f000-00640000 rw-p 0003f000 00:2c 6085843149 /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install/examples/.libs/lt-ibis
00640000-00641000 rw-p 00640000 00:00 0
0d010000-38fe9000 rw-p 0d010000 00:00 0 [heap]
31e9a00000-31e9a1c000 r-xp 00000000 fd:00 320314 /lib64/ld-2.5.so
31e9c1b000-31e9c1c000 r--p 0001b000 fd:00 320314 /lib64/ld-2.5.so
31e9c1c000-31e9c1d000 rw-p 0001c000 fd:00 320314 /lib64/ld-2.5.so
31e9e00000-31e9f4d000 r-xp 00000000 fd:00 320315 /lib64/libc-2.5.so
31e9f4d000-31ea14d000 ---p 0014d000 fd:00 320315 /lib64/libc-2.5.so
31ea14d000-31ea151000 r--p 0014d000 fd:00 320315 /lib64/libc-2.5.so
31ea151000-31ea152000 rw-p 00151000 fd:00 320315 /lib64/libc-2.5.so
31ea152000-31ea157000 rw-p 31ea152000 00:00 0
31eaa00000-31eaa16000 r-xp 00000000 fd:00 320317 /lib64/libpthread-2.5.so
31eaa16000-31eac15000 ---p 00016000 fd:00 320317 /lib64/libpthread-2.5.so
31eac15000-31eac16000 r--p 00015000 fd:00 320317 /lib64/libpthread-2.5.so
31eac16000-31eac17000 rw-p 00016000 fd:00 320317 /lib64/libpthread-2.5.so
31eac17000-31eac1b000 rw-p 31eac17000 00:00 0
31eb200000-31eb207000 r-xp 00000000 fd:00 320318 /lib64/librt-2.5.so
31eb207000-31eb407000 ---p 00007000 fd:00 320318 /lib64/librt-2.5.so
31eb407000-31eb408000 r--p 00007000 fd:00 320318 /lib64/librt-2.5.so
31eb408000-31eb409000 rw-p 00008000 fd:00 320318 /lib64/librt-2.5.so
31f9000000-31f900d000 r-xp 00000000 fd:00 320336 /lib64/libgcc_s-4.1.2-20080825.so.1
31f900d000-31f920d000 ---p 0000d000 fd:00 320336 /lib64/libgcc_s-4.1.2-20080825.so.1
31f920d000-31f920e000 rw-p 0000d000 fd:00 320336 /lib64/libgcc_s-4.1.2-20080825.so.1
350ce00000-350ce82000 r-xp 00000000 fd:00 320002 /lib64/libm-2.5.so
350ce82000-350d081000 ---p 00082000 fd:00 320002 /lib64/libm-2.5.so
350d081000-350d082000 r--p 00081000 fd:00 320002 /lib64/libm-2.5.so
350d082000-350d083000 rw-p 00082000 fd:00 320002 /lib64/libm-2.5.so
350d600000-350d6e6000 r-xp 00000000 fd:01 1836571 /usr/lib64/libstdc++.so.6.0.8
350d6e6000-350d8e5000 ---p 000e6000 fd:01 1836571 /usr/lib64/libstdc++.so.6.0.8
350d8e5000-350d8eb000 r--p 000e5000 fd:01 1836571 /usr/lib64/libstdc++.so.6.0.8
350d8eb000-350d8ee000 rw-p 000eb000 fd:01 1836571 /usr/lib64/libstdc++.so.6.0.8
350d8ee000-350d900000 rw-p 350d8ee000 00:00 0
2b658562b000-2b658562d000 rw-p 2b658562b000 00:00 0
2b658562d000-2b6586454000 r-xp 00000000 00:2c 6080016290 /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0.0.9
2b6586454000-2b6586653000 ---p 00e27000 00:2c 6080016290 /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0.0.9
2b6586653000-2b6586669000 rw-p 00e26000 00:2c 6080016290 /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install/src/.libs/libfastbit.so.0.0.9
2b6586682000-2b6586686000 rw-p 2b6586682000 00:00 0
2b6586686000-2b65866bb000 r--s 00000000 fd:03 229378 /var/db/nscd/passwd
2b65866bb000-2b65866bc000 rw-p 2b65866bb000 00:00 0
2b65876bd000-2b6587781000 rw-p 2b65876bd000 00:00 0
7fff2452b000-7fff24540000 rw-p 7ffffffea000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
Warning -- part::readMetaData found 5 columns, but 4 were expected
/homes/rasko/.lsbatch/1285084434.36561: line 8: 31364 Aborted (core dumped) ../fastbit-ibis1.2.0_install/examples/ibis -b 1 -d test_1000000_100000_0.0005_text
The full content of -part.txt is:
# metadata file written by ibis::part::writeMetaData
# on Tue Sep 21 15:51:43 2010 UTC
BEGIN HEADER
Name = "test_1000000_100000_0.0005_text"
Description = "Data initially wrote with ibis::tablex interface on Tue Sep 21 16:50:42 2010 with 4 columns and 100000 rows"
Number_of_columns = 4
Number_of_rows = 1000000
Timestamp = 1285084256
State = 1
END HEADER
Begin Column
name = "acc"
description = acc
data_type = "TEXT"
index=noindex
End Column
Begin Column
name = "dataclass"
description = dataclass = PAT, HTG, CON, STS, HTC, TSA, GSS, WGS, EST
data_type = "CATEGORY"
minimum = 0
maximum = 1000000
End Column
Begin Column
name = "env"
description = env = Y
data_type = "CATEGORY"
minimum = 0
maximum = 1000000
End Column
Begin Column
name = "moltype"
description = moltype = genomic DNA, mRNA, genomic RNA, tRNA, rRNA, other DNA, other RNA, unassigned DNA, unassigned RNA, ..., transcribed RNA
data_type = "CATEGORY"
minimum = 0
maximum = 1000000
End Column
Begin Column
name = "keyword"
description = keyword
data_type = "TEXT"
index=keywords, docidname=acc
End Column
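The warning in the log above reports 5 columns found where 4 were expected, and the header of this file still declares Number_of_columns = 4 while it now contains five column blocks. I do not know whether this mismatch has anything to do with the crash. As a sanity check, here is a small Python sketch that compares the declared count with the number of "Begin Column" blocks. It is plain text parsing with hypothetical names, not part of the FastBit API, and it assumes the file layout shown above.

```python
import re


def check_column_count(text):
    """Return (declared, found) column counts for -part.txt style text."""
    m = re.search(r"^\s*Number_of_columns\s*=\s*(\d+)", text,
                  re.MULTILINE | re.IGNORECASE)
    declared = int(m.group(1)) if m else None
    found = len(re.findall(r"^\s*Begin Column\b", text,
                           re.MULTILINE | re.IGNORECASE))
    return declared, found


# Abbreviated stand-in for the file above: same header count, same five
# column names, all other fields omitted.
names = ["acc", "dataclass", "env", "moltype", "keyword"]
sample = "BEGIN HEADER\nNumber_of_columns = 4\nEND HEADER\n" + "".join(
    f'Begin Column\nname = "{n}"\nEnd Column\n' for n in names
)

declared, found = check_column_count(sample)
if declared != found:
    print(f"header declares {declared} columns, file contains {found}")
```

To run it against a real file, read the file contents into a string and pass that to check_column_count in place of sample.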
Any help is very much appreciated.
Rasko Leinonen
European Nucleotide Archive (ENA)
EMBL-EBI