Hi, Rasko,

It looks like we have neglected the keyword index for a while. We have
addressed this problem by adding two small test cases to the
regression test suite.  The current code in SVN (revision 326)
passes the regression tests.  Please give the new code a try when you
get the chance and let us know how it works for you.  Thanks.

John

PS: The source code is available through the following svn command:

svn checkout https://codeforge.lbl.gov/anonscm/fastbit

Note that this repository is a mirror of the internal SVN repository.
If you don't get revision 326 (or newer), you might have to wait up to
an hour for the mirroring to occur.



On 9/22/2010 2:06 AM, Rasko Leinonen wrote:
>   Hi,
>
> I would be extremely grateful for any insight into how best to do
> sub-tree searches using FastBit.
>
> I will try to describe the problem below in detail and outline a
> possible solution using keyword indexes. Unfortunately, I'm getting
> "double free or corruption (out)" errors when building keyword indexes.
>
>
> Description of the problem
> -----------------------------------------------------------------------
>
> Each document is identified by a unique id stored in acc column.
>
> Begin Column
> name = "acc"
> description = acc
> data_type = "TEXT"
> index=noindex
> End Column
>
> Each document is directly associated with one organism (e.g. mouse
> [mus musculus]).
>
> There are ~ 500,000 distinct organisms.
>
> There are ~ 2,000,000,000 documents.
>
> The organisms are organised into a k-ary tree, where k is ~ 1000. I.e.
> starting from the common root each node of the tree can have up to ~
> 1000 children.
>
> There are < 1,000,000 nodes in the tree.
>
> The path from root to node is typically ~ 30 nodes deep.
>
> For example, the path from the common root to mouse is
> (http://www.ebi.ac.uk/ena/data/view/display=html&Taxon:10090):
>
> root; cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa;
> Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata;
> Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii;
> Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires;
> Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Mus; Mus
> musculus.
>
> I have two types of queries I need to support:
>
> (1) Give me all documents which are directly associated with Mouse
> (2) Give me all documents which are directly or indirectly associated
> with Mammalia
>
> Query (2) is the problematic one. Not only should I get all documents
> directly associated with Mammalia but also all documents that are
> associated with any of the child nodes of Mammalia, including e.g.
> mouse, human, cow, dog, cat.
>
>
> Trying (but failing) to use a keyword index to solve the problem
> -----------------------------------------------------------------------
>
> One way to support this type of query is to build a keyword index. We
> have used this strategy previously with Apache Lucene. Using
> term-document matrix notation
> (http://crd.lbl.gov/~kewu/fastbit/doc/indexSpec.html), this would
> look like the following for a single document DOC1 which is directly
> associated with mouse:
>
> root: DOC1
> cellular organisms: DOC1
> Eukaryota: DOC1
> Fungi/Metazoa group: DOC1
> Metazoa: DOC1
> Eumetazoa: DOC1
> Bilateria: DOC1
> Coelomata: DOC1
> Deuterostomia: DOC1
> Chordata: DOC1
> Craniata: DOC1
> Vertebrata: DOC1
> Gnathostomata: DOC1
> Teleostomi: DOC1
> Euteleostomi: DOC1
> Sarcopterygii: DOC1
> Tetrapoda: DOC1
> Amniota: DOC1
> Mammalia: DOC1
> Theria: DOC1
> Eutheria: DOC1
> Euarchontoglires: DOC1
> Glires: DOC1
> Rodentia: DOC1
> Sciurognathi: DOC1
> Muroidea: DOC1
> Muridae: DOC1
> Murinae: DOC1
> Mus: DOC1
> Mus musculus: DOC1
>
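A minimal sketch of how such term lists could be generated from the tree, assuming the taxonomy is available as a child-to-parent mapping. The names `parent`, `doc_to_node`, `path_to_root` and `build_term_lists` are hypothetical, the toy taxonomy is heavily abbreviated, and the printed layout only mirrors the "term: DOC" notation above (the separator between multiple ids is an assumption, so check it against the index specification before feeding the output to FastBit):

```python
from collections import defaultdict


def path_to_root(node, parent):
    """Return the nodes on the path from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)  # the root has no parent, which ends the loop
    path.reverse()
    return path


def build_term_lists(doc_to_node, parent):
    """Map every node name to the documents at or below that node."""
    terms = defaultdict(list)
    for doc, node in doc_to_node.items():
        # A document is listed under its own node and under every ancestor.
        for term in path_to_root(node, parent):
            terms[term].append(doc)
    return terms


# Hypothetical, heavily abbreviated taxonomy (child -> parent).
parent = {
    "root": None,
    "Mammalia": "root",
    "Rodentia": "Mammalia",
    "Mus musculus": "Rodentia",
    "Homo sapiens": "Mammalia",
}
# Hypothetical documents, each directly associated with one organism.
doc_to_node = {"DOC1": "Mus musculus", "DOC2": "Homo sapiens", "DOC3": "Mammalia"}

terms = build_term_lists(doc_to_node, parent)
for term, docs in terms.items():
    # Assumed layout: one term per line, ids separated by ", ".
    print(f"{term}: {', '.join(docs)}")
```

With lists built this way, query (1) looks up the leaf term and query (2) looks up the inner node's term, so neither needs to walk the tree at query time. The cost is one entry per document per ancestor, which is consistent with the roughly 50 entries per document reported below.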
> I created a keyword.tdlist file for 1,000,000 documents and copied it
> into the same directory as -part.txt. The document ids were in the acc
> column. There were a total of 50,000,000 term<->id associations.
>
> Files in the directory:
>
> acc dataclass env moltype keyword.tdlist -part.txt
>
> I added the following block to -part.txt:
>
> Begin Column
> name = "keyword"
> description = keyword
> data_type = "TEXT"
> index=keywords, docidname=acc
> End Column
>
> I can build all other indexes, but whenever I try to build a keyword
> index using 'ibis -b 1 -d <directory>' I get the following error:
>
> *** glibc detected ***
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples//.libs/lt-ibis:
> double free or corruption (out): 0x0000000038fa4af0 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x31e9e722ef]
> /lib64/libc.so.6(cfree+0x4b)[0x31e9e7273b]
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis11fileManager7storage5clearEv+0x16c)[0x2b65860c767c]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis11fileManager7storageD0Ev+0x17)[0x2b65860e7ff7]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis7array_tIPKcE10freeMemoryEv+0x1a8)[0x2b6585dbf418]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis8keywordsC1EPKNS_6columnES3_PKc+0x5e0)[0x2b65862feaf0]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis5index6createEPKNS_6columnEPKcS5_i+0x24b5)[0x2b6585ee88c5]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZNK4ibis6column9loadIndexEPKci+0x1dd)[0x2b6585f2a81d]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/src//.libs/libfastbit.so.0(_ZN4ibis4part12buildIndexesEPKci+0x715)[0x2b6585bfd165]
>
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples//.libs/lt-ibis[0x428edf]
>
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x31e9e1d994]
> //homes/rasko/developer/fastbit/fastbit-ibis1.2.0_install/examples//.libs/lt-ibis(_ZNK4ibis7qIntHod5printERSo+0x39)[0x40a329]
>
> ======= Memory map: ========
> 00400000-0043f000 r-xp 00000000 00:2c 6085843149
> /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install
>
> //examples//.libs/lt-ibis
> 0063f000-00640000 rw-p 0003f000 00:2c 6085843149
> /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install
>
> //examples//.libs/lt-ibis
> 00640000-00641000 rw-p 00640000 00:00 0
> 0d010000-38fe9000 rw-p 0d010000 00:00 0 [heap]
> 31e9a00000-31e9a1c000 r-xp 00000000 fd:00 320314 /lib64/ld-2.5.so
> 31e9c1b000-31e9c1c000 r--p 0001b000 fd:00 320314 /lib64/ld-2.5.so
> 31e9c1c000-31e9c1d000 rw-p 0001c000 fd:00 320314 /lib64/ld-2.5.so
> 31e9e00000-31e9f4d000 r-xp 00000000 fd:00 320315 /lib64/libc-2.5.so
> 31e9f4d000-31ea14d000 ---p 0014d000 fd:00 320315 /lib64/libc-2.5.so
> 31ea14d000-31ea151000 r--p 0014d000 fd:00 320315 /lib64/libc-2.5.so
> 31ea151000-31ea152000 rw-p 00151000 fd:00 320315 /lib64/libc-2.5.so
> 31ea152000-31ea157000 rw-p 31ea152000 00:00 0
> 31eaa00000-31eaa16000 r-xp 00000000 fd:00 320317 /lib64/libpthread-2.5.so
> 31eaa16000-31eac15000 ---p 00016000 fd:00 320317 /lib64/libpthread-2.5.so
> 31eac15000-31eac16000 r--p 00015000 fd:00 320317 /lib64/libpthread-2.5.so
> 31eac16000-31eac17000 rw-p 00016000 fd:00 320317 /lib64/libpthread-2.5.so
> 31eac17000-31eac1b000 rw-p 31eac17000 00:00 0
> 31eb200000-31eb207000 r-xp 00000000 fd:00 320318 /lib64/librt-2.5.so
> 31eb207000-31eb407000 ---p 00007000 fd:00 320318 /lib64/librt-2.5.so
> 31eb407000-31eb408000 r--p 00007000 fd:00 320318 /lib64/librt-2.5.so
> 31eb408000-31eb409000 rw-p 00008000 fd:00 320318 /lib64/librt-2.5.so
> 31f9000000-31f900d000 r-xp 00000000 fd:00 320336
> /lib64/libgcc_s-4.1.2-20080825.so.1
> 31f900d000-31f920d000 ---p 0000d000 fd:00 320336
> /lib64/libgcc_s-4.1.2-20080825.so.1
> 31f920d000-31f920e000 rw-p 0000d000 fd:00 320336
> /lib64/libgcc_s-4.1.2-20080825.so.1
> 350ce00000-350ce82000 r-xp 00000000 fd:00 320002 /lib64/libm-2.5.so
> 350ce82000-350d081000 ---p 00082000 fd:00 320002 /lib64/libm-2.5.so
> 350d081000-350d082000 r--p 00081000 fd:00 320002 /lib64/libm-2.5.so
> 350d082000-350d083000 rw-p 00082000 fd:00 320002 /lib64/libm-2.5.so
> 350d600000-350d6e6000 r-xp 00000000 fd:01 1836571
> /usr/lib64/libstdc++.so.6.0.8
> 350d6e6000-350d8e5000 ---p 000e6000 fd:01 1836571
> /usr/lib64/libstdc++.so.6.0.8
> 350d8e5000-350d8eb000 r--p 000e5000 fd:01 1836571
> /usr/lib64/libstdc++.so.6.0.8
> 350d8eb000-350d8ee000 rw-p 000eb000 fd:01 1836571
> /usr/lib64/libstdc++.so.6.0.8
> 350d8ee000-350d900000 rw-p 350d8ee000 00:00 0
> 2b658562b000-2b658562d000 rw-p 2b658562b000 00:00 0
> 2b658562d000-2b6586454000 r-xp 00000000 00:2c 6080016290
> /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install
>
> //src//.libs/libfastbit.so.0.0.9
> 2b6586454000-2b6586653000 ---p 00e27000 00:2c 6080016290
> /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install
>
> //src//.libs/libfastbit.so.0.0.9
> 2b6586653000-2b6586669000 rw-p 00e26000 00:2c 6080016290
> /net/isilon3/production/seqdb/embl/developer/rasko/fastbit/fastbit-ibis1.2.0_install
>
> //src//.libs/libfastbit.so.0.0.9
> 2b6586682000-2b6586686000 rw-p 2b6586682000 00:00 0
> 2b6586686000-2b65866bb000 r--s 00000000 fd:03 229378 /var/db/nscd/passwd
> 2b65866bb000-2b65866bc000 rw-p 2b65866bb000 00:00 0
> 2b65876bd000-2b6587781000 rw-p 2b65876bd000 00:00 0
> 7fff2452b000-7fff24540000 rw-p 7ffffffea000 00:00 0 [stack]
> ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
> Warning -- part::readMetaData found 5 columns, but 4 were expected
> //homes/rasko//.lsbatch/1285084434.36561: line 8: 31364 Aborted (core dumped)
> ../fastbit-ibis1.2.0_install/examples/ibis -b 1 -d test_1000000_100000_0.0005_text
>
>
>
>
> The full content of -part.txt is:
>
>
>
>
> # metadata file written by ibis::part::writeMetaData
> # on Tue Sep 21 15:51:43 2010 UTC
>
> BEGIN HEADER
> Name = "test_1000000_100000_0.0005_text"
> Description = "Data initially wrote with ibis::tablex interface on Tue
> Sep 21 16:50:42 2010 with 4 columns and 100000 rows"
> Number_of_columns = 4
> Number_of_rows = 1000000
> Timestamp = 1285084256
> State = 1
> END HEADER
>
> Begin Column
> name = "acc"
> description = acc
> data_type = "TEXT"
> index=noindex
> End Column
>
> Begin Column
> name = "dataclass"
> description = dataclass = PAT, HTG, CON, STS, HTC, TSA, GSS, WGS, EST
> data_type = "CATEGORY"
> minimum = 0
> maximum = 1000000
> End Column
>
> Begin Column
> name = "env"
> description = env = Y
> data_type = "CATEGORY"
> minimum = 0
> maximum = 1000000
> End Column
>
> Begin Column
> name = "moltype"
> description = moltype = genomic DNA, mRNA, genomic RNA, tRNA, rRNA,
> other DNA, other RNA, unassigned DNA, unassigned RNA, ..., transcribed
> RNA
> data_type = "CATEGORY"
> minimum = 0
> maximum = 1000000
> End Column
>
> Begin Column
> name = "keyword"
> description = keyword
> data_type = "TEXT"
> index=keywords, docidname=acc
> End Column
>
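The warning in the log above ("found 5 columns, but 4 were expected") matches this file: the header declares 4 columns while the file contains 5 column blocks. Whether that mismatch has anything to do with the crash is not established here. As a small sanity check, the following sketch compares the declared and actual counts; the function name `check_column_count` is hypothetical and the parsing only assumes the layout shown in the file above:

```python
import re


def check_column_count(text):
    """Return (declared, found) column counts for a -part.txt style file."""
    declared = None
    found = 0
    for line in text.splitlines():
        stripped = line.strip()
        match = re.match(r"Number_of_columns\s*=\s*(\d+)", stripped, re.IGNORECASE)
        if match:
            declared = int(match.group(1))
        elif stripped.lower() == "begin column":
            found += 1
    return declared, found


# Abbreviated sample in the same layout as the file above.
sample = """\
BEGIN HEADER
Number_of_columns = 1
Number_of_rows = 10
END HEADER

Begin Column
name = "acc"
data_type = "TEXT"
End Column

Begin Column
name = "keyword"
data_type = "TEXT"
End Column
"""

declared, found = check_column_count(sample)
if declared != found:
    print(f"header declares {declared} columns, file contains {found}")
```

To run it against a real file, pass the result of `open(path).read()` to `check_column_count`.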
>
>
>
> Any help is very much appreciated.
>
>
>
>
>
> Rasko Leinonen
> European Nucleotide Archive (ENA)
> EMBL-EBI
>
>
>
>
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users