Re: [Bioc-sig-seq] PDict question

Stephen Henderson Tue, 03 Jun 2008 12:13:20 -0700

 
quoting Joseph
 
"Thank you for the fast response. Can you direct me to the
instructions on how to compile a 64-bit version of R for
the Mac? I've never done any compiling before."


Hi Joseph, the advice here about debugging and trying Linux is good, but if you 
don't know how to compile R then I assume you have installed the 32-bit Mac 
binaries from CRAN? If so, your immediate problem is that a 32-bit R process 
can address only about 3-4 GB of memory, however much RAM you have installed. 
That is probably a bit short for Illumina PDicts.
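
A quick way to confirm which kind of build is running is sketched below. It 
assumes a reasonably recent R in which .Machine$sizeof.pointer is available; 
the variable names are only illustrative.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build.
ptr_bytes <- .Machine$sizeof.pointer
is_64bit  <- ptr_bytes == 8

# Size of a 32-bit address space in bytes (2^32, i.e. 4 GiB). A 32-bit
# process cannot address more than this, whatever RAM is installed.
addr_space_32bit <- 2^32

cat("R build:        ", if (is_64bit) "64-bit" else "32-bit", "\n")
cat("Platform string:", R.version$platform, "\n")
```

The platform string is the same value that sessionInfo() prints on its second 
line, so an "i386-..." value there points to a 32-bit build.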
 
If you do want to build a 64-bit version from source, you will need to install 
some extra tools, and I think you will also have to build some R packages from 
source. You can read some instructions starting here:
 
http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#What-machines-does-R-for-Mac-OS-X-run-on_003f
 
Alternatively, you could try the 64-bit binaries of R 2.8 (devel) from here:
 
http://r.research.att.com/
 
which I'm guessing should work OK with Biostrings?
 
Stephen

________________________________

From: [EMAIL PROTECTED] on behalf of [EMAIL PROTECTED]
Sent: Tue 03/06/2008 19:36
To: Joseph Dhahbi, P.h.D.
Cc: [email protected]
Subject: Re: [Bioc-sig-seq] PDict question



Hi Joseph,

You could run PDict() in debug mode by calling:

   > Biostrings:::debug_ACtree_utils()

first and then try to run your example again and you would see
something like this:

   > NM_seq_pDict=PDict(NM_seq_clean)
   [DEBUG] alloc_actree_nodes_buf(): length=4817537 width=36 maxnodes=126030830
   [DEBUG] alloc_actree_nodes_buf(): allocating actree_nodes_buf (bufsize=4032986560) ... OK
   [DEBUG] CWdna_free_actree_nodes_buf(): freeing actree_nodes_buf ... OK

This indicates that PDict() needs to allocate a temporary buffer (the
actree_nodes_buf C variable) of about 4GB to build the Aho-Corasick
tree. This buffer has exactly the same size as an integer vector of
length 1008246640. Can you allocate such a vector? Try:

   > x <- integer(1008246640)

Given that you have 20GB of RAM, this should work, unless something
is wrong with your R installation...
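
The numbers in the debug output can be checked with a little arithmetic. The
sketch below uses only the values printed above; the variable names are
illustrative, and the remark about headroom assumes a 32-bit build of R,
which the output above does not by itself establish.

```r
# Values copied from the [DEBUG] output above.
maxnodes <- 126030830
bufsize  <- 4032986560

# Plain doubles are used on purpose: these byte counts exceed
# .Machine$integer.max, so integer arithmetic would overflow to NA.
bytes_per_node <- bufsize / maxnodes   # bytes reserved per tree node
int_vector_len <- bufsize / 4          # an R integer takes 4 bytes
bufsize_gib    <- bufsize / 2^30       # buffer size in GiB

# Headroom left in a 4 GiB (32-bit) address space after the buffer alone.
headroom_mib <- (2^32 - bufsize) / 2^20

cat("bytes per node:       ", bytes_per_node, "\n")
cat("integer vector length:", format(int_vector_len, scientific = FALSE), "\n")
cat("buffer size (GiB):    ", round(bufsize_gib, 2), "\n")
cat("32-bit headroom (MiB):", round(headroom_mib), "\n")
```

The buffer is just under 2^32 bytes, so in a 32-bit process it would leave
only a couple of hundred MiB of address space for R itself, the input
sequences and everything else, regardless of how much RAM is installed.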

More about the "fixed-size temporary buffer" approach:

The size of this buffer is chosen so that it is guaranteed to be big
enough to store the entire Aho-Corasick tree without reallocation. The
real size of this tree may in fact be smaller (sometimes much smaller)
than the size of the temporary buffer, but AFAICS there is no easy way
to know this in advance.
The real size of the tree (in bytes) can be obtained with:

   > length([EMAIL PROTECTED]) * 32

Note that the formula used to compute the size of the buffer depends only
on the length and width of the input dictionary, and that this formula
is an optimal a priori estimate in the sense that some dictionaries of
that length and width will fill up the temp buffer entirely.
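
As an illustration, the maxnodes value in the debug output can be reproduced
by a simple counting argument: at depth d, a tree over the 4-letter DNA
alphabet can hold at most min(4^d, length) nodes. This is a reconstruction
that happens to match the printed value, not a quote from the Biostrings C
code, and the helper name max_actree_nodes is hypothetical.

```r
# Hypothetical helper (not part of Biostrings): worst-case node count of a
# tree over a 4-letter alphabet holding `n` patterns of constant `width`.
# Depth 0 is the root; at depth d there are at most min(4^d, n) nodes.
max_actree_nodes <- function(n, width, alphabet_size = 4) {
  depths <- 0:width
  sum(pmin(alphabet_size^depths, n))
}

# Length and width taken from the debug output above.
maxnodes <- max_actree_nodes(4817537, 36)
bufsize  <- maxnodes * 32   # 32 bytes per node, as in the debug output

cat("maxnodes:", format(maxnodes, scientific = FALSE), "\n")
cat("bufsize: ", format(bufsize, scientific = FALSE), "\n")
```

The bound is reached only when the patterns share as few prefixes as
possible, which is why real data usually needs fewer nodes.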

We chose to use a fixed-size temporary buffer for the construction
of the AC tree because we wanted to make PDict() as fast as possible,
at the cost of a higher memory requirement. The current approach
is not set in stone though and we might change this in the future.
Maybe a better approach would be a compromise: choose a buffer size
that is 50% of the best a priori estimate and do 1 reallocation if the
temp buffer happens to be too small, in the hope that this will be a
rare situation with real-world data.
But more expertise will be needed before we can choose the right ratio
(50% ? 25% ? 75% ?...)
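
To make the trade-off concrete, here is a minimal sketch of such a policy in
plain R. Everything in it is an assumption made for illustration: the helper
name plan_buffer, the two example node counts, and in particular the idea
that growing the buffer copies it into a new block, so that the old and new
blocks coexist for a moment.

```r
# Hypothetical planner (not Biostrings code): start with `ratio` of the
# worst-case node count and grow once to the full bound if needed.
plan_buffer <- function(maxnodes, actual_nodes, ratio = 0.5, node_bytes = 32) {
  stopifnot(actual_nodes <= maxnodes, ratio > 0, ratio <= 1)
  initial <- ceiling(maxnodes * ratio)
  grows   <- actual_nodes > initial
  # Assumption: growing copies into a new block, so the peak footprint is
  # the sum of the old and the new buffer.
  peak_nodes <- if (grows) initial + maxnodes else initial
  list(initial_bytes = initial * node_bytes,
       reallocations = as.integer(grows),
       peak_bytes    = peak_nodes * node_bytes)
}

# Made-up tree sizes, one below and one above the 50% mark.
small <- plan_buffer(126030830, actual_nodes = 40e6)
large <- plan_buffer(126030830, actual_nodes = 100e6)

cat("small tree: reallocations =", small$reallocations,
    " peak GiB =", round(small$peak_bytes / 2^30, 2), "\n")
cat("large tree: reallocations =", large$reallocations,
    " peak GiB =", round(large$peak_bytes / 2^30, 2), "\n")
```

Under these assumptions the common case halves the footprint, while the
unlucky case briefly needs about 1.5 times the current buffer, which is one
reason the choice of ratio deserves care.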

Cheers,
H.

Quoting "Joseph Dhahbi, P.h.D." <[EMAIL PROTECTED]>:

> Hello
> I need help getting around the memory error reported below,
> especially since I cannot add any more RAM:
> Here is the Hardware Overview:
>   Model Name: Mac Pro
>   Model Identifier:   MacPro1,1
>   Processor Name:     Dual-Core Intel Xeon
>   Processor Speed:    2.66 GHz
>   Number Of Processors:       2
>   Total Number Of Cores:      4
>   L2 Cache (per processor):   4 MB
>   Memory:     20 GB
>   Bus Speed:  1.33 GHz
>   Boot ROM Version:   MP11.005C.B08
>   SMC Version:        1.7f10
>   Serial Number:      G87052SGUPZ
>
>
>
>> NM_seq=readSolexaFastA(NM_fa)
>> NM_alf=alphabetFrequency(NM_seq, baseOnly=TRUE)
>> NM_seq_clean = NM_seq[NM_alf[,"other"]==0]
>> length(NM_seq)
> [1] 4820218
>> length(NM_seq_clean)
> [1] 4817537
>> NM_seq_clean
>   A DNAStringSet instance of length 4817537
>           width seq
>       [1]    36 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT
>       [2]    36 GTGGTAATTCATCAGATCTCGGATGGCATTGGTCAT
>       [3]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
>       [4]    36 GGGATTGGTTTTTTGTTACTGATTTGTTTGAGTTCA
>       [5]    36 GTGGTAATTTTGACTTTTTAGGTTAATTTATTTTTT
>       [6]    36 GATCGGAAGGAGCTCGTATGCCGTCTTCTGCTTAGA
>       [7]    36 GGTCAGTTGTGTTCTCCTGAGTAGGTTGTGTGAATG
>       [8]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
>       [9]    36 GGGAGGCTGAGGCAGGAGAATGGCATGAACCTAGAT
>       ...   ... ...
> [4817529]    36 TTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAG
> [4817530]    36 CATCAATGTATCTTAAGGCGTAAATTGTAAGCGTTA
> [4817531]    36 CGAGCAGCGACGCATCACCCAGCTAGATCGGAAGAG
> [4817532]    36 GCAATGCCACTGGCGCGACAACCGGGACACCATAGG
> [4817533]    36 CCTCGCCGGACACGCTGAACTTGTGGCCGTTTTCGT
> [4817534]    36 CCATTGTACAACGTATCGACATATCCTCCACCCGCC
> [4817535]    36 CCCCCTGAACCTGAAACATAAAATGAATGCAATTGT
> [4817536]    36 ACCATGTTGTCCAAGGGCGAATTCTGCAGATATCCA
> [4817537]    36 CAGGGGCCGGCGGCTGGCTAGGGCTGCAGCGTTAAA
>
>> NM_seq_pDict=PDict(NM_seq_clean)
> Error in .PDict(dict, names(dict), tb.start, tb.end, drop.head, drop.tail,  :
>   alloc_actree_nodes_buf(): failed to alloc actree_nodes_buf
> R(433,0xa000d000) malloc: *** vm_allocate(size=4032987136) failed (error code=3)
> R(433,0xa000d000) malloc: *** error: can't allocate region
> R(433,0xa000d000) malloc: *** set a breakpoint in szone_error to debug
>
>> sessionInfo()
> R version 2.7.0 (2008-04-22)
> i386-apple-darwin8.10.1
>
> locale:
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils    datasets  methods   base
>
> other attached packages:
> [1] BiostringsCinterfaceDemo_0.1.2 Biostrings_2.8.9               
> Biobase_2.0.1
>
>
>
>
> Regards,
> Joseph
>
> Joseph M. Dhahbi, PhD
> Childrens Hospital Oakland Research Institute
> 5700 Martin Luther King Jr. Way
> Oakland, CA 94609
> USA
> Ph.(510)428-3885 EXT.5743
> Cell.(702)335-0795
> Fax (510)450-7910
> [EMAIL PROTECTED]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing