Hi Joseph,

You could run PDict() in debug mode by calling:

  > Biostrings:::debug_ACtree_utils()

first and then running your example again. You should see something
like this:

  > NM_seq_pDict=PDict(NM_seq_clean)
  [DEBUG] alloc_actree_nodes_buf(): length=4817537 width=36 maxnodes=126030830
  [DEBUG] alloc_actree_nodes_buf(): allocating actree_nodes_buf (bufsize=4032986560) ... OK
  [DEBUG] CWdna_free_actree_nodes_buf(): freeing actree_nodes_buf ... OK

This indicates that PDict() needs to allocate a temporary buffer (the
actree_nodes_buf C variable) of about 4GB to build the Aho-Corasick
tree.
This buffer has exactly the same size as an integer vector of length
1008246640. Can you allocate such a vector? Try:

  > x <- integer(1008246640)

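Given that you have 20GB of RAM, this should work in a 64-bit R. Note,
however, that your sessionInfo() reports i386-apple-darwin8.10.1, which
is a 32-bit build: a 32-bit process has only 4GB of address space in
total, so a single buffer of about 4GB is very unlikely to fit no matter
how much RAM is installed.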
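
As a quick sanity check, the numbers in the debug output can be related
to each other with a few lines of R. This is only a sketch: the values
are copied from the [DEBUG] lines above, the variable names are made up
for illustration, and the last part merely reports how large the buffer
is relative to the address space of the running R build (derived from
.Machine$sizeof.pointer). It does not predict whether the allocation
will actually succeed.

```r
# Values copied from the [DEBUG] output above (doubles, so no integer overflow)
maxnodes <- 126030830
bufsize  <- 4032986560                  # bytes

bytes_per_node <- bufsize / maxnodes    # size of one node in the buffer
int_length     <- bufsize / 4           # equivalent integer vector (4 bytes each)
bufsize_gib    <- bufsize / 2^30        # buffer size in GiB

# Pointer size of the running R build: 4 bytes = 32-bit, 8 bytes = 64-bit
pointer_bits <- 8 * .Machine$sizeof.pointer

# Share of the whole address space that this single buffer would take
fraction_of_address_space <- bufsize / 2^pointer_bits

cat("bytes per node:          ", bytes_per_node, "\n")
cat("integer vector length:   ", format(int_length, scientific = FALSE), "\n")
cat("buffer size (GiB):       ", round(bufsize_gib, 2), "\n")
cat("R build (bits):          ", pointer_bits, "\n")
cat("share of address space:  ", signif(fraction_of_address_space, 3), "\n")
```

On a 32-bit build the last figure comes out at roughly 0.94, i.e. the
buffer alone would need almost the entire address space of the process.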

More about the "fixed-size temporary buffer" approach:

The size of this buffer is chosen so that it is guaranteed to be big
enough to store the entire Aho-Corasick tree without any reallocation.
The real size of the tree may in fact be smaller (sometimes much
smaller) than the temporary buffer, but AFAICS there is no easy way to
know this in advance.
The real size of the tree (in bytes) can be obtained with:

  > length([EMAIL PROTECTED]) * 32

Note that the formula used to compute the size of the buffer depends only
on the length and width of the input dictionary. This formula is an
optimal a priori estimate in the sense that, for some dictionaries, the
tree will fill up the temp buffer entirely.

We chose a fixed-size temporary buffer for the construction of the AC
tree because we wanted to make PDict() as fast as possible, at the cost
of a higher memory requirement. The current approach is not set in stone
though, and we might change it in the future. Maybe a better approach
would be a compromise: choose a buffer size that is 50% of the best a
priori estimate and do one reallocation if the temp buffer turns out to
be too small, in the hope that this will be rare with real-world data.
But more experience will be needed before we can choose a good ratio
(50%? 25%? 75%? ...)
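
To make the compromise concrete, here is a toy sketch in R. It is not
Biostrings code (the real buffer lives in C), and new_node_buf(),
append_node() and the ratio argument are hypothetical names used only
for illustration. The buffer starts at a fraction of the a priori
bound and is grown exactly once, straight to the full bound, if it
fills up.

```r
# Hypothetical illustration only: not how Biostrings implements its buffer.

# Start with a buffer holding ratio * maxnodes slots.
new_node_buf <- function(maxnodes, ratio = 0.5) {
  list(nodes    = integer(ceiling(maxnodes * ratio)),
       used     = 0L,
       maxnodes = maxnodes,
       reallocs = 0L)
}

# Append one node; if the buffer is full, grow it once to the a priori bound.
append_node <- function(buf, value) {
  if (buf$used == length(buf$nodes)) {
    if (length(buf$nodes) >= buf$maxnodes)
      stop("tree is larger than the a priori bound")
    length(buf$nodes) <- buf$maxnodes   # the single reallocation
    buf$reallocs <- buf$reallocs + 1L
  }
  buf$used <- buf$used + 1L
  buf$nodes[buf$used] <- value
  buf
}

# Toy run: bound of 10 nodes, initial buffer of 5, tree of 7 nodes.
buf <- new_node_buf(maxnodes = 10, ratio = 0.5)
for (i in 1:7) buf <- append_node(buf, i)
cat("reallocations:", buf$reallocs, " final buffer length:", length(buf$nodes), "\n")
```

With a tree of 5 nodes or fewer the same run would need no reallocation
at all, which is the case the 50% ratio is betting on.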

Cheers,
H.

Quoting "Joseph Dhahbi, P.h.D." <[EMAIL PROTECTED]>:

Hello
I need help getting around the memory error reported below,
especially since I cannot add any more RAM.
Here is the Hardware Overview:
  Model Name:   Mac Pro
  Model Identifier:     MacPro1,1
  Processor Name:       Dual-Core Intel Xeon
  Processor Speed:      2.66 GHz
  Number Of Processors: 2
  Total Number Of Cores:        4
  L2 Cache (per processor):     4 MB
  Memory:       20 GB
  Bus Speed:    1.33 GHz
  Boot ROM Version:     MP11.005C.B08
  SMC Version:  1.7f10
  Serial Number:        G87052SGUPZ



NM_seq=readSolexaFastA(NM_fa)
NM_alf=alphabetFrequency(NM_seq, baseOnly=TRUE)
NM_seq_clean = NM_seq[NM_alf[,"other"]==0]
length(NM_seq)
[1] 4820218
length(NM_seq_clean)
[1] 4817537
NM_seq_clean
  A DNAStringSet instance of length 4817537
          width seq
      [1]    36 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT
      [2]    36 GTGGTAATTCATCAGATCTCGGATGGCATTGGTCAT
      [3]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
      [4]    36 GGGATTGGTTTTTTGTTACTGATTTGTTTGAGTTCA
      [5]    36 GTGGTAATTTTGACTTTTTAGGTTAATTTATTTTTT
      [6]    36 GATCGGAAGGAGCTCGTATGCCGTCTTCTGCTTAGA
      [7]    36 GGTCAGTTGTGTTCTCCTGAGTAGGTTGTGTGAATG
      [8]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
      [9]    36 GGGAGGCTGAGGCAGGAGAATGGCATGAACCTAGAT
      ...   ... ...
[4817529]    36 TTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAG
[4817530]    36 CATCAATGTATCTTAAGGCGTAAATTGTAAGCGTTA
[4817531]    36 CGAGCAGCGACGCATCACCCAGCTAGATCGGAAGAG
[4817532]    36 GCAATGCCACTGGCGCGACAACCGGGACACCATAGG
[4817533]    36 CCTCGCCGGACACGCTGAACTTGTGGCCGTTTTCGT
[4817534]    36 CCATTGTACAACGTATCGACATATCCTCCACCCGCC
[4817535]    36 CCCCCTGAACCTGAAACATAAAATGAATGCAATTGT
[4817536]    36 ACCATGTTGTCCAAGGGCGAATTCTGCAGATATCCA
[4817537]    36 CAGGGGCCGGCGGCTGGCTAGGGCTGCAGCGTTAAA

NM_seq_pDict=PDict(NM_seq_clean)
Error in .PDict(dict, names(dict), tb.start, tb.end, drop.head, drop.tail,  :
  alloc_actree_nodes_buf(): failed to alloc actree_nodes_buf
R(433,0xa000d000) malloc: *** vm_allocate(size=4032987136) failed
(error code=3)
R(433,0xa000d000) malloc: *** error: can't allocate region
R(433,0xa000d000) malloc: *** set a breakpoint in szone_error to debug

sessionInfo()
R version 2.7.0 (2008-04-22)
i386-apple-darwin8.10.1

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tools     stats     graphics  grDevices utils    datasets  methods   base

other attached packages:
[1] BiostringsCinterfaceDemo_0.1.2 Biostrings_2.8.9 Biobase_2.0.1




Regards,
Joseph

Joseph M. Dhahbi, PhD
Childrens Hospital Oakland Research Institute
5700 Martin Luther King Jr. Way
Oakland, CA 94609
USA
Ph.(510)428-3885 EXT.5743
Cell.(702)335-0795
Fax (510)450-7910
[EMAIL PROTECTED]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing