Forwarding to the list. ~Josh.
Begin forwarded message:
> From: pytables-users-boun...@lists.sourceforge.net
> Date: November 18, 2011 2:02:15 PM GMT+01:00
> To: pytables-users-ow...@lists.sourceforge.net
> Subject: Auto-discard notification
>
> The attached message has been automatically discarded.
> From: Matthew Care <matthew.c...@gmail.com>
> Date: November 18, 2011 2:02:04 PM GMT+01:00
> To: PyTables UserList <pytables-users@lists.sourceforge.net>
> Subject: Slow data retrieve
>
>
> Hi All,
>
> I have a very simple data structure for storing genome data; basically, for
> each table (chromosome) I have the following structure:
>
> from tables import IsDescription, StringCol
>
> class BaseInfo(IsDescription):
>     base            = StringCol(1)
>     phastMammal     = StringCol(6)
>     phastPrimate    = StringCol(6)
>     phastVertebrate = StringCol(6)
>     phyloMammal     = StringCol(6)
>     phyloPrimate    = StringCol(6)
>     phyloVertebrate = StringCol(6)
>
> Each table's chunk size is set to the length of the chromosome.
>
> Thus for each base in the genome I have 7 different bits of information, an
> example of this is:
> A 0.034 0.002 0.002 0.836 1.072 1.072
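>
> In case the layout is unclear, a line like the one above can be split into
> the seven BaseInfo fields like so (note: the mapping of columns onto field
> names is my assumption from the field order in the class, and
> parse_base_line is just an illustrative helper, not part of my code):

```python
# Field names in the order they appear in the BaseInfo description above
# (which column maps to which conservation score is an assumption).
FIELDS = ["base", "phastMammal", "phastPrimate", "phastVertebrate",
          "phyloMammal", "phyloPrimate", "phyloVertebrate"]

def parse_base_line(line):
    """Split one whitespace-separated genome line into a BaseInfo-style dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

row = parse_base_line("A 0.034 0.002 0.002 0.836 1.072 1.072")
```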
>
> The total structure of my h5 file looks like this:
>
> I:\#Databases\h5Databases\genomeAnnotations.h5 (File) 'Genome Annotations Database'
> Last modif.: 'Fri Jan 28 17:29:31 2011'
> Object Tree:
> / (RootGroup) 'Genome Annotations Database'
> /Human36release (Group) 'Human 36 Release (hg18)'
> /Human36release/baseAndConservation (Group) 'Folder for base and conservation info'
> /Human36release/baseAndConservation/chr1 (Table(247249720,), shuffle, lzo(1)) 'Table for chr1'
> /Human36release/baseAndConservation/chr10 (Table(135374738,), shuffle, lzo(1)) 'Table for chr10'
> /Human36release/baseAndConservation/chr11 (Table(134452385,), shuffle, lzo(1)) 'Table for chr11'
> /Human36release/baseAndConservation/chr12 (Table(132349535,), shuffle, lzo(1)) 'Table for chr12'
> /Human36release/baseAndConservation/chr13 (Table(114142981,), shuffle, lzo(1)) 'Table for chr13'
> /Human36release/baseAndConservation/chr14 (Table(106368586,), shuffle, lzo(1)) 'Table for chr14'
> /Human36release/baseAndConservation/chr15 (Table(100338916,), shuffle, lzo(1)) 'Table for chr15'
> /Human36release/baseAndConservation/chr16 (Table(88827255,), shuffle, lzo(1)) 'Table for chr16'
> /Human36release/baseAndConservation/chr17 (Table(78774743,), shuffle, lzo(1)) 'Table for chr17'
> /Human36release/baseAndConservation/chr18 (Table(76117154,), shuffle, lzo(1)) 'Table for chr18'
> /Human36release/baseAndConservation/chr19 (Table(63811652,), shuffle, lzo(1)) 'Table for chr19'
> /Human36release/baseAndConservation/chr2 (Table(242951150,), shuffle, lzo(1)) 'Table for chr2'
> /Human36release/baseAndConservation/chr20 (Table(62435965,), shuffle, lzo(1)) 'Table for chr20'
> /Human36release/baseAndConservation/chr21 (Table(46944324,), shuffle, lzo(1)) 'Table for chr21'
> /Human36release/baseAndConservation/chr22 (Table(49691433,), shuffle, lzo(1)) 'Table for chr22'
> /Human36release/baseAndConservation/chr3 (Table(199501828,), shuffle, lzo(1)) 'Table for chr3'
> /Human36release/baseAndConservation/chr4 (Table(191273064,), shuffle, lzo(1)) 'Table for chr4'
> /Human36release/baseAndConservation/chr5 (Table(180857867,), shuffle, lzo(1)) 'Table for chr5'
> /Human36release/baseAndConservation/chr6 (Table(170899993,), shuffle, lzo(1)) 'Table for chr6'
> /Human36release/baseAndConservation/chr7 (Table(158821425,), shuffle, lzo(1)) 'Table for chr7'
> /Human36release/baseAndConservation/chr8 (Table(146274827,), shuffle, lzo(1)) 'Table for chr8'
> /Human36release/baseAndConservation/chr9 (Table(140273253,), shuffle, lzo(1)) 'Table for chr9'
> /Human36release/baseAndConservation/chrX (Table(154913755,), shuffle, lzo(1)) 'Table for chrX'
> /Human36release/baseAndConservation/chrY (Table(57772955,), shuffle, lzo(1)) 'Table for chrY'
>
> As you can see, this is obviously quite a large h5 file (roughly 35 GB).
>
> The problem is that I don't think I'm retrieving data from this as fast as I
> could. What I'd like to do is retrieve the information for a set of
> chromosomal regions (basically positions).
>
> My current method for doing this is below (for now, ignore the fact that it
> sorts the data into chromosomes; that is for future attempts to speed this
> up):
>
>
> import os
> from tables import Filters, openFile
>
> def getDNAconservationMulti(hdf5FileLoc, hdf5FileName, pathToData,
>                             justDNA=False, complib="lzo", complevel=1,
>                             shuffle=True, fletcher32=False):
>     """
>     A factory function: returns a function that, given a set of chromosomal
>     coordinates, will return their DNA sequence and conservation values.
>     """
>     h5file = openFile(os.path.join(hdf5FileLoc, hdf5FileName), mode="r",
>                       filters=Filters(complevel=complevel, complib=complib,
>                                       shuffle=shuffle,
>                                       fletcher32=fletcher32))
>     # Build a dict of pointers to the chromosome tables
>     tablePointers = {}
>     for group in h5file.walkGroups(pathToData):
>         for table in h5file.listNodes(group, classname='Table'):
>             tablePointers[table.name] = table
>
>     ########################################################################
>     # Processing function
>     def getData(queries, ITEM_SPACER="\t", CONSERVATION_SPACER=" "):
>         """
>         Given a set of regions stored in an array, each in the format
>         Chromosome\tstart\tend, split the batch query into separate
>         chromosomes and fetch their data from the PyTables file. The
>         information is then returned in the original order so that it is
>         easy to append to the original query data.
>         """
>         # Tag each query with its position so the original ordering can
>         # be restored later
>         queryWithPos = []
>         for pos, query in enumerate(queries):
>             queryWithPos.append(query + "\t" + str(pos))
>         # Order the queries by chromosome
>         chromosomeKey = make_key_by_chromosome()
>         sortedQueries = sorted(queryWithPos, key=chromosomeKey)
>         unsortedReturnData = []
>         for q in sortedQueries:
>             chromosome, start, end, pos = q.split("\t")
>             dnaS = chromosome + ":" + start + "-" + end
>             start, end = int(start), int(end)
>             (bases, phastPr, phastMa, phastVe,
>              phyloPr, phyloMa, phyloVe) = [""] * 7
>             for i in tablePointers[chromosome][start:end + 1]:
>                 bases += i["base"]
>                 phastPr += CONSERVATION_SPACER + i["phastPrimate"]
>                 phastMa += CONSERVATION_SPACER + i["phastMammal"]
>                 phastVe += CONSERVATION_SPACER + i["phastVertebrate"]
>                 phyloPr += CONSERVATION_SPACER + i["phyloPrimate"]
>                 phyloMa += CONSERVATION_SPACER + i["phyloMammal"]
>                 phyloVe += CONSERVATION_SPACER + i["phyloVertebrate"]
>             # Trim the leading spacer (str.strip returns a new string, so
>             # the stripped values must be assigned back)
>             (phastPr, phastMa, phastVe,
>              phyloPr, phyloMa, phyloVe) = [
>                 x.strip(CONSERVATION_SPACER)
>                 for x in (phastPr, phastMa, phastVe,
>                           phyloPr, phyloMa, phyloVe)]
>             # Join the data together
>             dnaS = (dnaS + ITEM_SPACER + bases + ITEM_SPACER +
>                     ITEM_SPACER.join([phastPr, phastMa, phastVe,
>                                       phyloPr, phyloMa, phyloVe]))
>             unsortedReturnData.append(pos + ITEM_SPACER + dnaS)
>         # Reorder the return data back to the original order
>         returnData = []
>         naturalSortKey = make_key_embedded_numbers()
>         for q in sorted(unsortedReturnData, key=naturalSortKey):
>             cols = q.split(ITEM_SPACER)
>             # Strip off the column used for ordering
>             returnData.append(ITEM_SPACER.join(cols[1:]))
>         return returnData
>
>     return getData
>
>
> Currently this method takes around 30 seconds (on a fairly fast computer) to
> retrieve the 7 values for 1000 regions that have a length of 1000, thus 1
> million locations and 7 million individual values. Is this approach the
> fastest that I can expect? The real problem is that I'll often want to
> retrieve this information for many (100,000) small regions, which would
> take roughly 50 minutes. Am I expecting the impossible in wanting this to
> run faster? If so, I'll move on and worry about other problems.
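>
> For what it's worth, I suspect part of the cost is the per-row loop with
> repeated string concatenation. Here is a toy stand-in (a plain Python list
> of dicts instead of a real PyTables table, and made-up helper names) that
> contrasts the per-row approach with slicing the whole region once and
> joining:

```python
# Toy stand-in for one chromosome table: a list of row dicts with
# illustrative values (a real table would hold all seven fields).
table = [{"base": b, "phastPrimate": "%.3f" % (0.001 * i)}
         for i, b in enumerate("ACGTACGTAC")]

def region_per_row(table, start, end, spacer=" "):
    """Mimics the per-row loop above: repeated string concatenation."""
    bases, phast = "", ""
    for row in table[start:end + 1]:
        bases += row["base"]
        phast += spacer + row["phastPrimate"]
    return bases, phast.strip(spacer)

def region_sliced(table, start, end, spacer=" "):
    """One slice, then str.join: avoids repeated reallocation."""
    rows = table[start:end + 1]
    bases = "".join(r["base"] for r in rows)
    phast = spacer.join(r["phastPrimate"] for r in rows)
    return bases, phast
```

> (With a real table, something like tablePointers[chromosome].read(start,
> end + 1) would similarly hand back the whole slice in one call rather than
> row by row; treat that as a sketch rather than a drop-in fix.)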
>
> Any help would be greatly appreciated.
>
> M
>