Forwarding to the list. ~Josh.

Begin forwarded message:

> From: pytables-users-boun...@lists.sourceforge.net
> Date: November 18, 2011 2:02:15 PM GMT+01:00
> To: pytables-users-ow...@lists.sourceforge.net
> Subject: Auto-discard notification
> 
> The attached message has been automatically discarded.
> From: Matthew Care <matthew.c...@gmail.com>
> Date: November 18, 2011 2:02:04 PM GMT+01:00
> To: PyTables UserList <pytables-users@lists.sourceforge.net>
> Subject: Slow data retrieve
> 
> 
> Hi All,
> 
> I have a very simple data structure for storing genome data: for each table
> (one per chromosome) I use the following description:
> 
>     class BaseInfo(IsDescription):
> 
>         base = StringCol(1)
>         phastMammal = StringCol(6)
>         phastPrimate = StringCol(6)
>         phastVertebrate = StringCol(6)
>         phyloMammal = StringCol(6)
>         phyloPrimate = StringCol(6)
>         phyloVertebrate = StringCol(6)
> 
> Each table's chunk size is set to the length of the chromosome.
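> 
> For reference, a minimal sketch of how a table like this could be created with
> the BaseInfo class above (group names, titles and the chr1 row count come from
> the file listing below; the chunkshape argument is only an illustration of
> setting the chunk size to the chromosome length, not my exact creation code):
> 
> from tables import openFile, Filters
> 
> CHR1_LEN = 247249720  # number of rows in chr1, from the listing below
> h5 = openFile("genomeAnnotations.h5", mode="w",
>               title="Genome Annotations Database",
>               filters=Filters(complevel=1, complib="lzo", shuffle=True))
> grp = h5.createGroup("/", "Human36release", "Human 36 Release (hg18)")
> sub = h5.createGroup(grp, "baseAndConservation",
>                      "Folder for base and conservation info")
> chr1 = h5.createTable(sub, "chr1", BaseInfo, "Table for chr1",
>                       expectedrows=CHR1_LEN,
>                       chunkshape=(CHR1_LEN,))  # one chunk = whole chromosome
> h5.close()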
> 
> Thus for each base in the genome I have 7 pieces of information; an example
> row looks like this:
> A     0.034   0.002   0.002   0.836   1.072   1.072
> 
> The overall structure of my h5 file looks like this:
> 
> I:\#Databases\h5Databases\genomeAnnotations.h5 (File) 'Genome Annotations Database'
> Last modif.: 'Fri Jan 28 17:29:31 2011'
> Object Tree:
> / (RootGroup) 'Genome Annotations Database'
> /Human36release (Group) 'Human 36 Release (hg18)'
> /Human36release/baseAndConservation (Group) 'Folder for base and conservation info'
> /Human36release/baseAndConservation/chr1 (Table(247249720,), shuffle, lzo(1)) 'Table for chr1'
> /Human36release/baseAndConservation/chr10 (Table(135374738,), shuffle, lzo(1)) 'Table for chr10'
> /Human36release/baseAndConservation/chr11 (Table(134452385,), shuffle, lzo(1)) 'Table for chr11'
> /Human36release/baseAndConservation/chr12 (Table(132349535,), shuffle, lzo(1)) 'Table for chr12'
> /Human36release/baseAndConservation/chr13 (Table(114142981,), shuffle, lzo(1)) 'Table for chr13'
> /Human36release/baseAndConservation/chr14 (Table(106368586,), shuffle, lzo(1)) 'Table for chr14'
> /Human36release/baseAndConservation/chr15 (Table(100338916,), shuffle, lzo(1)) 'Table for chr15'
> /Human36release/baseAndConservation/chr16 (Table(88827255,), shuffle, lzo(1)) 'Table for chr16'
> /Human36release/baseAndConservation/chr17 (Table(78774743,), shuffle, lzo(1)) 'Table for chr17'
> /Human36release/baseAndConservation/chr18 (Table(76117154,), shuffle, lzo(1)) 'Table for chr18'
> /Human36release/baseAndConservation/chr19 (Table(63811652,), shuffle, lzo(1)) 'Table for chr19'
> /Human36release/baseAndConservation/chr2 (Table(242951150,), shuffle, lzo(1)) 'Table for chr2'
> /Human36release/baseAndConservation/chr20 (Table(62435965,), shuffle, lzo(1)) 'Table for chr20'
> /Human36release/baseAndConservation/chr21 (Table(46944324,), shuffle, lzo(1)) 'Table for chr21'
> /Human36release/baseAndConservation/chr22 (Table(49691433,), shuffle, lzo(1)) 'Table for chr22'
> /Human36release/baseAndConservation/chr3 (Table(199501828,), shuffle, lzo(1)) 'Table for chr3'
> /Human36release/baseAndConservation/chr4 (Table(191273064,), shuffle, lzo(1)) 'Table for chr4'
> /Human36release/baseAndConservation/chr5 (Table(180857867,), shuffle, lzo(1)) 'Table for chr5'
> /Human36release/baseAndConservation/chr6 (Table(170899993,), shuffle, lzo(1)) 'Table for chr6'
> /Human36release/baseAndConservation/chr7 (Table(158821425,), shuffle, lzo(1)) 'Table for chr7'
> /Human36release/baseAndConservation/chr8 (Table(146274827,), shuffle, lzo(1)) 'Table for chr8'
> /Human36release/baseAndConservation/chr9 (Table(140273253,), shuffle, lzo(1)) 'Table for chr9'
> /Human36release/baseAndConservation/chrX (Table(154913755,), shuffle, lzo(1)) 'Table for chrX'
> /Human36release/baseAndConservation/chrY (Table(57772955,), shuffle, lzo(1)) 'Table for chrY'
> 
> As you can see, this is quite a large HDF5 file (roughly 35 GB).
> 
> The problem is that I don't think I'm retrieving data from it as fast as I
> could.  What I'd like to do is retrieve the information for a set of
> chromosomal regions (basically ranges of positions).
> 
> My current method for doing this is below (for now, ignore the fact that it
> sorts the queries by chromosome; that is groundwork for future attempts to
> speed this up):
> 
> 
> import os
> from tables import openFile, Filters
> 
> def getDNAconservationMulti(hdf5FileLoc, hdf5FileName, pathToData,
>                             justDNA=False, complib="lzo", complevel=1,
>                             shuffle=True, fletcher32=False):
>     """
>     A factory function: returns a function that, given a set of chromosomal
>     coordinates, will return their DNA sequence and conservation values.
>     """
>     h5file = openFile(os.path.join(hdf5FileLoc, hdf5FileName), mode="r",
>                       filters=Filters(complevel=complevel, complib=complib,
>                                       shuffle=shuffle, fletcher32=fletcher32))
>     #  Build a dict of pointers to the chromosome tables, keyed by table name
>     tablePointers = {}
>     for group in h5file.walkGroups(pathToData):
>         for table in h5file.listNodes(group, classname='Table'):
>             tablePointers[table.name] = table
> 
>     ###########################################################################
>     #  Processing function
>     def getData(queries, ITEM_SPACER="\t", CONSERVATION_SPACER=" "):
>         """
>         Given a set of regions of the form Chromosome\tstart\tend stored in
>         an array, split the batch query by chromosome, fetch the data from
>         the PyTables file, and return it in the original order so that it is
>         easy to append to the original query data.
>         """
>         #  Tag each query with its position so the original order can be
>         #  restored later
>         queryWithPos = []
>         for pos, query in enumerate(queries):
>             queryWithPos.append(query + "\t" + str(pos))
>         #  Sort the queries by chromosome (key helper defined elsewhere)
>         chromosomeKey = make_key_by_chromosome()
>         sortedQueries = sorted(queryWithPos, key=chromosomeKey)
>         unsortedReturnData = []
>         for q in sortedQueries:
>             chromosome, start, end, pos = q.split("\t")
>             dnaS = chromosome + ":" + start + "-" + end
>             start, end = int(start), int(end)
>             (bases, phastPr, phastMa, phastVe,
>              phyloPr, phyloMa, phyloVe) = [""] * 7
>             #  Walk the region row by row and build up the output strings
>             for i in tablePointers[chromosome][start:end + 1]:
>                 bases = bases + i["base"]
>                 phastPr = phastPr + CONSERVATION_SPACER + i["phastPrimate"]
>                 phastMa = phastMa + CONSERVATION_SPACER + i["phastMammal"]
>                 phastVe = phastVe + CONSERVATION_SPACER + i["phastVertebrate"]
>                 phyloPr = phyloPr + CONSERVATION_SPACER + i["phyloPrimate"]
>                 phyloMa = phyloMa + CONSERVATION_SPACER + i["phyloMammal"]
>                 phyloVe = phyloVe + CONSERVATION_SPACER + i["phyloVertebrate"]
>             #  Trim the leading spacer (strings are immutable, so the stripped
>             #  values have to be assigned back)
>             (phastPr, phastMa, phastVe, phyloPr, phyloMa, phyloVe) = [
>                 x.strip(CONSERVATION_SPACER)
>                 for x in (phastPr, phastMa, phastVe, phyloPr, phyloMa, phyloVe)]
>             #  Join the data together
>             dnaS = (dnaS + ITEM_SPACER + bases + ITEM_SPACER +
>                     ITEM_SPACER.join([phastPr, phastMa, phastVe,
>                                       phyloPr, phyloMa, phyloVe]))
>             unsortedReturnData.append(pos + ITEM_SPACER + dnaS)
>         #  Reorder the return data back to the original order (key helper
>         #  defined elsewhere)
>         returnData = []
>         naturalSortKey = make_key_embedded_numbers()
>         for q in sorted(unsortedReturnData, key=naturalSortKey):
>             cols = q.split(ITEM_SPACER)
>             #  Strip off the column used for ordering
>             returnData.append(ITEM_SPACER.join(cols[1:]))
>         return returnData
>     return getData
> 
> 
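> A hypothetical example of calling the factory (the group path is the one from
> the listing above; the file location and coordinates are placeholders):
> 
> getData = getDNAconservationMulti(r"I:\#Databases\h5Databases",
>                                   "genomeAnnotations.h5",
>                                   "/Human36release/baseAndConservation")
> queries = ["chr1\t10000\t10999",    # chromosome\tstart\tend
>            "chrX\t50000\t50999"]
> for line in getData(queries):
>     print(line)
> 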
> Currently this method takes around 30 seconds (on a fairly fast computer) to
> retrieve the 7 values for 1,000 regions of length 1,000, i.e. 1 million
> locations and 7 million individual values.  Is this approach the fastest I
> can expect?  The real problem is that I'll often want to retrieve this
> information for many (100,000) small regions, which would take roughly 50
> minutes.  Am I expecting the impossible in wanting this to run faster?  If
> so, I'll move on and worry about other problems.
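> 
> For comparison, here is a minimal sketch of the same per-region read done as
> one slice plus NumPy column joins instead of row-by-row concatenation (field
> names are from the BaseInfo class above; I have not timed it on the full
> file, so whether it actually helps here is an open question):
> 
> def readRegion(table, start, end, spacer=" "):
>     """Read the region [start, end] in one slice and join the columns."""
>     rows = table[start:end + 1]        # one bulk read -> NumPy record array
>     bases = "".join(rows["base"])
>     conservation = [spacer.join(rows[name]) for name in
>                     ("phastPrimate", "phastMammal", "phastVertebrate",
>                      "phyloPrimate", "phyloMammal", "phyloVertebrate")]
>     return bases, conservation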
> 
> Any help would be greatly appreciated.
> 
> M
> 
