Hello Matthew,

I think the code can definitely be faster.  I am going to ask a series of
possibly silly questions, so bear with me.  This will help pin down where
the problem points are.

1) Have you profiled the code?  (I use
line_profiler <http://packages.python.org/line_profiler/>.)
Which lines are taking up the most execution time?
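If installing line_profiler is a pain, the standard library's cProfile/pstats
will give you a first cut.  A minimal sketch (`hotspot` here is just a made-up
stand-in for your retrieval code):

```python
import cProfile
import io
import pstats

def hotspot(n):
    # Deliberately slow: repeated string concatenation
    s = ""
    for i in range(n):
        s = s + str(i)
    return s

profiler = cProfile.Profile()
profiler.enable()
hotspot(10000)
profiler.disable()

# Dump the stats, most expensive (cumulative) calls first
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print(report)
```

line_profiler then tells you the same story per line rather than per
function, which is usually what you want for a loop like yours.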

2) Type conversion is expensive.  Your table description has 7 string
columns, but it seems that the last 6 are numerical.  You would save *a
lot* of time and space if you actually stored these as numbers.  (It is
staggering, really.)
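To make the size difference concrete, here is a rough sketch using NumPy
record dtypes as stand-ins for the PyTables column types (the PyTables
change would be swapping StringCol(6) for Float32Col):

```python
import numpy as np

score_names = ("phastMammal", "phastPrimate", "phastVertebrate",
               "phyloMammal", "phyloPrimate", "phyloVertebrate")

# Current layout: one 1-byte string plus six 6-byte strings per row
str_dtype = np.dtype([("base", "S1")] + [(n, "S6") for n in score_names])

# Numeric layout: store the six scores as 4-byte floats instead
num_dtype = np.dtype([("base", "S1")] + [(n, "f4") for n in score_names])

print(str_dtype.itemsize, num_dtype.itemsize)  # 37 vs 25 bytes per row
```

Across ~3 billion rows that saving adds up, and numeric columns also avoid
parsing strings into floats every time you read.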

3) Your start and end variables are being converted with int(); once again,
this conversion is likely not required.

4) Have you considered using NumPy?  It looks like you are storing things
in lists and then operating on those.  NumPy arrays will be much much much
faster.  (PyTables supports NumPy very well.)
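For example (a sketch: the rows below are made up, standing in for the
NumPy structured array that slicing a PyTables table returns):

```python
import numpy as np

# Stand-in for the structured array a table slice returns
rows = np.array(
    [(b"A", b"0.034"), (b"C", b"0.002"), (b"G", b"0.836")],
    dtype=[("base", "S1"), ("phastPrimate", "S6")],
)

# Whole-column operations replace the per-row Python loop
bases = b"".join(rows["base"]).decode()
phastPr = " ".join(v.decode() for v in rows["phastPrimate"])
print(bases, phastPr)  # ACG 0.034 0.002 0.836
```

Two column-wise joins instead of seven string concatenations per base will
cut out most of the per-row Python overhead.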

Hope this helps,
Be Well
Anthony

On Fri, Nov 18, 2011 at 7:15 AM, PyTables Org <pytab...@googlemail.com> wrote:

> Forwarding to the list. ~Josh.
>
> Begin forwarded message:
>
> *From: *pytables-users-boun...@lists.sourceforge.net
> *Date: *November 18, 2011 2:02:15 PM GMT+01:00
> *To: *pytables-users-ow...@lists.sourceforge.net
> *Subject: **Auto-discard notification*
>
> The attached message has been automatically discarded.
> *From: *Matthew Care <matthew.c...@gmail.com>
> *Date: *November 18, 2011 2:02:04 PM GMT+01:00
> *To: *PyTables UserList <pytables-users@lists.sourceforge.net>
> *Subject: **Slow data retrieve*
>
>
> Hi All,
>
> I have a very simple data structure for storing genome data: for each
> table (chromosome) I have the following description:
>
>     class BaseInfo(IsDescription):
>
>         base = StringCol(1)
>         phastMammal = StringCol(6)
>         phastPrimate = StringCol(6)
>         phastVertebrate = StringCol(6)
>         phyloMammal = StringCol(6)
>         phyloPrimate = StringCol(6)
>         phyloVertebrate = StringCol(6)
>
> Each table's chunk size is set to the length of the chromosome.
>
> Thus for each base in the genome I have 7 different bits of information;
> for example:
> A 0.034 0.002 0.002 0.836 1.072 1.072
>
> The total structure of my h5 file looks like this:
>
> I:\#Databases\h5Databases\genomeAnnotations.h5 (File) 'Genome Annotations
> Database'
> Last modif.: 'Fri Jan 28 17:29:31 2011'
> Object Tree:
> / (RootGroup) 'Genome Annotations Database'
> /Human36release (Group) 'Human 36 Release (hg18)'
> /Human36release/baseAndConservation (Group) 'Folder for base and
> conservation info'
> /Human36release/baseAndConservation/chr1 (Table(247249720,), shuffle,
> lzo(1)) 'Table for chr1'
> /Human36release/baseAndConservation/chr10 (Table(135374738,), shuffle,
> lzo(1)) 'Table for chr10'
> /Human36release/baseAndConservation/chr11 (Table(134452385,), shuffle,
> lzo(1)) 'Table for chr11'
> /Human36release/baseAndConservation/chr12 (Table(132349535,), shuffle,
> lzo(1)) 'Table for chr12'
> /Human36release/baseAndConservation/chr13 (Table(114142981,), shuffle,
> lzo(1)) 'Table for chr13'
> /Human36release/baseAndConservation/chr14 (Table(106368586,), shuffle,
> lzo(1)) 'Table for chr14'
> /Human36release/baseAndConservation/chr15 (Table(100338916,), shuffle,
> lzo(1)) 'Table for chr15'
> /Human36release/baseAndConservation/chr16 (Table(88827255,), shuffle,
> lzo(1)) 'Table for chr16'
> /Human36release/baseAndConservation/chr17 (Table(78774743,), shuffle,
> lzo(1)) 'Table for chr17'
> /Human36release/baseAndConservation/chr18 (Table(76117154,), shuffle,
> lzo(1)) 'Table for chr18'
> /Human36release/baseAndConservation/chr19 (Table(63811652,), shuffle,
> lzo(1)) 'Table for chr19'
> /Human36release/baseAndConservation/chr2 (Table(242951150,), shuffle,
> lzo(1)) 'Table for chr2'
> /Human36release/baseAndConservation/chr20 (Table(62435965,), shuffle,
> lzo(1)) 'Table for chr20'
> /Human36release/baseAndConservation/chr21 (Table(46944324,), shuffle,
> lzo(1)) 'Table for chr21'
> /Human36release/baseAndConservation/chr22 (Table(49691433,), shuffle,
> lzo(1)) 'Table for chr22'
> /Human36release/baseAndConservation/chr3 (Table(199501828,), shuffle,
> lzo(1)) 'Table for chr3'
> /Human36release/baseAndConservation/chr4 (Table(191273064,), shuffle,
> lzo(1)) 'Table for chr4'
> /Human36release/baseAndConservation/chr5 (Table(180857867,), shuffle,
> lzo(1)) 'Table for chr5'
> /Human36release/baseAndConservation/chr6 (Table(170899993,), shuffle,
> lzo(1)) 'Table for chr6'
> /Human36release/baseAndConservation/chr7 (Table(158821425,), shuffle,
> lzo(1)) 'Table for chr7'
> /Human36release/baseAndConservation/chr8 (Table(146274827,), shuffle,
> lzo(1)) 'Table for chr8'
> /Human36release/baseAndConservation/chr9 (Table(140273253,), shuffle,
> lzo(1)) 'Table for chr9'
> /Human36release/baseAndConservation/chrX (Table(154913755,), shuffle,
> lzo(1)) 'Table for chrX'
> /Human36release/baseAndConservation/chrY (Table(57772955,), shuffle,
> lzo(1)) 'Table for chrY'
>
> As you can see, this is obviously quite a large h5 file (roughly 35 GB).
>
> The problem is that I don't think I'm retrieving data from this as fast as
> I can.  What I'd like to do is retrieve the information for a set of
> chromosomal regions (basically positions).
>
> My current method for doing this is as follows (for now, ignore the fact
> that it sorts the data by chromosome; this is for future attempts to speed
> things up):
>
>
> def getDNAconservationMulti(hdf5FileLoc,hdf5FileName,pathToData,
> justDNA=False,complib="lzo",complevel=1,shuffle=True,fletcher32=False):
>     """
>     A factory function, returns a function that given a set of chromosomal
>     coordinates will return their DNA sequence and conservation values.
>     """
>     h5file = openFile(os.path.join(hdf5FileLoc, hdf5FileName), mode="r",
>                       filters=Filters(complevel=complevel, complib=complib,
>                                       shuffle=shuffle, fletcher32=fletcher32))
>     #  Get a dict of pointers to chromosome nodes
>     tablePointers = {}
>     for group in h5file.walkGroups(pathToData):
>         for table in h5file.listNodes(group, classname='Table'):
>             tablePointers[table.name] = table
>
>
> ############################################################################
>     #  Processing Function
>     def getData(queries,ITEM_SPACER="\t",CONSERVATION_SPACER=" "):
>         """
>         Given a set of regions in the format Chromosome\tstart\tend,
>         stored in an array, split the batch query into separate
>         chromosomes and then get their data from the PyTables file.
>         The information is returned in the original order so that it
>         is easy to append to the original query data.
>         """
>         #  Store each query with its position so we can restore the
>         #  original ordering later
>         queryWithPos = []
>         for pos,query in enumerate(queries):
>             queryWithPos.append(query + "\t" + str(pos))
>         #  For ordering queries
>         chromosomeKey = make_key_by_chromosome()
>         sortedQueries = sorted(queryWithPos,key=chromosomeKey)
>         unsortedReturnData = []
>         for q in sortedQueries:
>             chromosome,start,end,pos = q.split("\t")
>             dnaS = chromosome + ":" + start + "-" + end
>             start,end = int(start),int(end)
>             (bases, phastPr, phastMa, phastVe,
>              phyloPr, phyloMa, phyloVe) = [""]*7
>             for i in tablePointers[chromosome][start:end+1]:
>                 bases = bases + i["base"]
>                 phastPr = phastPr + CONSERVATION_SPACER + i["phastPrimate"]
>                 phastMa = phastMa + CONSERVATION_SPACER + i["phastMammal"]
>                 phastVe = phastVe + CONSERVATION_SPACER + i["phastVertebrate"]
>                 phyloPr = phyloPr + CONSERVATION_SPACER + i["phyloPrimate"]
>                 phyloMa = phyloMa + CONSERVATION_SPACER + i["phyloMammal"]
>                 phyloVe = phyloVe + CONSERVATION_SPACER + i["phyloVertebrate"]
>             #  Trim (reassign; a bare map() would discard the stripped copies)
>             (phastPr, phastMa, phastVe, phyloPr, phyloMa, phyloVe) = [
>                 x.strip(CONSERVATION_SPACER) for x in
>                 (phastPr, phastMa, phastVe, phyloPr, phyloMa, phyloVe)]
>             #  Join data together
>             dnaS = (dnaS + ITEM_SPACER + bases + ITEM_SPACER +
>                     ITEM_SPACER.join([phastPr, phastMa, phastVe,
>                                       phyloPr, phyloMa, phyloVe]))
>
>             unsortedReturnData.append(pos + ITEM_SPACER + dnaS)
>         #  Reorder return data back to original order
>         returnData = []
>         naturalSortKey = make_key_embedded_numbers()
>         for q in sorted(unsortedReturnData,key=naturalSortKey):
>             cols = q.split(ITEM_SPACER)
>             #  Strip off column used for ordering
>             returnData.append(ITEM_SPACER.join(cols[1:]))
>
>         return returnData
>     return getData
>
>
>
> Currently this method takes around 30 seconds (on a fairly fast computer)
> to retrieve the 7 values for 1000 regions of length 1000, i.e. 1 million
> locations and 7 million individual values.  Is this approach the fastest
> that I can expect?  The real problem is that I'll often want to retrieve
> this information for many (100,000) small regions, which would take
> roughly 50 minutes.  Am I expecting the impossible in wanting this to run
> faster?  If so I'll move on and worry about other problems.
>
> Any help would be greatly appreciated.
>
> M
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure
> contains a definitive record of customers, application performance,
> security threats, fraudulent activity, and more. Splunk takes this
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
