Re: [Pytables-users] Improving the query efficiency

Anthony Scopatz Fri, 16 Mar 2012 13:20:14 -0700

Hello Sree,

Sorry for the slow response.


On Thu, Mar 15, 2012 at 10:56 PM, sreeaurovindh viswanathan <
sreeaurovi...@gmail.com> wrote:

> Hi,
>
> I have created five tables in a Hdf5 file.I have created index during the
> creation of the file.I have about 140 million records in my postgresql
> database.I am trying to divide it into 20 hdf5 chunks.The problem is that i
> have one master table which has relationships with other tables.
>

As a rule, joining is always expensive.  (It is expensive in SQL as well.)
 A more HDF-ish way of doing things would be to throw all of the data in a
single large table and not have the master table if you don't need it.


> After I insert a record into the master table i have to verify whether
> there exists a record in the child table with the key that is present in
> the master table.If it exists i have to ignore them.Otherwise I have to
> insert them.I have written the code for the same which is given below. I
> believe the bottleneck is with respect to the Pytable query that i have
> written.It parses the entire set of records in order get if the id
>  exists.I would like to terminate the querying process after i get the
> first occourence of the id and i do not know how to do it.kindly  help me
> on this
>

You can use the slice syntax on where(),
http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.where,
ie the start, stop, and step keywords, to make a sliding search.  Such a
search will query in smaller chunks and would quit after the first chuck
with a hit.  For example for chunk sizes of 10000:

i = 0
csize = 10000
query = []
while 0 == len(query):
     query = [row for row in table.where("a =- b", start=i*csize,
stop=(i+1)*csize + 1)]
     i +=1
query = query[0]

This might need some tuning in terms of how large csize should be based on
how large your table is.  But this should be faster on average.  You could
also use more sophisticated search mechanisms if the location of a query is
related to that of queries before it in any way.


> The quertNecess is the list that i populate after querying the entire
> pytable.Please suggest me on how to optimize the performance.Also can you
> please highlight whether auto indexing will happen when each time a record
> is inserted .
>

Yes it should if autoIndex on the table itself is True:
http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.autoIndex

Be Well
Anthony


>
> A note on current performance:
> we have a computer with core i7 processor and 8 GB of RAM.All the 8
> threads run at full capacity with about 7.15 GB of RAM .It has written
> about 1736340(approx) including all tables after 28 hrs.I have started all
> 20 python scripts running in parallel to fill the tables.
>
> Thanks
> sree aurovindh V
>
>
> The below is the table structure:
>
> class adSuggester(IsDescription):
>     trId = UInt64Col(pos=0)
>     click=UInt16Col(pos=1)
>     queryId=UInt32Col(pos=8)
>
> class queryToken(IsDescription):
>     qId=UInt32Col()
>     qTok=UInt32Col()
>
> table.cols.queryId.createIndex()
>
>  squrery="qId=="+str(trainVals[8])
>         queryNecess=[row['qId'] for row in queryTable.where(squrery)]
>          if not queryNecess:
>             selectQueryTr="select query_token from kdd.query_tokens where
> query_id="
>             selectQueryTr+=str(trainVals[8])
>             cur.execute(selectQueryTr)
>             allQueryTokens=cur.fetchall()  # db quering on the postgres
> and gets all the values.
>             for queryT in allQueryTokens: # insert into pytables
>                 queryToken['qId']=trainVals[8]
>                 queryToken['qTok']=queryT[0]
>                 queryToken.append()
>
>
> ------------------------------------------------------------------------------
> This SF email is sponsosred by:
> Try Windows Azure free for 90 days Click Here
> http://p.sf.net/sfu/sfd2d-msazure
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>

------------------------------------------------------------------------------
This SF email is sponsosred by:
Try Windows Azure free for 90 days Click Here 
http://p.sf.net/sfu/sfd2d-msazure

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Re: [Pytables-users] Improving the query efficiency

Reply via email to