Re: [Pytables-users] Improving the query efficiency

sreeaurovindh viswanathan Fri, 16 Mar 2012 22:14:11 -0700

Hi Anthony,

Thanks for the response.But I have some difficulty in throwing everything
to a single table and require your help for the same.


Please consider the following :
 The table is in the following structure.

1) adSuggester table:
    trId  long
    click int
    queryId long
2)query table
   queryid long
   querytoken long.

As per my problem, a query id will have many query tokens.So the a sample
of the values in the query table will have  (1,234),(1,235),(1,236),(1,237)
and queryid is not a primary key here.Basically i mean to say that i need a
Variable size array inside a table.

1) Atleast as per my (limited)knowledge with  Pytables Variable array is
not possible to be kept in the table and the alternate would be to go for a
fixed sized array.Is fixed array an efficient way .If i have  to use a
variable array how should i reference it for every row in the table.I am
afraid on the quering speed.

2) Also i have the problem in accessing the columns of the fixed array in
tables.If i use the following syntax

In table (pytable)declaration :    ADCcount  = UInt16Col(shape=(4))   with
the table name as particle.
and then if i access the element as   particle['ADCcount'][1] = 6 .The
value of 6 is not stored in the pytable at all. However it compiles and
runs without any errors.please help me on this also.

3) I am new to NumPy also but I have  a question regarding the same. After
this I am planning to do all my mathematical (and statistical analysis)
using numPy. In that case will a conversion be necessary or how should i
have the structure of the table (in Pytables) so that i don't encounter
those problems.

Sorry for the long list of questions.Kindly help me for  the same.

Thanks you
Sree aurovindh V




On Sat, Mar 17, 2012 at 1:49 AM, Anthony Scopatz <scop...@gmail.com> wrote:

> Hello Sree,
>
> Sorry for the slow response.
>
> On Thu, Mar 15, 2012 at 10:56 PM, sreeaurovindh viswanathan <
> sreeaurovi...@gmail.com> wrote:
>
>> Hi,
>>
>> I have created five tables in a Hdf5 file.I have created index during the
>> creation of the file.I have about 140 million records in my postgresql
>> database.I am trying to divide it into 20 hdf5 chunks.The problem is that i
>> have one master table which has relationships with other tables.
>>
>
> As a rule, joining is always expensive.  (It is expensive in SQL as well.)
>  A more HDF-ish way of doing things would be to throw all of the data in a
> single large table and not have the master table if you don't need it.
>
>
>> After I insert a record into the master table i have to verify whether
>> there exists a record in the child table with the key that is present in
>> the master table.If it exists i have to ignore them.Otherwise I have to
>> insert them.I have written the code for the same which is given below. I
>> believe the bottleneck is with respect to the Pytable query that i have
>> written.It parses the entire set of records in order get if the id
>>  exists.I would like to terminate the querying process after i get the
>> first occourence of the id and i do not know how to do it.kindly  help me
>> on this
>>
>
> You can use the slice syntax on where(),
> http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.where,
> ie the start, stop, and step keywords, to make a sliding search.  Such a
> search will query in smaller chunks and would quit after the first chuck
> with a hit.  For example for chunk sizes of 10000:
>
> i = 0
> csize = 10000
> query = []
> while 0 == len(query):
>      query = [row for row in table.where("a =- b", start=i*csize,
> stop=(i+1)*csize + 1)]
>      i +=1
> query = query[0]
>
> This might need some tuning in terms of how large csize should be based on
> how large your table is.  But this should be faster on average.  You could
> also use more sophisticated search mechanisms if the location of a query is
> related to that of queries before it in any way.
>
>
>> The quertNecess is the list that i populate after querying the entire
>> pytable.Please suggest me on how to optimize the performance.Also can you
>> please highlight whether auto indexing will happen when each time a record
>> is inserted .
>>
>
> Yes it should if autoIndex on the table itself is True:
> http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.autoIndex
>
> Be Well
> Anthony
>
>
>>
>> A note on current performance:
>> we have a computer with core i7 processor and 8 GB of RAM.All the 8
>> threads run at full capacity with about 7.15 GB of RAM .It has written
>> about 1736340(approx) including all tables after 28 hrs.I have started all
>> 20 python scripts running in parallel to fill the tables.
>>
>> Thanks
>> sree aurovindh V
>>
>>
>> The below is the table structure:
>>
>> class adSuggester(IsDescription):
>>     trId = UInt64Col(pos=0)
>>     click=UInt16Col(pos=1)
>>     queryId=UInt32Col(pos=8)
>>
>> class queryToken(IsDescription):
>>     qId=UInt32Col()
>>     qTok=UInt32Col()
>>
>> table.cols.queryId.createIndex()
>>
>>  squrery="qId=="+str(trainVals[8])
>>         queryNecess=[row['qId'] for row in queryTable.where(squrery)]
>>          if not queryNecess:
>>             selectQueryTr="select query_token from kdd.query_tokens where
>> query_id="
>>             selectQueryTr+=str(trainVals[8])
>>             cur.execute(selectQueryTr)
>>             allQueryTokens=cur.fetchall()  # db quering on the postgres
>> and gets all the values.
>>             for queryT in allQueryTokens: # insert into pytables
>>                 queryToken['qId']=trainVals[8]
>>                 queryToken['qTok']=queryT[0]
>>                 queryToken.append()
>>
>>
>> ------------------------------------------------------------------------------
>> This SF email is sponsosred by:
>> Try Windows Azure free for 90 days Click Here
>> http://p.sf.net/sfu/sfd2d-msazure
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>>
>>
>
>
> ------------------------------------------------------------------------------
> This SF email is sponsosred by:
> Try Windows Azure free for 90 days Click Here
> http://p.sf.net/sfu/sfd2d-msazure
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>

------------------------------------------------------------------------------
This SF email is sponsosred by:
Try Windows Azure free for 90 days Click Here 
http://p.sf.net/sfu/sfd2d-msazure

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Re: [Pytables-users] Improving the query efficiency

Reply via email to