Hello Sree aurovindh V,

On Sat, Mar 17, 2012 at 12:13 AM, sreeaurovindh viswanathan <
sreeaurovi...@gmail.com> wrote:

[snip]


> In my problem, a query id will have many query tokens, so a sample of the
> values in the query table would be (1,234), (1,235), (1,236), (1,237),
> and queryId is not a primary key here. Basically, I mean that I need a
> variable-size array inside a table.
>
> 1) At least to my (limited) knowledge of PyTables, a variable-size array
> cannot be kept in a table, and the alternative would be a fixed-size
> array. Is a fixed-size array an efficient approach? If I have to use a
> variable-size array, how should I reference it for every row in the
> table? I am worried about the querying speed.
>

That is correct.  This is not a currently supported feature, though it is
often requested.  Please file an issue on GitHub if it is something that you
would like to see.

On the efficiency of using fixed-size arrays instead, this depends on how
different your max size is from your average size.  If the max size is much
larger, it is inefficient in terms of space, and probably of search as well.
If the max and average sizes are close, then this option is at least worth
looking into.

Another option to munge around this is to store string paths in the table
instead of arrays.  These paths point to actual VLArrays elsewhere in the
HDF5 file.  This is effectively doing a join on this column.
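
For concreteness, here is a minimal sketch of that layout.  (The file name,
the /tokens group, and the column names here are all made up for
illustration.)

import tables

class QueryRow(tables.IsDescription):
    qId = tables.UInt32Col()
    tokPath = tables.StringCol(64)  # HDF5 path of this query's token VLArray

h5 = tables.openFile("queries.h5", "w")
qtable = h5.createTable("/", "queries", QueryRow)
tokgroup = h5.createGroup("/", "tokens")

# One VLArray per query id, holding a variable number of token ids.
vla = h5.createVLArray(tokgroup, "q1", tables.UInt32Atom())
vla.append([234, 235, 236, 237])

row = qtable.row
row['qId'] = 1
row['tokPath'] = vla._v_pathname  # "/tokens/q1"
row.append()
qtable.flush()

# The manual "join": look up the path, then dereference it.
for r in qtable.where("qId == 1"):
    tokens = h5.getNode(r['tokPath'])[0]
h5.close()

Each lookup then costs one extra node open, which is the price of the join.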

> 2) I also have a problem accessing the columns of a fixed-size array in
> tables, when I use the following syntax.
>
> In the table (PyTables) declaration: ADCcount = UInt16Col(shape=(4,)), with
> the table named particle.
> If I then access the element as particle['ADCcount'][1] = 6, the value 6
> is not stored in the PyTable at all. However, it compiles and runs without
> any errors. Please help me with this as well.
>

You probably need to flush() the table.  Rows are buffered in memory and
nothing is written to disk until the buffer is flushed.
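
For reference, the usual write pattern looks roughly like this (just a
sketch; 'table' is your particle table):

import numpy as np

row = table.row               # the Row accessor for the particle table
adc = np.zeros(4, dtype='uint16')
adc[1] = 6
row['ADCcount'] = adc         # assign the whole field; indexing into
                              # row['ADCcount'] may only modify a copy
row.append()                  # stage the row in the I/O buffer
table.flush()                 # write buffered rows to disk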


>
> 3) I am also new to NumPy, but I have a question about it as well.  After
> this I am planning to do all my mathematical (and statistical) analysis
> using NumPy.  In that case, will a conversion be necessary, and how should
> I structure the table (in PyTables) so that I don't run into such
> problems?
>

Tables are actually read into numpy structured arrays,
http://docs.scipy.org/doc/numpy/user/basics.rec.html, so there shouldn't be
any compatibility issues.  If there are, please let us know!
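
For example, reading a table back gives you a structured array that you can
slice by field name (a quick sketch against the queryToken table described
below):

data = table.read()             # a numpy structured array
qids = data['qId']              # one plain uint32 array per field
mean_tok = data['qTok'].mean()  # regular numpy math from here on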

Be Well
Anthony


>
> Sorry for the long list of questions.  Kindly help me with them.
>
> Thank you,
> Sree aurovindh V
>
>
>
>
>
> On Sat, Mar 17, 2012 at 1:49 AM, Anthony Scopatz <scop...@gmail.com> wrote:
>
>> Hello Sree,
>>
>> Sorry for the slow response.
>>
>> On Thu, Mar 15, 2012 at 10:56 PM, sreeaurovindh viswanathan <
>> sreeaurovi...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have created five tables in an HDF5 file, and I created the indexes
>>> during file creation.  I have about 140 million records in my PostgreSQL
>>> database, and I am trying to divide them into 20 HDF5 chunks.  The
>>> problem is that I have one master table which has relationships with the
>>> other tables.
>>>
>>
>> As a rule, joining is always expensive.  (It is expensive in SQL as
>> well.)  A more HDF-ish way of doing things would be to throw all of the
>> data in a single large table and not have the master table if you don't
>> need it.
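>>
>> For instance, a denormalized description along these lines (one row per
>> (query, token) pair; the column names are borrowed from your tables):
>>
>> class Denormalized(IsDescription):
>>     trId = UInt64Col()
>>     click = UInt16Col()
>>     qId = UInt32Col()
>>     qTok = UInt32Col()  # the master columns simply repeat for each token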
>>
>>
>>> After I insert a record into the master table, I have to verify whether
>>> a record already exists in the child table with the key from the master
>>> table.  If it exists I have to ignore it; otherwise I have to insert it.
>>> I have written the code for this, which is given below.  I believe the
>>> bottleneck is the PyTables query that I have written: it scans the
>>> entire set of records to determine whether the id exists.  I would like
>>> to terminate the querying process after the first occurrence of the id,
>>> and I do not know how to do it.  Kindly help me with this.
>>>
>>
>> You can use the slice syntax on where(),
>> http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.where,
>> i.e. the start, stop, and step keywords, to make a sliding search.  Such a
>> search will query in smaller chunks and will quit after the first chunk
>> with a hit.  For example, for a chunk size of 10000:
>>
>> i = 0
>> csize = 10000
>> query = []
>> while 0 == len(query) and i * csize < table.nrows:  # stop at end of table
>>     # Pull out a field value rather than the Row object itself, which
>>     # is reused (and thus invalidated) as the iteration advances.
>>     query = [row['a'] for row in table.where("a == b", start=i*csize,
>>                                              stop=(i+1)*csize)]
>>     i += 1
>> first_match = query[0] if query else None
>>
>> This might need some tuning in terms of how large csize should be based
>> on how large your table is.  But this should be faster on average.  You
>> could also use more sophisticated search mechanisms if the location of a
>> query is related to that of queries before it in any way.
>>
>>
>>> queryNecess is the list that I populate after querying the entire
>>> PyTable.  Please suggest how I can optimize the performance.  Also, can
>>> you please clarify whether auto-indexing happens each time a record is
>>> inserted?
>>>
>>
>> Yes, it should, as long as autoIndex on the table itself is True:
>> http://pytables.github.com/usersguide/libref.html?highlight=index#tables.Table.autoIndex
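>>
>> For bulk loads you can also control this by hand, roughly like so (a
>> sketch, using the qId index from your code below):
>>
>> table.cols.qId.createIndex()  # create the index once
>> table.autoIndex = False       # defer index updates during bulk appends
>> # ... many table.row appends and flushes ...
>> table.flushRowsToIndex()      # then index all pending rows in one pass
>> table.autoIndex = True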
>>
>> Be Well
>> Anthony
>>
>>
>>>
>>> A note on current performance:
>>> We have a computer with a Core i7 processor and 8 GB of RAM.  All 8
>>> threads run at full capacity, using about 7.15 GB of RAM.  It has
>>> written about 1736340 records (approx.), including all tables, after 28
>>> hours.  I have started all 20 Python scripts running in parallel to fill
>>> the tables.
>>>
>>> Thanks
>>> sree aurovindh V
>>>
>>>
>>> Below is the table structure:
>>>
>>> class adSuggester(IsDescription):
>>>     trId = UInt64Col(pos=0)
>>>     click = UInt16Col(pos=1)
>>>     queryId = UInt32Col(pos=8)
>>>
>>> class queryToken(IsDescription):
>>>     qId = UInt32Col()
>>>     qTok = UInt32Col()
>>>
>>> table.cols.queryId.createIndex()
>>>
>>> squery = "qId==" + str(trainVals[8])
>>> queryNecess = [row['qId'] for row in queryTable.where(squery)]
>>> if not queryNecess:
>>>     selectQueryTr = "select query_token from kdd.query_tokens where query_id="
>>>     selectQueryTr += str(trainVals[8])
>>>     cur.execute(selectQueryTr)
>>>     allQueryTokens = cur.fetchall()  # query postgres for all token values
>>>     # tokRow is the token table's Row accessor, i.e. queryTable.row
>>>     for queryT in allQueryTokens:  # insert into pytables
>>>         tokRow['qId'] = trainVals[8]
>>>         tokRow['qTok'] = queryT[0]
>>>         tokRow.append()
>>>
>>>
>>
>>
>
>