Hi,

I have created five tables in an HDF5 file, and I created the index when I
created the file. I have about 140 million records in my PostgreSQL
database, and I am trying to divide them into 20 HDF5 chunks. The problem is
that I have one master table which has relationships with other tables.
After I insert a record into the master table, I have to check whether the
child table already contains a record with the key that is present in the
master table. If it exists, I have to skip it; otherwise I have to insert
it. I have written code for this, which is given below. I believe the
bottleneck is the PyTables query that I have written: it scans the entire
set of records to find out whether the id exists. I would like to terminate
the query as soon as I get the first occurrence of the id, and I do not know
how to do that. Kindly help me with this.

queryNecess is the list that I populate after querying the entire PyTables
table. Please suggest how I can optimize the performance. Also, could you
please clarify whether the index is updated automatically each time a record
is inserted?

A note on current performance:
We have a computer with a Core i7 processor and 8 GB of RAM. All 8 threads
run at full capacity, using about 7.15 GB of RAM. After 28 hours it has
written approximately 1,736,340 records across all tables. I started all 20
Python scripts in parallel to fill the tables.

Thanks
sree aurovindh V


Below is the table structure:

from tables import IsDescription, UInt16Col, UInt32Col, UInt64Col

class adSuggester(IsDescription):
    trId = UInt64Col(pos=0)
    click = UInt16Col(pos=1)
    queryId = UInt32Col(pos=8)

class queryToken(IsDescription):
    qId = UInt32Col()
    qTok = UInt32Col()

table.cols.queryId.createIndex()

squrery = "qId==" + str(trainVals[8])
queryNecess = [row['qId'] for row in queryTable.where(squrery)]
if not queryNecess:
    selectQueryTr = "select query_token from kdd.query_tokens where query_id="
    selectQueryTr += str(trainVals[8])
    cur.execute(selectQueryTr)
    allQueryTokens = cur.fetchall()  # query Postgres and fetch all the values
    for queryT in allQueryTokens:  # insert into PyTables
        queryToken['qId'] = trainVals[8]
        queryToken['qTok'] = queryT[0]
        queryToken.append()
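
To make the "stop at the first occurrence" idea concrete, here is a minimal
sketch of the pattern in plain Python. The helper names has_match and scan
are hypothetical and are not part of PyTables; scan only stands in for a
table scan so that the example runs on its own. Since queryTable.where(...)
returns an iterator, the assumption is that it could be passed to has_match
in place of scan, so that iteration stops after the first matching row
instead of building the full list.

```python
def has_match(rows):
    """Return True as soon as the iterator yields one row, then stop."""
    for _ in rows:
        return True   # first hit: leave without consuming the rest
    return False      # iterator was exhausted without any hit


def scan(values, wanted, counter):
    """Stand-in for a table scan: yields matches and counts rows visited."""
    for v in values:
        counter[0] += 1
        if v == wanted:
            yield v


visited = [0]
found = has_match(scan([5, 7, 7, 9, 7], 7, visited))
# Only the rows up to and including the first match were visited.
print(found, visited[0])

# Hypothetical use with the code above (assumption, not tested here):
#     if not has_match(queryTable.where(squrery)):
#         ... fetch the tokens from Postgres and append them ...
```

This only avoids reading matches after the first one; whether the lookup
itself is fast still depends on the qId column of the child table being
indexed.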