On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting?  The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that 
holds the probability array.

I'm inserting document id and ad id pairs to show that this ad is now linked to 
this document. The mapping table has a primary key on the serial document id.

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold.  But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like 
to optimize the process before I apply the threshold to cut down the number of 
rows.

> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
> anything.  Are you sure the parameter shouldn't be <=0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind if this process took a
> couple minutes, as long as it didn't interfere with other things going
> on in the database.  So you probably need to describe what those other
> things going on in the database are.

Those other things are ad serving which boils down to a lookup of ad ids linked 
to the document. 

This is a lookup from the mapping table using the primary key that goes on at 
the same time as a large number of <doc,ad> mappings are being inserted into 
the same table.

Documents are uploaded into the system at a rate of 10k per day, one every few 
seconds. I wish I could get rid of storing the <doc,ad> mapping, as that table 
is going to grow huge once each new ad matches tens or hundreds of thousands of 
documents. 
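To put rough numbers on that growth, here is a back-of-envelope sketch. The ad count and matches-per-ad figures are hypothetical illustrations, not measurements from the actual system; only the 10k/day upload rate comes from the thread:

```python
# Back-of-envelope estimate of <doc,ad> mapping-table growth.
# All figures except docs_per_day are assumptions for illustration.
docs_per_day = 10_000        # stated upload rate
ads = 1_000                  # hypothetical number of ads in the system
matches_per_ad = 100_000     # hypothetical: each ad matches ~100k documents

rows = ads * matches_per_ad  # total mapping rows at steady state
print(rows)                  # → 100000000

# Assuming ~40 bytes per narrow heap tuple (before index overhead),
# that is already several gigabytes of raw tuple data.
bytes_per_row = 40
print(rows * bytes_per_row / 1e9)  # → 4.0 (GB)
```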

I don't think I can do the matching when serving an ad, though, as I will still 
need to scan millions of probability vectors (one per doc) to calculate the 
distance between current document and existing ads.

Then again, the number of ads in the system will always be a fraction of the 
number of documents, so perhaps the matching of documents to ads can be done at 
runtime.

> Also, you might have a data correctness problem.  If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it. 

As new ads are entered, they need to be matched with existing documents. 

As new documents are entered, they need to be matched with existing ads. 

Both ads and docs are represented by probability vectors of 150 floats so it's 
the same distance calculation.
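The per-pair calculation itself is cheap; the cost comes from repeating it millions of times. A minimal sketch of what one comparison looks like, assuming Euclidean distance (the actual metric isn't shown in this thread, only a `tab.distance` column):

```python
import math

def distance(doc_vec, ad_vec):
    """Euclidean distance between two 150-float probability vectors.
    The metric is an assumption; the thread only references tab.distance."""
    return math.sqrt(sum((d - a) ** 2 for d, a in zip(doc_vec, ad_vec)))

# Matching one new ad against every stored document vector would then be
# (pseudocode; all_doc_vectors and insert_mapping are hypothetical names):
#
# for doc_id, doc_vec in all_doc_vectors:        # millions of rows
#     if distance(doc_vec, ad_vec) <= threshold:
#         insert_mapping(doc_id, ad_id)
```

The same function covers both directions (new ad vs. existing docs, new doc vs. existing ads), since the comparison is symmetric.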

> then if new rows are added to each table during overlapping transactions, the
> new ads against new docs comparison will not actually happen.  You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!


--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------




-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
