@Robert thanks for your comments :) Breaking the contents of the entire listing into a list of keywords and persisting that list is something I've been considering. I'm hoping to get some more ideas, but your points have definitely given me a good start!
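To make the "break the listing into a list" idea concrete, here is a minimal sketch of how the keyword list might be built. It is only an illustration under assumptions: `STOP_WORDS`, `tokenize` and `build_keywords` are hypothetical names, the stop-word set is just the three examples Robert mentions, and prefixing each token with its field is one possible way to keep name matches separate from address matches.

```python
import re

# Hypothetical stop-word set; a real one would be derived from the actual data.
STOP_WORDS = {"inc", "llc", "company"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keywords(name, address):
    """Return a sorted, de-duplicated keyword list for one listing.

    Each token is prefixed with the field it came from, so a match in the
    business name is counted separately from a match in the address.
    """
    keywords = set()
    for prefix, text in (("name", name), ("addr", address)):
        for token in tokenize(text):
            if token not in STOP_WORDS:
                keywords.add("%s:%s" % (prefix, token))
    return sorted(keywords)


keywords = build_keywords(
    "Spacely Space Sprockets",
    "Ring 325, Satellite 63, Outer Space, Galaxy X271",
)
```

The resulting list of strings is the kind of value that could be persisted in a StringListProperty, as suggested in the quoted reply below.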
On Sep 4, 11:22 pm, Robert Kluin <[email protected]> wrote:
> Hi Nischal,
>   You could do something like this on App Engine, but I think it might
> take a bit of thought to get it working well.  For a very basic
> implementation the keyword list could be handled using a
> StringListProperty.  However, based on your data size, you are going to
> have many common tokens that result in a _lot_ of bogus matches for
> about every business.
>
>   That could be addressed in several ways.  The first thing I would do
> is decide how to build a good keyword list.  For instance, do not
> include extremely common words, such as "inc", "llc", "company," in
> your keyword list.  You may also want to consider using composites of
> some components, such as street and city.  Or at least street and
> state/province/country.  A business being on Main St in Chicago IL and
> another on Main St in Houston TX would not be a good indication of a
> possible relationship; two businesses on Main St in Chicago IL have a
> much higher chance of being related.
>
>   The second thing I would do is use some type of statistics to
> identify the best set of words to use when looking for duplicates.  A
> very simple implementation of this could be a list of business counts
> by keyword.  So first build the new business' token list, then
> identify the best keywords, then grab matches based on the most unique
> keywords with other matches.  There are better statistical classifiers
> / cluster analysis methods you could use as well, but simple counts
> may be a good starting point.
>
> Robert
>
> On Sat, Sep 4, 2010 at 01:08, nischalshetty <[email protected]> wrote:
> > Hi,
> >
> > I'll list what I'm planning to do and need your help in determining if
> > Google Appengine is a good option for this. If yes, a pointer on how
> > to go about building it would help.
> >
> > I have millions of business records. New businesses are added
> > every day. Every time a new Business is added, we need to determine if
> > the particular business already exists. We query our database and
> > search for businesses with matching keywords as entered by the user.
> > The query is on multiple columns and we return the best matches based
> > on the number of tokens that match.
> >
> > Example:
> >
> > Existing information :
> >
> > Listing 1 :
> >
> > Business Name : Spacely Space Sprockets
> > Address: Ring 325, Satellite 63, Outer Space, Galaxy X271
> >
> > Listing 2 :
> >
> > Business Name : Fred Flintstone Flasks
> > Address: #456, Bedrock, Stone Cave, Earth
> >
> > Consider my database has the above mentioned records. Now, a user
> > comes to add a new listing and he enters :
> >
> > Business Name: Space Ventura Quentin Tarantino
> > Address: God Father Street, Kill Stone, Outer Mafia, Folsom Prison
> >
> > Now, my search would see that the new record has matches in the
> > existing listing 1 and listing 2.
> >
> > In Listing 1, the 'Business Name' column matches one of the keywords
> > ('space') in the newly entered business name. The 'Address' columns of
> > both Listing 1 as well as Listing 2 have one match each (listing 1 has
> > 'outer' while listing 2 has 'stone') in the newly added listing.
> >
> > Since Listing 1 has 2 matches in the newly entered data, Listing 1
> > would be displayed above Listing 2 as a duplicate suggestion.
> >
> > This is what I want to do. Please remember the data would be in the
> > range of 10 to 15 million records to start with and hope to reach 50
> > million over a period of time. Your help would be greatly appreciated.
> > Sorry about the long post!
> >
> > -Nischal
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Google App Engine" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group
> > at http://groups.google.com/group/google-appengine?hl=en.
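The ranking in the quoted example, combined with the "simple counts by keyword" suggestion, could be sketched roughly as follows. This is an in-memory illustration only, not a datastore implementation: `build_index`, `suggest_duplicates` and `max_keywords` are hypothetical names, the keyword lists are written out by hand with a field prefix, and "rarest keyword first" is just one simple reading of "most unique keywords".

```python
from collections import Counter, defaultdict


def build_index(listings):
    """Map each keyword to the set of listing ids that contain it."""
    index = defaultdict(set)
    for listing_id, keywords in listings.items():
        for keyword in keywords:
            index[keyword].add(listing_id)
    return index


def suggest_duplicates(new_keywords, index, max_keywords=5):
    """Rank existing listings by how many of the rarest new keywords they share."""
    # Keep only keywords already present in the index, rarest first.
    known = [k for k in set(new_keywords) if k in index]
    known.sort(key=lambda k: (len(index[k]), k))
    scores = Counter()
    for keyword in known[:max_keywords]:
        for listing_id in index[keyword]:
            scores[listing_id] += 1
    # Highest match count first; ties are broken by listing id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hand-written keyword lists for the two existing listings in the example.
listings = {
    "listing1": ["name:spacely", "name:space", "name:sprockets",
                 "addr:ring", "addr:325", "addr:satellite", "addr:63",
                 "addr:outer", "addr:space", "addr:galaxy", "addr:x271"],
    "listing2": ["name:fred", "name:flintstone", "name:flasks",
                 "addr:456", "addr:bedrock", "addr:stone", "addr:cave",
                 "addr:earth"],
}
new_keywords = ["name:space", "name:ventura", "name:quentin", "name:tarantino",
                "addr:god", "addr:father", "addr:street", "addr:kill",
                "addr:stone", "addr:outer", "addr:mafia", "addr:folsom",
                "addr:prison"]

print(suggest_duplicates(new_keywords, build_index(listings)))
# prints [('listing1', 2), ('listing2', 1)]
```

Listing 1 comes out on top with two matches ('space' in the name, 'outer' in the address) and Listing 2 follows with one ('stone'), which is the order described in the example. At 10 to 50 million records the per-keyword counts would have to be stored and updated separately rather than computed from an in-memory index like this.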
