@Robert, thanks for your comments :) Breaking the contents of the
entire listing into a list of tokens and persisting that list is
something I've been considering. I'm hoping to get some more ideas, but
your points have definitely given me a good start!

On Sep 4, 11:22 pm, Robert Kluin <[email protected]> wrote:
> Hi Nischal,
>   You could do something like this on App Engine, but I think it might
> take a bit of thought to get it working well.  For a very basic
> implementation the keyword list could be handled using a
> StringListProperty.  However, based on your data size, you are going to
> have many common tokens that result in a _lot_ of bogus matches for
> just about every business.
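
A minimal sketch of the basic keyword list described above, in plain
Python. The function name build_keywords and the tokenizing rule
(lowercase, split on anything that is not a letter or digit) are
assumptions for illustration, not something from the thread. The
resulting list is the kind of value that could be stored in a
StringListProperty.

```python
import re


def build_keywords(name, address):
    """Return the unique lowercase tokens of a listing, in first-seen order."""
    text = "%s %s" % (name, address)
    # Assumed tokenizing rule: runs of letters/digits are tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


keywords = build_keywords(
    "Spacely Space Sprockets",
    "Ring 325, Satellite 63, Outer Space, Galaxy X271",
)
```

Note that "space" appears in both the name and the address but is kept
only once, which is what makes the list usable as a set of match keys.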
>
>   That could be addressed in several ways.  The first thing I would do
> is decide how to build a good keyword list.  For instance, do not
> include extremely common words, such as "inc", "llc", or "company", in
> your keyword list.  You may also want to consider using composites of
> some components, such as street and city.  Or at least street and
> state/province/country.  A business being on Main St in Chicago IL and
> another on Main St in Houston TX would not be a good indication of a
> possible relationship; two businesses on Main St in Chicago IL have a
> much higher chance of being related.
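
The two ideas in the paragraph above (dropping very common words and
using composite location keys) could be sketched roughly like this. The
stop-word set contains only the three words mentioned in the message;
the helper names and the "|" / "_" separators are hypothetical choices.

```python
import re

# Only the examples given in the message; a real list would be longer.
STOP_WORDS = {"inc", "llc", "company"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def name_keywords(name):
    """Tokens of the business name with the stop words removed."""
    return [t for t in tokenize(name) if t not in STOP_WORDS]


def location_keyword(street, city, region):
    """One composite token, so a street only matches within the same city and region."""
    return "|".join("_".join(tokenize(part)) for part in (street, city, region))


chicago = location_keyword("Main St", "Chicago", "IL")
houston = location_keyword("Main St", "Houston", "TX")
```

With this, two businesses on Main St in Chicago share the same composite
token, while Main St in Houston produces a different one and so does not
count as a match.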
>
>   The second thing I would do is use some type of statistics to
> identify the best set of words to use when looking for duplicates.  A
> very simple implementation of this could be a list of business counts
> by keyword.  So first build the new business's token list, then
> identify the best keywords, then fetch candidates using the rarest
> keywords that still match other businesses.  There are better statistical classifiers
> / cluster analysis methods you could use as well, but simple counts
> may be a good starting point.
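
A rough sketch of the "business counts by keyword" idea, assuming the
counts are already available as a mapping from keyword to the number of
businesses containing it. The function name, the limit parameter, and
the sample counts are all made up for illustration.

```python
from collections import Counter


def rarest_keywords(tokens, keyword_counts, limit=3):
    """Pick up to `limit` tokens that match the fewest existing businesses.

    Tokens with a count of zero are skipped, since they cannot match
    any existing record.
    """
    candidates = [t for t in set(tokens) if keyword_counts.get(t, 0) > 0]
    # Rarest first; ties are broken alphabetically for a stable order.
    candidates.sort(key=lambda t: (keyword_counts[t], t))
    return candidates[:limit]


# Illustrative counts only, not real data.
keyword_counts = Counter({"main": 50000, "st": 90000, "sprockets": 12, "spacely": 3})
best = rarest_keywords(["spacely", "sprockets", "main", "st", "zzz"], keyword_counts, limit=2)
```

Querying only on the few rarest keywords keeps the candidate set small,
which matters more as the number of records grows.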
>
> Robert
>
>
>
> On Sat, Sep 4, 2010 at 01:08, nischalshetty <[email protected]> wrote:
> > Hi,
>
> > I'll list what I'm planning to do and need your help in determining if
> > Google App Engine is a good option for this. If yes, a pointer on how
> > to go about building it would help.
>
> > I have millions of business records. New businesses are added
> > every day. Every time a new business is added, we need to determine if
> > the particular business already exists. We query our database and
> > search for businesses with matching keywords as entered by the user.
> > The query is on multiple columns and we return the best matches based
> > on the number of tokens that match.
>
> > Example:
>
> > Existing information :
>
> > Listing 1 :
>
> > Business Name : Spacely Space Sprockets
> > Address: Ring 325, Satellite 63, Outer Space, Galaxy X271
>
> > Listing 2 :
>
> > Business Name : Fred Flintstone Flasks
> > Address: #456, Bedrock, Stone Cave, Earth
>
> > Suppose my database has the above-mentioned records. Now, a user
> > comes to add a new listing and he enters :
>
> > Business Name: Space Ventura Quentin Tarantino
> > Address: God Father Street, Kill Stone, Outer Mafia, Folsom Prison
>
> > Now, my search would see that the new record has matches in the
> > existing listing 1 and listing 2.
>
> > In Listing 1, the 'Business Name' column matches one of the keywords
> > ('space') in the newly entered business name. The 'Address' columns of
> > both Listing 1 as well as Listing 2 have one match each (listing 1 has
> > 'outer' while listing 2 has 'stone') in the newly added listing.
>
> > Since Listing 1 has 2 matches in the newly entered data, Listing 1
> > would be displayed above Listing 2 as a duplicate suggestion.
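
The ranking described in this example can be sketched as follows,
counting shared tokens column by column. This is an in-memory
illustration only: the dict layout, the function names, and the
tokenizing rule are assumptions, and scanning every listing like this
would not scale to millions of records without an index.

```python
import re


def tokens(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(new, existing):
    """Number of tokens shared between matching columns."""
    return sum(
        len(tokens(new[col]) & tokens(existing[col]))
        for col in ("name", "address")
    )


def rank_duplicates(new, listings):
    """Return (score, id) pairs for listings with a match, best first."""
    scored = [(score(new, listing), listing["id"]) for listing in listings]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


listings = [
    {"id": 1, "name": "Spacely Space Sprockets",
     "address": "Ring 325, Satellite 63, Outer Space, Galaxy X271"},
    {"id": 2, "name": "Fred Flintstone Flasks",
     "address": "#456, Bedrock, Stone Cave, Earth"},
]
new_listing = {
    "name": "Space Ventura Quentin Tarantino",
    "address": "God Father Street, Kill Stone, Outer Mafia, Folsom Prison",
}
ranked = rank_duplicates(new_listing, listings)
```

Here Listing 1 scores 2 ('space' in the name, 'outer' in the address)
and Listing 2 scores 1 ('stone' in the address), so Listing 1 is
suggested first. Whole-token matching means "Flintstone" does not count
as a match for "stone".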
>
> > This is what I want to do. Please remember the data would be in the
> > range of 10 to 15 million records to start with, and we hope to reach 50
> > million over a period of time. Your help would be greatly appreciated.
> > Sorry about the long post!
>
> > -Nischal
>
> > --
> > You received this message because you are subscribed to the Google Groups 
> > "Google App Engine" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to 
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/google-appengine?hl=en.
