Denormalizing is a good idea for query performance as well, actually.
I can't think of a reason right now that I'd want to query by
nutrient, unless it's something like "search for low carb foods",
"search for low fat foods", etc.  In that case it may be easiest/best
to pull those specific fields out into separate columns and index
those columns.
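To make that concrete, here's a rough sketch of what the denormalized entity and a "low carb" query might look like. This is plain Python standing in for an App Engine model; on App Engine the fields would be indexed db.FloatProperty attributes on a db.Model, and the field names and sample values here are my own invention, not from the USDA schema.

```python
# Sketch of a denormalized Food record: the handful of nutrients we
# query on become top-level fields instead of rows in a link table.
# On App Engine these would be db.FloatProperty attributes on a
# db.Model, which the datastore indexes automatically.

class Food(object):
    def __init__(self, name, fat_g, carbs_g, protein_g):
        self.name = name
        self.fat_g = fat_g          # grams of fat per 100g (illustrative)
        self.carbs_g = carbs_g      # grams of carbohydrate per 100g
        self.protein_g = protein_g  # grams of protein per 100g

def low_carb(foods, max_carbs_g=10.0):
    # Stand-in for a datastore query along the lines of
    #   Food.all().filter('carbs_g <=', max_carbs_g)
    return [f for f in foods if f.carbs_g <= max_carbs_g]

foods = [
    Food('butter', 81.1, 0.1, 0.9),
    Food('white rice', 0.3, 28.2, 2.7),
    Food('chicken breast', 3.6, 0.0, 31.0),
]
print([f.name for f in low_carb(foods)])
```

The point is just that once the interesting nutrients are columns on the food entity, the "low carb" search becomes a single indexed filter instead of a join through the 500K-row link table.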

I think I'll scale back my approach slightly and forget about loading
all nutrients (such as riboflavin), only loading the nutrients I care
about (such as fat, carbs, etc.).  That should greatly reduce the
number of rows in the nutrients table.  I think there are 7K or so
rows in the food table, and 500K in the nutrient:food link table; 80
rows of data per food is a lot.
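Trimming the link-table CSV down to a whitelist of nutrient IDs before loading could be as simple as the sketch below. The assumed column layout (food_id, nutrient_id, amount) and the nutrient numbers (203 protein, 204 fat, 205 carbs, in the USDA SR numbering) should be checked against the actual files; treat them as assumptions.

```python
import csv
import io

# Nutrient numbers I care about (203 = protein, 204 = fat, 205 = carbs
# in the USDA SR numbering -- verify against the real nutrient
# definition file before relying on these).
WANTED = {'203', '204', '205'}

def filter_rows(infile, wanted=WANTED):
    # Keep only link-table rows whose nutrient ID is in the whitelist.
    # Assumed layout: food_id, nutrient_id, amount.
    reader = csv.reader(infile)
    return [row for row in reader if row[1] in wanted]

# Tiny in-memory example in place of the real 500K-row CSV.
sample = io.StringIO(
    "01001,203,0.85\n"
    "01001,204,81.11\n"
    "01001,291,0.0\n"     # fiber -- dropped
    "01001,405,0.034\n"   # riboflavin -- dropped
)
print(filter_rows(sample))
```

Filtering locally like this, before the bulk loader ever runs, cuts the CPU spent on App Engine rather than just spreading it out.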

A question on the CPU cap: even spreading this out over a longer time
frame would still hit the CPU cap, wouldn't it, unless I spread it
over multiple days?  I'm assuming the CPU hours to load one row are
relatively constant, so loading 500K rows takes approximately the same
total CPU hours regardless of the speed or number of threads I use?
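My back-of-envelope version of that reasoning, with made-up numbers (both the per-row cost and the daily cap here are assumptions, not measured figures):

```python
# If CPU cost per row is roughly constant, total CPU hours depend only
# on the row count, not on how fast or how parallel the load is --
# parallelism changes wall-clock time, not total CPU.
rows = 500000
cpu_sec_per_row = 0.05    # assumed; measure on a small batch first
daily_cap_hours = 6.5     # assumed daily CPU cap

total_cpu_hours = rows * cpu_sec_per_row / 3600.0
days_needed = total_cpu_hours / daily_cap_hours
print('total CPU hours: %.1f' % total_cpu_hours)
print('days under the cap: %.1f' % days_needed)
```

So with these guesses the load would just barely spill over one day's quota, which is why the per-row cost is the number worth measuring first.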

Thanks for the suggestion.

Matt

On Jun 1, 11:15 am, "Nick Johnson (Google)" <[email protected]>
wrote:
> Hi Matt,
>
> First, you might want to give some thought to denormalizing the data.
> For example, I presume the list of nutrients for each food is fairly
> small; you could merge the join table into the food entity, and
> represent it as a ListProperty. Whether or not you can do this depends
> on the sort of queries you expect to execute - for example, if the
> join table has an amount, and you want to do queries like "every food
> with at least 10% RDA Niacin", then this approach may not be best. If
> you want specific advice, you could link us to the dataset and
> describe the sort of queries you expect to make over it.
>
> As far as bulk loading goes, doing it slower so you don't go over your
> CPU cap is probably the best bet. A few hours to load a dataset that
> you'll use for an extended period isn't too bad a ratio, after all.
> Your other option, as you point out, is to increase your cap just for
> this. You can always reduce the cap or entirely disable billing later
> if you wish.
>
> -Nick Johnson
>
> On Mon, Jun 1, 2009 at 3:50 AM, RainbowCrane <[email protected]> wrote:
>
> > Hi,
>
> > I've searched this and other app engine groups as well as general
> > googling and I haven't found a solution, so posting here.  I'm writing
> > an app to provide a web service API on top of the free USDA
> > nutrition database, and it's a fairly large data set - a few of the
> > tables have 500K rows due to the many-to-many relationships between
> > food and nutrients.  Any suggestions for a more efficient way to get
> > this data into the database than the vanilla bulk loader?  That's
> > taking hours to complete and running up against my CPU limit.  The
> > data is separated into CSV files by table, and the relationships
> > between tables in the CSVs are foreign key strings that make it
> > straightforward to generate a db.Key for the relationship.
>
> > I know I can buy more CPU, but that seems a little goofy since this is
> > just the initial data load, and, unless my app becomes extremely
> > popular, I'm likely not going to hit the limit again.  If nothing
> > else, I suppose I could split the data set and just do this over
> > multiple days, though if I ever have to load the data again due to a
> > schema change or something that's a serious annoyance.
>
> > I do want to use Python for this.  It's been long enough since I've
> > used Java that there'd be a learning curve to start back up with it,
> > and I like Python.
>
> > Thanks,
> > Matt
>
>