Last question for a while (I think :-). Is there a straightforward way to purge what I've loaded so far? I've seen how to use the data loader to do deletes, and I can certainly do that, but if there's a quicker way to just purge my database from the dashboard, that's probably better for my quota.
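For reference, if the purge ends up going through the remote API rather than the dashboard, the usual shape is to fetch keys and delete them in fixed-size batches. A minimal sketch of just the batching logic, with `delete_batch` standing in for a bulk-delete call such as `db.delete()`; the 500-key batch size is an assumed per-call datastore limit, not something stated in this thread:

```python
def chunks(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def purge(keys, delete_batch, batch_size=500):
    """Delete every key, `batch_size` at a time.

    `delete_batch` is a stand-in for a bulk-delete call like
    db.delete(); in real use `keys` would come from a keys-only query.
    """
    deleted = 0
    for batch in chunks(keys, batch_size):
        delete_batch(batch)
        deleted += len(batch)
    return deleted
```

Deleting this way still burns CPU quota per entity, so for a 500K-row load the dashboard route, if one exists, would indeed be cheaper.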
Thanks,
Matt

On Jun 1, 11:31 am, "Nick Johnson (Google)" <[email protected]> wrote:
> Hi Matt,
>
> Yes, you're right - if you want to avoid the high CPU cap, and your
> bulkload is sufficiently large, you'll need to spread it out over
> multiple days.
>
> -Nick Johnson
>
> On Mon, Jun 1, 2009 at 8:27 AM, RainbowCrane <[email protected]> wrote:
>
> > Denormalizing is a good idea for performance of queries as well,
> > actually. I can't think of a reason now that I'd want to query by
> > nutrient, unless it's something like "search for low carb foods",
> > "search for low fat foods", etc. In that case it may be easiest/best
> > to pull out those specific fields into separate columns and index
> > those columns.
> >
> > I think I'll step back my approach slightly and forget about loading
> > all nutrients (such as riboflavin), and only load the nutrients I care
> > about (such as fat, carbs, etc). That would likely greatly reduce the
> > number of rows in the nutrients table. I think there are 7K or so
> > rows in the food table, and 500K in the nutrient:food link table. 80
> > rows of data per food is a lot.
> >
> > A question on the CPU cap: even spreading this out over a longer time
> > frame would still hit the CPU cap, wouldn't it, unless I spread it
> > over multiple days? I'm assuming the CPU hrs to load 1 row are
> > relatively constant, so 500K rows takes approx. the same total CPU hrs
> > regardless of the speed/number of threads I use to load?
> >
> > Thanks for the suggestion.
> >
> > Matt
> >
> > On Jun 1, 11:15 am, "Nick Johnson (Google)" <[email protected]>
> > wrote:
> >> Hi Matt,
> >>
> >> First, you might want to give some thought to denormalizing the data.
> >> For example, I presume the list of nutrients for each food is fairly
> >> small; you could merge the join table into the food entity, and
> >> represent it as a ListProperty.
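The merge Nick suggests here, collapsing the nutrient:food join table into a list carried on each food entity, can be sketched in plain Python. The field names are invented for illustration (not the actual USDA schema); on App Engine the merged list is what a ListProperty would store:

```python
def denormalize(foods, join_rows):
    """Fold a (food_id, nutrient_id, amount) join table into a
    per-food nutrient list - the shape a ListProperty would hold.

    `foods` maps food_id -> food name; all names here are
    illustrative, not taken from the real dataset.
    """
    merged = {fid: {"name": name, "nutrients": []}
              for fid, name in foods.items()}
    for fid, nid, amount in join_rows:
        merged[fid]["nutrients"].append((nid, amount))
    return merged
```

With roughly 80 nutrient rows per food, each food entity would carry an 80-element list, which is small enough to be plausible for this approach.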
> >> Whether or not you can do this depends
> >> on the sort of queries you expect to execute - for example, if the
> >> join table has an amount, and you want to do queries like "every food
> >> with at least 10% RDA Niacin", then this approach may not be best. If
> >> you want specific advice, you could link us to the dataset and
> >> describe the sort of queries you expect to make over it.
> >>
> >> As far as bulk loading goes, doing it slower so you don't go over your
> >> CPU cap is probably the best bet. A few hours to load a dataset that
> >> you'll use for an extended period isn't too bad a ratio, after all.
> >> Your other option, as you point out, is to increase your cap just for
> >> this. You can always reduce the cap or entirely disable billing later
> >> if you wish.
> >>
> >> -Nick Johnson
> >>
> >> On Mon, Jun 1, 2009 at 3:50 AM, RainbowCrane <[email protected]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I've searched this and other app engine groups as well as general
> >> > googling and I haven't found a solution, so posting here. I'm writing
> >> > an app to provide a web service API on top of the free USDA
> >> > nutrition database, and it's a fairly large data set - a few of the
> >> > tables have 500K rows due to the many-to-many relationships between
> >> > food and nutrients. Any suggestions for a more efficient way to get
> >> > this data into the database than the vanilla bulk loader? That's
> >> > taking hours to complete and running up against my CPU limit. The
> >> > data is separated into CSV files by table, and the relationships
> >> > between tables in the CSVs are foreign key strings that make it
> >> > straightforward to generate a db.Key for the relationship.
> >> >
> >> > I know I can buy more CPU, but that seems a little goofy since this is
> >> > just the initial data load, and, unless my app becomes extremely
> >> > popular, I'm likely not going to hit the limit again.
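The CSV wiring Matt describes, foreign-key strings that map directly to datastore keys, might look like the sketch below. A `(kind, key_name)` tuple stands in for the real `db.Key.from_path(kind, key_name)` call so the example runs standalone, and the column names are guesses at the dataset's layout, not the actual USDA headers:

```python
import csv
import io

def load_links(csv_text):
    """Parse nutrient:food link rows from CSV and attach pseudo-keys.

    Each (kind, key_name) tuple is a stand-in for
    db.Key.from_path(kind, key_name) in the App Engine Python SDK.
    """
    links = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        links.append({
            "food": ("Food", row["food_id"]),
            "nutrient": ("Nutrient", row["nutrient_id"]),
            "amount": float(row["amount"]),
        })
    return links
```

Because the keys can be built purely from the CSV strings, no lookup queries are needed during the load, which keeps the per-row CPU cost down.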
> >> > If nothing
> >> > else, I suppose I could split the data set and just do this over
> >> > multiple days, though if I ever have to load the data again due to a
> >> > schema change or something, that's a serious annoyance.
> >> >
> >> > I do want to use Python for this. It's been long enough since I've
> >> > used Java that there'd be a learning curve to start back up with it,
> >> > and I like Python.
> >> >
> >> > Thanks,
> >> > Matt

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---
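Matt's assumption upthread, that CPU cost per row is roughly constant, reduces the day-splitting question to simple arithmetic: total CPU is fixed regardless of load speed, so the only lever is how many daily quota windows the load spans. The per-row cost and daily cap below are invented numbers; only the shape of the calculation matters:

```python
import math

def days_needed(total_rows, cpu_sec_per_row, daily_cap_cpu_hours):
    """Quota windows needed to load everything, assuming a
    constant CPU cost per row (all inputs are illustrative)."""
    total_cpu_hours = total_rows * cpu_sec_per_row / 3600.0
    return math.ceil(total_cpu_hours / daily_cap_cpu_hours)
```

For example, 500K rows at a hypothetical 0.05 CPU-seconds per row is about 6.9 CPU-hours in total, so even a slightly smaller daily cap would force the load across a second day.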
