Hi Matt,

Yes, you're right - if you want to avoid the high CPU cap, and your
bulkload is sufficiently large, you'll need to spread it out over
multiple days.
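A minimal sketch of what pre-splitting the big link-table CSV into daily chunks could look like (plain Python run offline before uploading; the round-robin split and the chunk-file naming are just one way to do it, not part of any App Engine tool):

```python
import csv

def split_rows(rows, num_chunks):
    """Deal rows round-robin into num_chunks lists, so each day's
    upload is roughly the same size and stays under the daily CPU cap."""
    chunks = [[] for _ in range(num_chunks)]
    for i, row in enumerate(rows):
        chunks[i % num_chunks].append(row)
    return chunks

def write_chunks(src_path, num_chunks):
    """Split src_path (a CSV with a header row) into numbered chunk
    files, each carrying a copy of the original header."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunks = split_rows(list(reader), num_chunks)
    for day, chunk in enumerate(chunks, start=1):
        with open("%s.day%d.csv" % (src_path, day), "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)
```

You'd then point the vanilla bulk loader at one chunk file per day.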
-Nick Johnson

On Mon, Jun 1, 2009 at 8:27 AM, RainbowCrane <[email protected]> wrote:
>
> Denormalizing is a good idea for performance of queries as well,
> actually. I can't think of a reason now that I'd want to query by
> nutrient, unless it's something like "search for low carb foods",
> "search for low fat foods", etc. In that case it may be easiest/best
> to pull out those specific fields into separate columns and index
> those columns.
>
> I think I'll step back my approach slightly and forget about loading
> all nutrients (such as riboflavin), and only load the nutrients I care
> about (such as fat, carbs, etc). That would likely greatly reduce the
> number of rows in the nutrients table. I think there are 7K or so
> rows in the food table, and 500K in the nutrient:food link table. 80
> rows of data per food is a lot.
>
> A question on the CPU cap: even spreading this out over a longer time
> frame would still hit the CPU cap, wouldn't it, unless I spread it
> over multiple days? I'm assuming the CPU hrs to load 1 row are
> relatively constant, so 500K rows take approx. the same total CPU hrs
> regardless of the speed/number of threads I use to load?
>
> Thanks for the suggestion.
>
> Matt
>
> On Jun 1, 11:15 am, "Nick Johnson (Google)" <[email protected]> wrote:
>> Hi Matt,
>>
>> First, you might want to give some thought to denormalizing the data.
>> For example, I presume the list of nutrients for each food is fairly
>> small; you could merge the join table into the food entity and
>> represent it as a ListProperty. Whether or not you can do this depends
>> on the sort of queries you expect to execute - for example, if the
>> join table has an amount, and you want to do queries like "every food
>> with at least 10% RDA niacin", then this approach may not be best. If
>> you want specific advice, you could link us to the dataset and
>> describe the sort of queries you expect to make over it.
>>
>> As far as bulk loading goes, doing it slower so you don't go over your
>> CPU cap is probably the best bet. A few hours to load a dataset that
>> you'll use for an extended period isn't too bad a ratio, after all.
>> Your other option, as you point out, is to increase your cap just for
>> this. You can always reduce the cap or entirely disable billing later
>> if you wish.
>>
>> -Nick Johnson
>>
>> On Mon, Jun 1, 2009 at 3:50 AM, RainbowCrane <[email protected]> wrote:
>>
>> > Hi,
>>
>> > I've searched this and other app engine groups, as well as doing some
>> > general googling, and I haven't found a solution, so I'm posting here.
>> > I'm writing an app to provide a web service API on top of the free
>> > USDA nutrition database, and it's a fairly large data set - a few of
>> > the tables have 500K rows due to the many-to-many relationships
>> > between foods and nutrients. Any suggestions for a more efficient way
>> > to get this data into the database than the vanilla bulk loader?
>> > That's taking hours to complete and running up against my CPU limit.
>> > The data is separated into CSV files by table, and the relationships
>> > between tables in the CSVs are foreign key strings that make it
>> > straightforward to generate a db.Key for the relationship.
>>
>> > I know I can buy more CPU, but that seems a little goofy since this is
>> > just the initial data load, and, unless my app becomes extremely
>> > popular, I'm likely not going to hit the limit again. If nothing
>> > else, I suppose I could split the data set and just do this over
>> > multiple days, though if I ever have to load the data again due to a
>> > schema change or something, that's a serious annoyance.
>>
>> > I do want to use Python for this. It's been long enough since I've
>> > used Java that there'd be a learning curve to start back up with it,
>> > and I like Python.
>>
>> > Thanks,
>> > Matt

You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en
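The ListProperty denormalization discussed in the thread above could be sketched as an offline preprocessing step over the CSVs, something like the following (the column names `food_id`, `nutrient_id`, and `description` are guesses at the USDA layout, not the real schema; on App Engine the merged list would then be stored as a `StringListProperty` or `db.ListProperty` on the food model, and `key_name` turned into a `db.Key` at upload time):

```python
from collections import defaultdict

def denormalize(food_rows, link_rows):
    """Merge the food:nutrient link table into each food record, so
    the nutrient ids ride along as one list per food instead of
    80-odd join rows per food in a separate table."""
    nutrients_by_food = defaultdict(list)
    for link in link_rows:
        nutrients_by_food[link["food_id"]].append(link["nutrient_id"])
    merged = []
    for food in food_rows:
        merged.append({
            "key_name": food["food_id"],   # foreign key string -> entity key name
            "description": food["description"],
            "nutrient_ids": nutrients_by_food[food["food_id"]],
        })
    return merged
```

This collapses the 500K-row link table into the ~7K food entities, which is where most of the bulkload CPU is going; the trade-off, as noted above, is that per-link attributes like an amount no longer have an obvious home.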
