Hi Matt,

Yes, you're right - if you want to avoid the high CPU cap, and your
bulkload is sufficiently large, you'll need to spread it out over
multiple days.
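A minimal sketch of what pre-splitting the big link-table CSV into daily chunks could look like (plain Python run offline before uploading; the round-robin split and the chunk-file naming are just one way to do it, not part of any App Engine tool):

```python
import csv

def split_rows(rows, num_chunks):
    """Deal rows round-robin into num_chunks lists, so each day's
    upload is roughly the same size and stays under the daily CPU cap."""
    chunks = [[] for _ in range(num_chunks)]
    for i, row in enumerate(rows):
        chunks[i % num_chunks].append(row)
    return chunks

def write_chunks(src_path, num_chunks):
    """Split src_path (a CSV with a header row) into numbered chunk
    files, each carrying a copy of the original header."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunks = split_rows(list(reader), num_chunks)
    for day, chunk in enumerate(chunks, start=1):
        with open("%s.day%d.csv" % (src_path, day), "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)
```

You'd then point the vanilla bulk loader at one chunk file per day.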
-Nick Johnson

On Mon, Jun 1, 2009 at 8:27 AM, RainbowCrane <[email protected]> wrote:
>
> Denormalizing is a good idea for performance of queries as well,
> actually. I can't think of a reason now that I'd want to query by
> nutrient, unless it's something like "search for low carb foods",
> "search for low fat foods", etc. In that case it may be easiest/best
> to pull out those specific fields into separate columns and index
> those columns.
>
> I think I'll step back my approach slightly and forget about loading
> all nutrients (such as riboflavin), and only load the nutrients I care
> about (such as fat, carbs, etc). That would likely greatly reduce the
> number of rows in the nutrients table. I think there are 7K or so
> rows in the food table, and 500K in the nutrient:food link table. 80
> rows of data per food is a lot.
>
> A question on the CPU cap: even spreading this out over a longer time
> frame would still hit the CPU cap, wouldn't it, unless I spread it
> over multiple days? I'm assuming the CPU hrs to load 1 row are
> relatively constant, so 500K rows take approx. the same total CPU hrs
> regardless of the speed/number of threads I use to load?
>
> Thanks for the suggestion.
>
> Matt
>
> On Jun 1, 11:15 am, "Nick Johnson (Google)" <[email protected]> wrote:
>> Hi Matt,
>>
>> First, you might want to give some thought to denormalizing the data.
>> For example, I presume the list of nutrients for each food is fairly
>> small; you could merge the join table into the food entity and
>> represent it as a ListProperty. Whether or not you can do this depends
>> on the sort of queries you expect to execute - for example, if the
>> join table has an amount, and you want to do queries like "every food
>> with at least 10% RDA niacin", then this approach may not be best. If
>> you want specific advice, you could link us to the dataset and
>> describe the sort of queries you expect to make over it.
>>
>> As far as bulk loading goes, doing it slower so you don't go over your
>> CPU cap is probably the best bet. A few hours to load a dataset that
>> you'll use for an extended period isn't too bad a ratio, after all.
>> Your other option, as you point out, is to increase your cap just for
>> this. You can always reduce the cap or entirely disable billing later
>> if you wish.
>>
>> -Nick Johnson
>>
>> On Mon, Jun 1, 2009 at 3:50 AM, RainbowCrane <[email protected]> wrote:
>>
>> > Hi,
>>
>> > I've searched this and other app engine groups, as well as doing some
>> > general googling, and I haven't found a solution, so I'm posting here.
>> > I'm writing an app to provide a web service API on top of the free
>> > USDA nutrition database, and it's a fairly large data set - a few of
>> > the tables have 500K rows due to the many-to-many relationships
>> > between foods and nutrients. Any suggestions for a more efficient way
>> > to get this data into the database than the vanilla bulk loader?
>> > That's taking hours to complete and running up against my CPU limit.
>> > The data is separated into CSV files by table, and the relationships
>> > between tables in the CSVs are foreign key strings that make it
>> > straightforward to generate a db.Key for the relationship.
>>
>> > I know I can buy more CPU, but that seems a little goofy since this is
>> > just the initial data load, and, unless my app becomes extremely
>> > popular, I'm likely not going to hit the limit again. If nothing
>> > else, I suppose I could split the data set and just do this over
>> > multiple days, though if I ever have to load the data again due to a
>> > schema change or something, that's a serious annoyance.
>>
>> > I do want to use Python for this. It's been long enough since I've
>> > used Java that there'd be a learning curve to start back up with it,
>> > and I like Python.
>>
>> > Thanks,
>> > Matt

You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en
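The ListProperty denormalization discussed in the thread above could be sketched as an offline preprocessing step over the CSVs, something like the following (the column names `food_id`, `nutrient_id`, and `description` are guesses at the USDA layout, not the real schema; on App Engine the merged list would then be stored as a `StringListProperty` or `db.ListProperty` on the food model, and `key_name` turned into a `db.Key` at upload time):

```python
from collections import defaultdict

def denormalize(food_rows, link_rows):
    """Merge the food:nutrient link table into each food record, so
    the nutrient ids ride along as one list per food instead of
    80-odd join rows per food in a separate table."""
    nutrients_by_food = defaultdict(list)
    for link in link_rows:
        nutrients_by_food[link["food_id"]].append(link["nutrient_id"])
    merged = []
    for food in food_rows:
        merged.append({
            "key_name": food["food_id"],   # foreign key string -> entity key name
            "description": food["description"],
            "nutrient_ids": nutrients_by_food[food["food_id"]],
        })
    return merged
```

This collapses the 500K-row link table into the ~7K food entities, which is where most of the bulkload CPU is going; the trade-off, as noted above, is that per-link attributes like an amount no longer have an obvious home.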
